Course Code: pysparkmlgsg
Duration: 21 hours
Prerequisites:
Participants should have the following background:
- Basic Python programming knowledge, including working with functions, data structures and libraries
- Fundamental understanding of data analysis concepts such as datasets, transformations and aggregations
- Basic knowledge of SQL and relational data concepts
- Introductory understanding of Machine Learning concepts such as training datasets, features and evaluation metrics
- Familiarity with command line environments and basic software development practices is recommended
- Experience with Pandas, NumPy or similar data processing libraries is helpful but not mandatory
Overview:
This training provides a practical introduction to building scalable data processing and Machine Learning workflows using PySpark. Participants learn how Apache Spark operates within modern Big Data ecosystems and how to efficiently process large datasets using distributed computing principles.
The course moves gradually from Spark architecture and DataFrame operations to advanced topics such as feature engineering, Machine Learning model training and building end-to-end ML pipelines with Spark MLlib. Participants also explore performance optimisation techniques, model evaluation strategies and enterprise practices for deploying Machine Learning workflows at scale.
Through practical exercises and scenarios inspired by real-world use cases, participants learn how to design efficient data pipelines, prepare datasets for Machine Learning and build distributed ML models capable of handling the large data volumes common in enterprise environments.
By the end of the training, participants will understand how to integrate PySpark into modern data platforms and apply scalable Machine Learning techniques in production-oriented environments.
Course Outline:
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
- Differences between RDD and DataFrame APIs and when to use each approach
- Creating and configuring SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
- Implementing advanced operations such as window functions, handling timestamps and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
- Using optimisation techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behaviour
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts and basic testing strategies
Practical Outcome
- Ability to work autonomously with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines