Course Code: pysparkmlgsg
Duration: 21 hours
Prerequisites:
-
Overview:
-
Course Outline:
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
- Differences between RDD and DataFrame APIs and when to use each approach
- Creating and configuring SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
- Implementing advanced operations such as window functions, handling timestamps and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
- Using optimisation techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behaviour
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts and basic testing strategies
Practical Outcomes
- Ability to work autonomously with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines