Course Code: pysparkmlgsg
Duration: 21 hours
Prerequisites:

-

Overview:

-

Course Outline:

PySpark & Machine Learning 

Module 1: Big Data & Spark Foundations

  • Overview of the Big Data ecosystem and the role of Spark in modern data platforms
  • Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
  • Differences between RDD and DataFrame APIs and when to use each approach
  • Creating and configuring SparkSession and understanding application configuration fundamentals

Module 2: PySpark DataFrames

  • Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
  • Implementing advanced operations such as window functions, handling timestamps and working with nested data
  • Applying data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
  • Using optimisation techniques including broadcast joins and execution plan analysis
  • Applying best practices for processing large datasets efficiently: sizing partitions appropriately and minimising data movement
  • Understanding schema evolution and modern storage formats used in enterprise environments

Module 4: Feature Engineering at Scale

  • Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results in distributed Machine Learning workflows

Module 6: End-to-End ML Pipelines

  • Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
  • Applying train/validation/test split strategies
  • Performing cross-validation and hyperparameter tuning using grid search and random search
  • Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying appropriate evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting and making practical model selection decisions
  • Interpreting feature importance and understanding model behaviour

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle in enterprise environments
  • Introduction to versioning, experiment tracking concepts and basic testing strategies


Practical Outcome

  • Ability to work autonomously with PySpark
  • Ability to process large datasets efficiently
  • Ability to perform feature engineering at scale
  • Ability to build scalable Machine Learning pipelines