Course Code: pysparkmlgsg
Duration: 21 hours
Prerequisites:

Participants should have the following background:

  • Basic Python programming knowledge, including working with functions, data structures and libraries
  • Fundamental understanding of data analysis concepts such as datasets, transformations and aggregations
  • Basic knowledge of SQL and relational data concepts
  • Introductory understanding of Machine Learning concepts such as training datasets, features and evaluation metrics
  • Familiarity with command-line environments and basic software development practices (recommended)

Experience with Pandas, NumPy or similar data processing libraries is helpful but not mandatory.

Overview:

This training provides a practical introduction to building scalable data processing and Machine Learning workflows using PySpark. Participants learn how Apache Spark operates within modern Big Data ecosystems and how to efficiently process large datasets using distributed computing principles.

The course moves gradually from Spark architecture and DataFrame operations to more advanced topics such as feature engineering, Machine Learning model training and building end-to-end ML pipelines with Spark MLlib. Participants also explore performance optimisation techniques, model evaluation strategies and enterprise practices for deploying Machine Learning workflows at scale.

Through practical exercises and scenarios inspired by real-world use cases, participants learn how to design efficient data pipelines, prepare datasets for Machine Learning and build distributed ML models that can handle the large data volumes common in enterprise environments.

By the end of the training, participants will understand how to integrate PySpark into modern data platforms and apply scalable Machine Learning techniques in production-oriented environments.

Course Outline:

PySpark & Machine Learning 

Module 1: Big Data & Spark Foundations

  • Overview of the Big Data ecosystem and the role of Spark in modern data platforms
  • Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
  • Differences between RDD and DataFrame APIs and when to use each approach
  • Creating and configuring SparkSession and understanding application configuration fundamentals
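
The lazy evaluation idea listed in Module 1 can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: `LazyDataset` and its methods are hypothetical names invented for this illustration. Transformations only record a step in a plan, and nothing runs until an action (`collect`) is called, which is roughly how Spark defers work until it can plan the whole job.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, rows, steps=()):
        self._rows = list(rows)
        self._steps = tuple(steps)  # recorded plan; nothing has been executed yet

    def map(self, fn):
        # Transformation: returns a new dataset with one more planned step.
        return LazyDataset(self._rows, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the plan.
        return LazyDataset(self._rows, self._steps + (("filter", pred),))

    def explain(self):
        # Human-readable view of the recorded plan.
        return " -> ".join(["source"] + [kind for kind, _ in self._steps])

    def collect(self):
        # Action: the recorded steps are executed only now.
        out = self._rows
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(r) for r in out]
            else:
                out = [r for r in out if fn(r)]
        return out


calls = []  # records every invocation of double() so we can see when work happens


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
plan = ds.explain()
calls_before_action = len(calls)  # still zero: only a plan exists so far
result = ds.collect()
```

Because the plan is known before anything runs, an engine built this way has the chance to reorder or combine steps before execution, which this toy version does not attempt.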

Module 2: PySpark DataFrames

  • Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
  • Implementing advanced operations such as window functions, handling timestamps and working with nested data
  • Applying data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
  • Using optimisation techniques including broadcast joins and execution plan analysis
  • Efficient processing of large datasets and best practices for scalable data workflows
  • Understanding schema evolution and modern storage formats used in enterprise environments
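
The broadcast join mentioned in Module 3 can be sketched in plain Python. This is an illustration under simplifying assumptions, not Spark's implementation: partitions are modelled as lists, the small table as a dictionary, and `broadcast_join` is a hypothetical helper. The point it shows is that when one side is small enough to copy to every partition, each partition can join locally and the large side never has to be shuffled.

```python
# Large side: already split into partitions (lists of (key, event) rows).
large_partitions = [
    [(1, "click"), (2, "view")],
    [(1, "view"), (3, "click")],
]

# Small side: a lookup table that fits comfortably in memory.
small_table = {1: "DE", 2: "FR"}


def broadcast_join(partitions, lookup):
    """Inner-join each partition against its own copy of the small table."""
    joined = []
    for part in partitions:
        local = dict(lookup)  # simulates shipping one copy to each partition
        joined.append([(k, v, local[k]) for k, v in part if k in local])
    return joined


joined_partitions = broadcast_join(large_partitions, small_table)
```

Rows stay in the partition they started in, and key 3 is dropped because it has no match in the small table (inner-join semantics).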

Module 4: Feature Engineering at Scale

  • Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results in distributed Machine Learning workflows
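
The Estimator/Transformer pattern from Module 5 can be sketched in a few lines of plain Python. The classes `Scaler` and `ScalerModel` below are hypothetical and only mirror the shape of the pattern: an estimator learns parameters in `fit()` and returns a fitted model, and that model is a transformer that applies the learned parameters in `transform()`. MLlib's `StandardScaler` and `StandardScalerModel` follow the same division of roles, but operate on DataFrames rather than Python lists.

```python
from statistics import mean, pstdev


class ScalerModel:
    """Transformer: applies parameters that were learned during fit()."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]


class Scaler:
    """Estimator: learns parameters from data and returns a fitted model."""

    def fit(self, values):
        sigma = pstdev(values)
        # Guard against constant columns to avoid dividing by zero.
        return ScalerModel(mean(values), sigma if sigma > 0 else 1.0)


train = [2.0, 4.0, 6.0]
model = Scaler().fit(train)            # statistics come from training data only
scaled = model.transform([4.0, 8.0])   # the same statistics are reused on new data
```

Keeping the learned statistics inside the fitted model is what allows the same preprocessing to be replayed on validation, test and production data without leaking information from them.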

Module 6: End-to-End ML Pipelines

  • Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
  • Applying train/validation/test split strategies
  • Performing cross-validation and hyperparameter tuning using grid search and random search
  • Structuring reproducible Machine Learning experiments
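
The difference between grid search and random search in Module 6 comes down to how candidate parameter sets are generated. The sketch below uses only the standard library; the search space and the helper names `grid_candidates` and `random_candidates` are assumptions made for illustration. In MLlib, `ParamGridBuilder` covers the grid case, while the random variant shown here is hand-rolled.

```python
import itertools
import random

# Illustrative search space; the values are arbitrary examples.
param_space = {
    "regParam": [0.01, 0.1, 1.0],
    "maxIter": [10, 50],
}


def grid_candidates(space):
    """Every combination of the listed values (exhaustive)."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_candidates(space, n, seed):
    """n independent draws from the space; duplicates are possible."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


grid = grid_candidates(param_space)                      # 3 * 2 = 6 candidates
sampled = random_candidates(param_space, n=4, seed=42)   # fixed budget of 4
```

Grid size grows multiplicatively with every added parameter, whereas the random budget `n` stays fixed, which is the usual argument for random search on larger spaces.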

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying appropriate evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting and making practical model selection decisions
  • Interpreting feature importance and understanding model behaviour
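
Why the choice of metric in Module 7 matters can be shown with a small worked example. This is a plain-Python sketch with made-up labels, and `classification_metrics` is a hypothetical helper: on an imbalanced dataset, a model that always predicts the majority class reaches high accuracy while its recall on the minority class is zero.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that never predicts the positive class

metrics = classification_metrics(y_true, always_negative)
```

Here accuracy is 0.95 even though not a single positive case is found, which is why precision, recall or F1 are usually reported alongside accuracy for imbalanced classification problems.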

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle in enterprise environments
  • Introduction to versioning, experiment tracking concepts and basic testing strategies

Practical Outcome

  • Ability to work autonomously with PySpark
  • Ability to process large datasets efficiently
  • Ability to perform feature engineering at scale
  • Ability to build scalable Machine Learning pipelines

Sites Published:

United Arab Emirates - PySpark and Machine Learning

Qatar - PySpark and Machine Learning

Egypt - PySpark and Machine Learning

Saudi Arabia - PySpark and Machine Learning

South Africa - PySpark and Machine Learning

Brasil - PySpark e Machine Learning

Canada - PySpark and Machine Learning

中国 - PySpark与机器学习

香港 - PySpark and Machine Learning

澳門 - PySpark and Machine Learning

台灣 - PySpark與機器學習

USA - PySpark and Machine Learning

Österreich - PySpark und Machine Learning

Schweiz - PySpark und Machine Learning

Deutschland - PySpark und Machine Learning

Czech Republic - PySpark a strojové učení

Denmark - PySpark and Machine Learning

Estonia - PySpark and Machine Learning

Finland - PySpark and Machine Learning

Greece - PySpark και Μηχανική Μάθησης

Magyarország - PySpark és gépi tanulás

Ireland - PySpark and Machine Learning

Luxembourg - PySpark and Machine Learning

Latvia - PySpark and Machine Learning

España - PySpark y Aprendizaje Automático

Italia - PySpark e Machine Learning

Lithuania - PySpark and Machine Learning

Nederland - PySpark en Machine Learning

Norway - PySpark og Maskinlæring

Portugal - PySpark e Machine Learning

România - PySpark și Machine Learning

Sverige - PySpark och Maskininlärning

Türkiye - PySpark ve Makine Öğrenimi

Malta - PySpark and Machine Learning

Belgique - PySpark et Machine Learning

France - PySpark et Machine Learning

日本 - PySpark と機械学習

Australia - PySpark and Machine Learning

Malaysia - PySpark and Machine Learning

New Zealand - PySpark and Machine Learning

Philippines - PySpark and Machine Learning

Singapore - PySpark and Machine Learning

Thailand - PySpark and Machine Learning

Vietnam - PySpark và Học máy

India - PySpark and Machine Learning

Argentina - PySpark y Aprendizaje Automático

Chile - PySpark y Aprendizaje Automático

Costa Rica - PySpark y Aprendizaje Automático

Ecuador - PySpark y Aprendizaje Automático

Guatemala - PySpark y Aprendizaje Automático

Colombia - PySpark y Aprendizaje Automático

México - PySpark y Aprendizaje Automático

Panama - PySpark y Aprendizaje Automático

Peru - PySpark y Aprendizaje Automático

Uruguay - PySpark y Aprendizaje Automático

Venezuela - PySpark y Aprendizaje Automático

Polska - PySpark i Uczenie Maszynowe

United Kingdom - PySpark and Machine Learning

South Korea - PySpark 및 머신러닝

Pakistan - PySpark and Machine Learning

Sri Lanka - PySpark and Machine Learning

Bulgaria - PySpark и машинно обучение

Bolivia - PySpark y Aprendizaje Automático

Indonesia - PySpark and Machine Learning

Kazakhstan - PySpark and Machine Learning

Moldova - PySpark și Machine Learning

Morocco - PySpark and Machine Learning

Tunisia - PySpark and Machine Learning

Kuwait - PySpark and Machine Learning

Oman - PySpark and Machine Learning

Slovakia - PySpark and Machine Learning

Kenya - PySpark and Machine Learning

Nigeria - PySpark and Machine Learning

Botswana - PySpark and Machine Learning

Slovenia - PySpark and Machine Learning

Croatia - PySpark and Machine Learning

Serbia - PySpark and Machine Learning

Bhutan - PySpark and Machine Learning

Nepal - PySpark and Machine Learning

Uzbekistan - PySpark and Machine Learning

US Government - PySpark and Machine Learning