Course Code: pysparkmlgsg
Duration: 21 hours
Prerequisites:
-
Overview:
-
Course Outline:
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
- Differences between RDD and DataFrame APIs and when to use each approach
- Creating and configuring SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
- Implementing advanced operations such as window functions, handling timestamps and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
- Using optimisation techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behaviour
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts and basic testing strategies
Practical Outcomes
- Ability to work autonomously with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines