Course Code:
bigdasparfordevbesp
Duration:
21 hours
Overview:
Format of the Course
- Theoretical, Hands-on and Interactive
Programming Language: Scala, with the open-source Apache Spark core distribution, which is the basis of all vendor distributions
Course Outline:
Apache Spark
Module 1: Introduction (Brief/Revisit/Refresher/Beginner)
- Big data basics and need for distributed computing
- Architecture and core concepts
- Scalability: the need and the options
- Locality of reference (data locality)
- Fault tolerance via replication
- The need for distributed storage and parallel processing
- Codecs, compression and data formats
- Why MapReduce is not enough
- The role of Spark in big data processing and why it is needed
- In-memory processing vs caching
- Spark architecture, its components and APIs
- Understanding Apache Spark core distribution and its alternatives
- Options for running Spark (cloud / on-premises / VMs / containers / Kubernetes)
- Spark with or without Hadoop and its YARN layer
- Basic setup and working with a local installation (no cluster required)
- Getting to know Scala and how it interacts with Spark (see the sketch below)
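To ground the local, cluster-free setup described above, here is a minimal sketch of a Scala program driving Spark in local mode. It assumes Spark 3.x with the spark-sql dependency on the classpath; the object and application names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object LocalSparkDemo {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, one worker thread per core -- no cluster needed
    val spark = SparkSession.builder()
      .appName("local-spark-demo")
      .master("local[*]")
      .getOrCreate()

    // A first taste of Scala + Spark: parallelize a local collection and aggregate it
    val rdd = spark.sparkContext.parallelize(1 to 100)
    println(s"sum = ${rdd.sum()}") // 5050

    spark.stop()
  }
}
```

With local[*], everything stays inside one JVM, which is exactly the no-cluster mode this module works in.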
Module 2: Spark internals and working - (Beginner/Intermediate)
- Spark installation: standalone cluster (with 1 or 2 nodes) and the Spark UI
- Working with the Spark interactive shell: spark-shell (Scala)
- Knowing about other options:
  - PySpark (Python)
  - spark-sql (SQL)
- Read and write paths: interacting with storage and data lakes
- Understanding and working with the Spark APIs (RDDs / DataFrames / Datasets) - see the sketch after this list
- RDDs: unstructured and semi-structured data processing
- RDD operations: transformations and actions
- Single/pair/multiple RDDs and their operations
- Understanding Application > Jobs > Stages > Tasks
- DataFrames: structured data processing
- SQL API (DataFrames and Datasets): transformations and actions
- DataFrames, tables, and views: loading, querying, and writing data
- Handling formats, compression, and single/multiple files
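A compact sketch tying together the RDD and DataFrame topics above: lazy transformations versus actions on an RDD, a DataFrame queried through a temporary view, and a compressed Parquet write. It assumes a local SparkSession as in Module 1; the data, view name, and output path are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object RddAndDataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-dataframe-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD API: transformations (flatMap, map, reduceByKey) are lazy;
    // the action (collect) is what triggers a job
    val lines = spark.sparkContext.parallelize(Seq("spark is fast", "spark is lazy"))
    val counts = lines
      .flatMap(_.split("\\s+"))        // transformation
      .map(word => (word, 1))          // transformation -> pair RDD
      .reduceByKey(_ + _)              // transformation (causes a shuffle)
    counts.collect().foreach(println)  // action

    // DataFrame API: structured processing with a schema
    val df = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
    df.createOrReplaceTempView("people")  // expose as a view for SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    // Writing: format and compression are explicit choices
    df.write.mode("overwrite")
      .option("compression", "snappy")    // codec choice, as covered above
      .parquet("/tmp/people.parquet")     // illustrative path

    spark.stop()
  }
}
```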
(Intermediate - Advanced)
- Datasets and serialization
- Operations: transformations and actions
- DataFrames and shuffling: exploring the UI and DAGs
- Broadcast variables and accumulators (demonstrated in the sketch after this list)
- Spark internals: partitioning, caching/persistence, application hierarchy, lineage, DAGs, executors, parallelism, fault tolerance, shuffling, etc.
- Developing, deploying, and running packaged Spark applications
- spark-submit (packaged applications)
- Debugging, configuration changes, and resource allocation
- Spark 2.x vs 3.x (similarities, differences, advances, features, backend services)
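The following sketch shows a packaged application of the kind this module deploys with spark-submit, and it also exercises a broadcast variable and an accumulator from the list above. The package, class, jar path, and master URL in the comment are illustrative, and no master is hard-coded so that spark-submit can supply one.

```scala
package example

// Build into a jar (e.g. with sbt package) and launch with spark-submit:
//   spark-submit \
//     --class example.PackagedApp \
//     --master spark://<master-host>:7077 \
//     --executor-memory 2g \
//     target/scala-2.12/packaged-app_2.12-0.1.jar

import org.apache.spark.sql.SparkSession

object PackagedApp {
  def main(args: Array[String]): Unit = {
    // No .master() here: spark-submit decides where the application runs
    val spark = SparkSession.builder().appName("packaged-app").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: ship a read-only lookup set to every executor once
    val stopWords = sc.broadcast(Set("a", "an", "the"))

    // Accumulator: tasks add to it on executors, only the driver reads the total
    // (inside transformations, updates may be re-applied if a task is retried)
    val skipped = sc.longAccumulator("skipped-stopwords")

    val words = sc.parallelize(Seq("the", "quick", "fox", "a", "spark"))
    val kept = words.filter { w =>
      val isStop = stopWords.value.contains(w)
      if (isStop) skipped.add(1)
      !isStop
    }

    println(s"kept = ${kept.collect().mkString(", ")}")  // action triggers the job
    println(s"skipped ${skipped.value} stopwords")
    spark.stop()
  }
}
```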
Module 3: Administering Spark Cluster and applications
- Monitoring Spark applications and cluster resources
- Critical Spark configurations and tuning Spark and applications for optimal performance (see the configuration sketch below)
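As a starting point for the tuning discussion, a minimal sketch that sets a few commonly tuned Spark properties programmatically. All keys shown are standard Spark configuration properties; the values are illustrative placeholders, not recommendations, since real tuning depends on data volume and cluster resources.

```scala
import org.apache.spark.sql.SparkSession

object TunedSession {
  def main(args: Array[String]): Unit = {
    // Illustrative values only -- real tuning depends on the workload and the cluster
    val spark = SparkSession.builder()
      .appName("tuned-app")
      .config("spark.sql.shuffle.partitions", "200")  // partitions after wide transformations
      .config("spark.executor.memory", "4g")          // per-executor heap
      .config("spark.executor.cores", "4")            // concurrent tasks per executor
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // The same settings can be passed at launch instead:
    //   spark-submit --conf spark.sql.shuffle.partitions=200 ...
    spark.stop()
  }
}
```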
(Advanced) - [depending on time constraints and pace]
- Understanding partitioning and Partitioners
- The Spark UI and important backend services
- Understanding the DAG and the TaskScheduler
- Spark stream processing for unstructured and structured data (see the streaming sketch below)
- Identifying and resolving common issues, and best practices
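To make the stream-processing topic concrete, a minimal Structured Streaming sketch that counts words from a socket source (handy for demos with nc -lk 9999). The host, port, and output mode are illustrative; production jobs would typically read from Kafka or files instead.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-word-count")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Structured Streaming treats the stream as an unbounded DataFrame
    val lines = spark.readStream
      .format("socket")              // demo source; use Kafka or files in practice
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Aggregations require "complete" or "update" output mode
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```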