Course Code: bigdasparfordevbesp
Duration: 21 hours
Overview:

Format of the Course

  • Theoretical, Hands-on and Interactive

Programming Language: Scala with Apache Spark core (open source), which is the basis of every Spark distribution

Course Outline:

Apache Spark

Module 1: Introduction (Brief/Revisit/Refresher/Beginner)

  • Big data basics and need for distributed computing
  • Architecture and concepts
    • Scalability options and need
    • Locality of reference (data locality)
    • Fault tolerance via replication
    • Distributed storage and parallel processing need
    • Codecs, compression and data formats
  • Why MapReduce is not enough
  • The role and need of Spark in big data processing
  • In-memory processing vs caching
  • Spark architecture, its components and APIs
  • Understanding Apache Spark core distribution and its alternatives
  • Options to work with Spark - (cloud / on-premises / VMs / containers / K8s)
  • Spark with or without Hadoop and its YARN layer
  • Basic setup and working with a local installation - (local setup without any cluster)
  • Getting to know Scala and its interaction with Spark
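
The local-mode setup covered above can be sketched in a few lines of Scala. This is a minimal sketch, assuming the `spark-sql` dependency is on the classpath (e.g. via sbt); the object and app names are illustrative:

```scala
// Minimal local-mode Spark session - no cluster or Hadoop required.
import org.apache.spark.sql.SparkSession

object LocalSparkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-demo")
      .master("local[*]")   // run locally on all available cores
      .getOrCreate()

    // Sanity check: parallelize a small collection and sum it.
    val sum = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")  // 5050

    spark.stop()
  }
}
```

The same session-building code works unchanged in the interactive `spark-shell`, where a `spark` session is already provided.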

Module 2: Spark internals and working - (Beginner/Intermediate)

  • Spark installation - Standalone cluster (with 1 or 2 nodes) and Spark UI
  • Working with Spark interactive shells - spark-shell (Scala)
    • Knowing about other options
      • PySpark (Python)
      • spark-sql (SQL)
  • Read and write path - interacting with storage/data lakes
  • Understanding Spark APIs (RDDs / Dataframes / Datasets) and working with them
    • RDDs - unstructured and semi-structured data processing
      • RDD operations: Transformations and Actions
      • Single/Pair/Multiple RDDs and operations
      • Understanding Application > Jobs > Stages > Tasks
    • Dataframes - structured data processing
      • SQL API (Dataframes and Datasets) : Transformations and Actions
      • Dataframes, Tables, Views - Loading, Querying and Writing data
      • Handling formats, compression, single/multiple files
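
The RDD and Dataframe topics above can be sketched side by side. A minimal sketch, assuming a local session; the file paths and column names are hypothetical:

```scala
// Contrast of the RDD API (low-level) and the Dataframe/SQL API (structured).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("api-demo").getOrCreate()
val sc = spark.sparkContext

// RDD API: transformations are lazy; actions trigger a job.
val words  = sc.parallelize(Seq("spark", "scala", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // pair-RDD transformation
counts.collect().foreach(println)                       // action

// Dataframe API: structured data, loaded, queried as a view, and written out.
val df = spark.read.option("header", "true").csv("data/input.csv")  // hypothetical path
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
df.write.mode("overwrite").parquet("data/output.parquet")
```

Each `collect`, `show`, or `write` above appears as a separate job in the Spark UI, broken into stages and tasks as described in the application hierarchy.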

(Intermediate - Advanced)

  • Datasets and serialization
  • Operations: Transformations and actions
  • Dataframes and shuffling, exploring UI and DAGs
  • Broadcast variables and accumulators
  • Spark internals: Partitioning, Caching/Persistence, Application hierarchy, Lineage, DAGs, Executors, Parallelism, fault tolerance, shuffling, etc.
  • Developing, Deploying and running packaged Spark applications
  • Spark-submit (packaged applications)
  • Debugging/Configuration changes/Resource allocations
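
The broadcast-variable and accumulator topics above can be sketched as follows. A minimal sketch, assuming a local session; the lookup table and accumulator name are hypothetical:

```scala
// Shared variables: broadcasts (read-only, shipped once per executor)
// and accumulators (write-only on executors, read on the driver).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("shared-vars").getOrCreate()
val sc = spark.sparkContext

// Broadcast: distribute a small lookup table to every executor once.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks add to it; only the driver reads the final value.
val misses = sc.longAccumulator("misses")

val codes = sc.parallelize(Seq("a", "b", "c")).map { k =>
  lookup.value.getOrElse(k, { misses.add(1); -1 })
}
codes.collect()  // the action runs the tasks and applies accumulator updates
println(s"misses = ${misses.value}")
```

Packaged into a JAR (e.g. with sbt), the same application is launched via `spark-submit`, where master URL, deploy mode, and resource allocations are passed as flags or configuration.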

Spark 2.x vs 3.x (similarities, differences, advances, features, backend services)

Module 3: Administering Spark Cluster and applications

  • Monitoring Spark applications and cluster resources
  • Spark critical configurations and tuning Spark/applications for optimized performance

(Advanced) - [Depending on time constraint and pace]

  • Understanding Partitioning and Partitioners
  • Spark UI and important backend services
  • Understanding DAG and TaskScheduler
  • Spark Stream processing (unstructured and structured data)
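
The stream-processing topic above can be sketched with Structured Streaming. A minimal sketch, assuming a local session; the socket source and `localhost:9999` endpoint are demo assumptions (e.g. fed by `nc -lk 9999`):

```scala
// Minimal Structured Streaming word count over a socket source.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stream-demo").getOrCreate()
import spark.implicits._

// Read lines from a local socket (demo source only, not fault tolerant).
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()

// Split into words and maintain a running count.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy($"value").count()

// Print each micro-batch's full result table to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
```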

Identifying and resolving common issues, and applying best practices