Course Code: bigdasparfordevbesp
Duration: 21 hours
Overview:

Format of the Course

  • Theoretical, Hands-on and Interactive

Programming Language: Scala with Apache Spark core (open source), which is the basis of every Spark distribution

Course Outline:

Apache Spark

Module 1: Introduction (Brief/Revisit/Refresher/Beginner)

  • Big data basics and need for distributed computing
  • Architecture and concepts
    • Scalability options and need
    • Locality of reference (data locality)
    • Fault tolerance via replication
    • Distributed storage and parallel processing need
    • Codecs, compression and data formats
  • Why MapReduce is not enough
  • The role and need of Spark in big data processing
  • In-memory processing vs caching
  • Spark architecture, its components and APIs
  • Understanding Apache Spark core distribution and its alternatives
  • Options to work with Spark - (cloud / on-premises / VMs / containers / K8s)
  • Spark with or without Hadoop and its YARN layer
  • Basic setup and working with a local installation - (local setup without any cluster)
  • Getting to know Scala and its interaction with Spark
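
The local-mode setup covered above can be sketched in a few lines of Scala. This is a minimal sketch, assuming the `spark-sql` dependency is on the classpath (e.g. via sbt); the object and app names are illustrative:

```scala
// Minimal local-mode Spark session - no cluster or Hadoop required.
import org.apache.spark.sql.SparkSession

object LocalSparkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-demo")
      .master("local[*]")   // run locally on all available cores
      .getOrCreate()

    // Sanity check: parallelize a small collection and sum it.
    val sum = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")  // 5050

    spark.stop()
  }
}
```

The same session-building code works unchanged in the interactive `spark-shell`, where a `spark` session is already provided.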

Module 2: Spark internals and working - (Beginner/Intermediate)

  • Spark installation - Standalone cluster (with 1 or 2 nodes) and Spark UI
  • Working with Spark interactive shells - spark-shell (Scala)
    • Knowing about other options
      • PySpark (Python)
      • spark-sql (SQL)
  • Read and write path - interacting with storage/data lakes
  • Understanding Spark APIs (RDDs / Dataframes / Datasets) and working with them
    • RDDs - unstructured and semi-structured data processing
      • RDD operations: Transformations and Actions
      • Single/Pair/Multiple RDDs and operations
      • Understanding Application > Jobs > Stages > Tasks
    • Dataframes - structured data processing
      • SQL API (Dataframes and Datasets) : Transformations and Actions
      • Dataframes, Tables, Views - Loading, Querying and Writing data
      • Handling formats, compression, single/multiple files
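
The RDD and Dataframe topics above can be sketched side by side. A minimal sketch, assuming a local session; the file paths and column names are hypothetical:

```scala
// Contrast of the RDD API (low-level) and the Dataframe/SQL API (structured).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("api-demo").getOrCreate()
val sc = spark.sparkContext

// RDD API: transformations are lazy; actions trigger a job.
val words  = sc.parallelize(Seq("spark", "scala", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // pair-RDD transformation
counts.collect().foreach(println)                       // action

// Dataframe API: structured data, loaded, queried as a view, and written out.
val df = spark.read.option("header", "true").csv("data/input.csv")  // hypothetical path
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
df.write.mode("overwrite").parquet("data/output.parquet")
```

Each `collect`, `show`, or `write` above appears as a separate job in the Spark UI, broken into stages and tasks as described in the application hierarchy.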

(Intermediate - Advanced)

  • Datasets and serialization
  • Operations: Transformations and actions
  • Dataframes and shuffling, exploring UI and DAGs
  • Broadcast variables and accumulators
  • Spark internals: Partitioning, Caching/Persistence, Application hierarchy, Lineage, DAGs, Executors, Parallelism, fault tolerance, shuffling, etc.
  • Developing, Deploying and running packaged Spark applications
  • Spark-submit (packaged applications)
  • Debugging/Configuration changes/Resource allocations
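
The broadcast-variable and accumulator topics above can be sketched as follows. A minimal sketch, assuming a local session; the lookup table and accumulator name are hypothetical:

```scala
// Shared variables: broadcasts (read-only, shipped once per executor)
// and accumulators (write-only on executors, read on the driver).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("shared-vars").getOrCreate()
val sc = spark.sparkContext

// Broadcast: distribute a small lookup table to every executor once.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks add to it; only the driver reads the final value.
val misses = sc.longAccumulator("misses")

val codes = sc.parallelize(Seq("a", "b", "c")).map { k =>
  lookup.value.getOrElse(k, { misses.add(1); -1 })
}
codes.collect()  // the action runs the tasks and applies accumulator updates
println(s"misses = ${misses.value}")
```

Packaged into a JAR (e.g. with sbt), the same application is launched via `spark-submit`, where master URL, deploy mode, and resource allocations are passed as flags or configuration.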

Spark 2.x vs 3.x (similarities, differences, advances, features, backend services)

Module 3: Administering Spark Cluster and applications

  • Monitoring Spark applications and cluster resources
  • Spark critical configurations and tuning Spark/applications for optimized performance

(Advanced) - [Depending on time constraint and pace]

  • Understanding Partitioning and Partitioners
  • Spark UI and important backend services
  • Understanding DAG and TaskScheduler
  • Spark Stream processing (unstructured and structured data)
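
The stream-processing topic above can be sketched with Structured Streaming. A minimal sketch, assuming a local session; the socket source and `localhost:9999` endpoint are demo assumptions (e.g. fed by `nc -lk 9999`):

```scala
// Minimal Structured Streaming word count over a socket source.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stream-demo").getOrCreate()
import spark.implicits._

// Read lines from a local socket (demo source only, not fault tolerant).
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()

// Split into words and maintain a running count.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy($"value").count()

// Print each micro-batch's full result table to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
```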

Identifying and resolving common issues, and applying best practices