Course Code: bdatastbesp
Duration: 105 hours
Course Outline:

Apache Hadoop (3.5 days)

Module 1: Introduction

  • Big data basics & need for distributed computing. (brief)
  • Apache Hadoop & its distributions (brief)
  • Understanding Apache Hadoop core distribution & its ecosystem
  • Architecture & concepts
  • Scalability options
  • Data locality (locality of reference)
  • Fault tolerance via replication
  • Distributed storage & parallel processing
  • Rack-awareness
  • Services & processes
  • Codecs & compression
  • Metadata & data
  • Options to work with Hadoop (cloud / on-premises / VMs / containers)
  • Basic setup & working with standalone installation

Module 2: Hadoop internals & working – Deep Dive

  • Understanding metadata and data distribution.
  • Read & write path
  • Understanding storage layer - HDFS
  • Understanding processing model & layer - MapReduce & YARN (see the sketch after this list)
  • Hadoop High Availability & federation options.
  • Setting up Hadoop cluster (with or without HA [optional])
  • Configurations
  • Environment variables
  • Logging
  • Hadoop cluster administration & scenarios.
  • Working with cluster – CLI, tools, programming interfaces.
  • Data read & write, Data processing
  • Hadoop ecosystem components (brief with examples)
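
  Example (illustrative sketch): the MapReduce model referenced above can be exercised with Hadoop Streaming, which runs any executable as mapper and reducer. The word-count scripts below are a minimal sketch; Python 3 on the worker nodes and the input/output paths are assumptions.

      #!/usr/bin/env python3
      # mapper.py - emits one "<word><TAB>1" line per word read from stdin
      import sys

      for line in sys.stdin:
          for word in line.split():
              print(f"{word}\t1")

      #!/usr/bin/env python3
      # reducer.py - sums counts per word; Streaming delivers keys sorted,
      # so all lines for one word arrive together
      import sys

      current_word, count = None, 0
      for line in sys.stdin:
          word, value = line.rstrip("\n").split("\t")
          if word != current_word:
              if current_word is not None:
                  print(f"{current_word}\t{count}")
              current_word, count = word, 0
          count += int(value)
      if current_word is not None:
          print(f"{current_word}\t{count}")

  A typical submission uses the hadoop-streaming JAR, along the lines of: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (paths are placeholders).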

Module 3: Cluster planning

  • Picking a Distribution and Version of Hadoop
  • Understanding workloads & deployment options
  • Hardware & software selections
  • Cluster Sizing & scalability plan.
  • Disk, network & other considerations
    • Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
    • Given a scenario, determine the ecosystem components your cluster needs to run in order to fulfil the SLA
    • Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, and disk I/O (see the sizing sketch after this list)
    • Disk sizing and configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
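
  Sizing sketch (all figures are illustrative assumptions, not recommendations): raw capacity is usually estimated from ingest rate, retention, the HDFS replication factor and a temporary-space allowance, then divided by per-node disk capacity.

      daily_ingest_tb = 1.0        # raw data landed per day (assumption)
      retention_days = 365
      replication_factor = 3       # HDFS default replication
      temp_overhead = 0.25         # scratch space for shuffles, staging, etc.

      total_tb = daily_ingest_tb * retention_days * replication_factor * (1 + temp_overhead)
      node_disk_tb = 12 * 4        # e.g. 12 x 4 TB JBOD drives per worker (assumption)
      nodes = -(-total_tb // node_disk_tb)   # ceiling division

      print(f"~{total_tb:.0f} TB raw capacity -> roughly {int(nodes)} worker nodes")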

Module 4: Monitoring & logging

  • Understanding the functions and features of Hadoop metric collection (see the JMX sketch after this list)
  • Using the NameNode, YARN, History server Web UIs
  • Monitoring cluster daemons, resource usages & applications
  • Looking into log files
  • Capturing information using Hadoop admin commands
  • Understanding integrations with external options
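
  Example (sketch): besides the web UIs and admin commands, daemon metrics are exposed over HTTP via the /jmx endpoint. The snippet below polls a NameNode; the host name is a placeholder, and 9870 is the default NameNode web port in Hadoop 3.x (50070 in Hadoop 2.x).

      import requests

      # Query only the FSNamesystem MBean instead of the full metric dump
      resp = requests.get(
          "http://namenode-host:9870/jmx",
          params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"},
          timeout=10,
      )
      fs = resp.json()["beans"][0]
      print("Capacity used (bytes):", fs["CapacityUsed"])
      print("Live DataNodes:", fs["NumLiveDataNodes"])
      print("Missing blocks:", fs["MissingBlocks"])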

Module 5: Security - Authentication and Authorization

  • Understanding user authentication, permissions & ACLs
  • Understanding Kerberos
  • Recommendations & best practices

Apache Spark (3.5 days)

Note: some concepts in Spark overlap with topics covered in Hadoop

Module 1: Introduction

  • Why MapReduce is not enough
  • The role & need of Spark in big data processing
  • Spark architecture, its components & APIs
  • Understanding Apache Spark core distribution & its alternatives
  • Options to work with Spark (cloud / on-premises / VMs / containers)
  • Basic setup & working with local installation (see the sketch after this list)
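
  Example (sketch): a local installation can be exercised with nothing more than a SparkSession bound to the local master; a pip-installed pyspark is assumed.

      from pyspark.sql import SparkSession

      # local[*] runs the driver and executors in one JVM using all local cores
      spark = (SparkSession.builder
               .appName("intro-demo")
               .master("local[*]")
               .getOrCreate())

      df = spark.range(1_000_000)              # DataFrame with a single 'id' column
      print(df.selectExpr("sum(id)").first())
      spark.stop()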

Module 2: Spark internals & working – Deep Dive

  • Spark installation – standalone cluster & Spark UI
  • Spark installation – integration with Hadoop & YARN
  • Working with Spark interactive shells – PySpark (Python)
    • Knowing about other options
    • spark-shell (Scala)
    • spark-sql (SQL)
    • spark-submit (packaged applications)
  • Read & write path – interacting with storage
  • Understanding Spark APIs (RDDs, DataFrames, Datasets) & working (see the sketch after this list)
    • RDDs – unstructured & semi-structured data processing
    • DataFrames & Datasets – structured data processing
  • Spark internals: partitioning, caching, application hierarchy, lineage, DAGs, executors
  • Parallelism, fault tolerance, shuffling, etc.
  • Developing, deploying & running packaged Spark applications
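
  Example (sketch) for the API comparison above; paths, column names and formats are placeholders.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("api-demo").getOrCreate()

      # RDD API - unstructured text: classic word count
      counts = (spark.sparkContext.textFile("hdfs:///data/logs/*.log")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
      print(counts.take(5))

      # DataFrame API - structured data: read CSV, aggregate, write Parquet
      sales = spark.read.option("header", True).csv("hdfs:///data/sales.csv")
      summary = sales.groupBy("region").agg(F.sum("amount").alias("total"))
      summary.write.mode("overwrite").parquet("hdfs:///data/sales_summary")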

Module 3: Administering Spark Cluster & applications

  • Monitoring Spark applications and cluster resources
  • Spark standalone vs. Spark on Hadoop/YARN
  • Tuning Spark/applications for optimal performance (see the sketch after this list)
  • Identifying and resolving common issues
  • Recommendations & best practices
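
  Example (sketch): tuning usually starts from a handful of well-known knobs; the values below only illustrate where they are set and are assumptions, not recommendations.

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("tuned-job")
               .config("spark.sql.shuffle.partitions", "400")  # size to data volume & cores
               .config("spark.executor.memory", "4g")
               .config("spark.executor.cores", "2")
               .getOrCreate())

      events = spark.read.parquet("hdfs:///data/events")
      events.cache()          # reuse across several actions instead of recomputing
      print(events.count())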

Apache Kafka (3.5 days)

Module 1: Introduction

  • Fundamentals of messaging systems. (brief)
  • Use cases of distributed messaging systems (brief)
  • What is Apache Kafka & why
  • Apache Kafka vs. traditional messaging systems. (brief)
  • Overview of Kafka Features and Ecosystem (brief)
  • Apache Kafka On-premise vs. in the Cloud (brief)
  • Apache Kafka variants & in Docker/Kubernetes (brief)
  • Data formats & structures
  • Stream processing vs. batch processing

Module 2: Understanding Kafka fundamentals

  • Broker, Replicas, Controller, Leader, Follower, Topics, Partitions
  • Messages, Offsets, Consumer Groups, Log distribution, ZooKeeper and its need, KRaft, Compression, Compaction, etc.
  • Kafka APIs: Producers, Consumers, Stream processing, Connector

Module 3: Kafka Installation & Setup

  • Installing a standalone Kafka Broker (single node Kafka cluster)
  • Installing and Configuring Apache Kafka multi-node cluster.
  • Setting up ZooKeeper to manage the Kafka cluster (external)
  • The controller in newer versions where ZooKeeper is not used (brief introduction to KRaft)
  • Testing the cluster with ZooKeeper
  • Setting up the Development Environment (IDE integration) [optional]
  • Understanding basic important configs & working with command line.

Module 4: Deep Dive into Kafka

  • Understanding Kafka Internals & architecture
  • Leader & follower roles for Kafka brokers (load balancing)
  • ZooKeeper and its role (leader-follower, znodes)
  • Understanding fault tolerance & messaging semantics
  • Understanding reliability guarantees & consistency
  • Physical storage & understanding underlying log segments/index files
  • Log durability, retention & compaction
  • Minimum in-sync replicas (min.insync.replicas)

Module 5: Understanding Kafka APIs

  • Kafka Java Client APIs (a Python sketch of the same concepts follows this list)
  • Kafka Producer Java API
  • Kafka Consumer Java API
  • Kafka AdminClient Java API
  • Kafka Producers
    • Constructing a Kafka Producer, publishing messages, configurations for producers, understanding serializers, interceptors, headers, partitions, consistency, retries, compression, quotas/throttling, etc.
    • Related configurations for optimized performance
  • Kafka Consumers
    • Constructing a consumer, working with consumers, consumer Groups
    • Subscribing & consuming from topics
    • Understanding polling & the heartbeat thread, commits & offsets, fetch behaviour, auto offset reset, preferred read replicas, rebalance listeners, serializers & deserializers
    • Related configurations for optimized performance
  • Kafka Streams Java API (brief & optional)
  • Kafka Connect Java API (brief & optional)
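
  Example (sketch): the course works with the Java client; the Python snippet below (confluent-kafka library; broker, topic and group names are placeholders) illustrates the same producer and consumer concepts.

      from confluent_kafka import Consumer, Producer

      producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})

      def on_delivery(err, msg):
          # invoked asynchronously once the broker acknowledges (or rejects) the record
          if err is not None:
              print("delivery failed:", err)
          else:
              print(f"delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

      producer.produce("demo-topic", key="user-1", value="hello", callback=on_delivery)
      producer.flush()                       # block until outstanding deliveries are acknowledged

      consumer = Consumer({
          "bootstrap.servers": "localhost:9092",
          "group.id": "demo-group",          # group id drives offset tracking & rebalancing
          "auto.offset.reset": "earliest",   # where to start when no committed offset exists
      })
      consumer.subscribe(["demo-topic"])
      msg = consumer.poll(timeout=5.0)       # fetch one message, or None on timeout
      if msg is not None and msg.error() is None:
          print(msg.key(), msg.value())
      consumer.close()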

Module 6: Administration & operations - CLI

  • Topic operations
  • Balancing leadership
  • Checking consumer position & consumer groups
  • Mirroring data between clusters
  • Scaling the cluster up or down
  • Clean-up policy (compact/delete)
  • Dynamic Configuration Changes
  • Partition Management

Module 7: Managing Kafka Programmatically

  • AdminClient Overview & Lifecycle: Creating, Configuring and Closing (see the sketch after this list)
  • Configuration management
  • Consumer group management
  • Cluster Metadata
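
  Example (sketch), again using the Python confluent-kafka AdminClient to mirror the Java API concepts; topic and broker names are placeholders.

      from confluent_kafka.admin import AdminClient, NewTopic

      admin = AdminClient({"bootstrap.servers": "localhost:9092"})

      # Create a topic and wait on the result futures
      futures = admin.create_topics([NewTopic("demo-topic", num_partitions=3, replication_factor=1)])
      for topic, future in futures.items():
          try:
              future.result()               # raises on error (e.g. topic already exists)
              print(f"created {topic}")
          except Exception as exc:
              print(f"failed to create {topic}: {exc}")

      # Cluster metadata: brokers and topics currently known to the cluster
      metadata = admin.list_topics(timeout=10)
      print("brokers:", list(metadata.brokers))
      print("topics:", list(metadata.topics))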

Module 8: Monitoring & logging

  • Using tools/CLI to monitor Kafka Cluster
  • Understanding metrics emitted by Kafka & ZooKeeper
  • Client, performance & lag monitoring (see the lag sketch after this list)
  • Kafka logs
  • Known issues and optimizing Kafka & its components
  • Troubleshooting
  • Cluster planning considerations
  • Best practices & recommendations
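
  Example (sketch): consumer lag is the difference between a partition's latest offset (high watermark) and the group's committed offset. A minimal check with the Python client; group, topic and partition are placeholders.

      from confluent_kafka import Consumer, TopicPartition

      consumer = Consumer({
          "bootstrap.servers": "localhost:9092",
          "group.id": "demo-group",          # the group whose lag we want to inspect
          "enable.auto.commit": False,
      })

      tp = TopicPartition("demo-topic", 0)
      committed = consumer.committed([tp], timeout=10)[0].offset
      _low, high = consumer.get_watermark_offsets(tp, timeout=10)
      lag = high - committed if committed >= 0 else high   # committed < 0 means no offset yet
      print(f"partition 0 lag: {lag}")
      consumer.close()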

FAQs

Apache Airflow (3.5 days)

Module 1: Introduction

  • Need for Airflow and its alternatives
  • Understanding DAGs, automation of data engineering flows
  • Overview of Apache Airflow Features and Architecture
  • Deployment & setup options

Module 2: Setup & working

  • Setting up Apache Airflow (single node vs. multi node architecture)
  • Navigating the Apache Airflow UI & its views
  • Using the CLI

Module 3: Airflow internals & working

  • Understanding DAGs, operators, providers, connections, sensors, hooks, etc. (see the DAG sketch after this list)
  • Working with databases, executors, workers & queues
  • Working with Tasks, TaskGroups, DAGs, SubDAGs, XComs, etc.
  • Reading & writing data
  • Integrating with Spark, Kafka or Hadoop
  • Creating flows & customizing
  • Monitoring, logging flows
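
  Example (sketch): a minimal DAG wiring two tasks together. The DAG id, schedule and callables are placeholders; the schedule argument is named schedule_interval in Airflow versions before 2.4.

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.bash import BashOperator
      from airflow.operators.python import PythonOperator

      def extract():
          print("pulling data from the source system")

      with DAG(
          dag_id="example_pipeline",
          start_date=datetime(2024, 1, 1),
          schedule="@daily",
          catchup=False,
      ) as dag:
          pull = PythonOperator(task_id="extract", python_callable=extract)
          load = BashOperator(task_id="load", bash_command="echo 'load step'")
          pull >> load      # run 'extract' before 'load'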

Recommendations & best practices