Course Code:
bdatastbesp
Duration:
105 hours
Course Outline:
Apache Hadoop (3.5 days)
Module 1: Introduction
- Big data basics & need for distributed computing. (brief)
- Apache Hadoop & its distributions (brief)
- Understanding Apache Hadoop core distribution & its ecosystem
- Architecture & concepts
- Scalability options
- Locality of reference (data locality)
- Fault tolerance via replication
- Distributed storage & parallel processing
- Rack-awareness
- Services & processes
- Codecs & compression
- Metadata & data
- Options to work with Hadoop (cloud / on-premises / VMs / containers)
- Basic setup & working with standalone installation
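The replication and rack-awareness ideas above can be sketched in plain Python. This is an illustrative model of the default HDFS placement heuristic (first replica on the writer's node, second on a remote rack, third on a different node of that same rack); the function and data-structure names are assumptions for this sketch, not the actual HDFS BlockPlacementPolicy implementation.

```python
# Simplified sketch of HDFS-style rack-aware replica placement
# (illustrative only, not the real HDFS code).

def place_replicas(writer_node, nodes_by_rack):
    """Heuristic: 1st replica on the writer's node, 2nd on a
    different rack, 3rd on another node of that remote rack."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    replicas = [writer_node]
    # Second replica: a node on a remote rack (survives a rack failure).
    remote_rack = next(r for r in nodes_by_rack if r != writer_rack)
    replicas.append(nodes_by_rack[remote_rack][0])
    # Third replica: a different node on that same remote rack
    # (limits cross-rack write traffic).
    third = next(n for n in nodes_by_rack[remote_rack] if n not in replicas)
    replicas.append(third)
    return replicas

cluster = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
}
print(place_replicas("dn1", cluster))  # ['dn1', 'dn3', 'dn4']
```

The trade-off the heuristic encodes: one rack failure never loses all copies, yet only one replica crosses the rack boundary during the write.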
Module 2: Hadoop internals & working – Deep Dive
- Understanding metadata and data distribution.
- Read & write path
- Understanding storage layer - HDFS
- Understanding processing model & layer - MapReduce & YARN.
- Hadoop High Availability & federation options.
- Setting up Hadoop cluster (with or without HA [optional])
- Configurations
- Environment variables
- Logging
- Hadoop cluster administration & scenarios.
- Working with cluster – CLI, tools, programming interfaces.
- Data read & write, Data processing
- Hadoop ecosystem components (brief with examples)
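The processing model covered above (MapReduce) can be illustrated with a plain-Python word count that mimics the three phases; this is a conceptual sketch of the model, not Hadoop's actual Java Mapper/Reducer API.

```python
# Plain-Python simulation of the MapReduce word-count flow:
# map -> shuffle/sort -> reduce (illustrative model only).
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "data locality"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'compute': 1, 'locality': 1}
```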
Module 3: Cluster planning
- Picking a Distribution and Version of Hadoop
- Understanding workloads & deployment options
- Hardware & software selections
- Cluster Sizing & scalability plan.
- Disk, network & other considerations
- Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
- Given a scenario, determine the ecosystem components your cluster needs to run in order to fulfil the SLA
- Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including
- CPU, memory, storage, disk I/O
- Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
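A common rule of thumb for the disk-sizing exercise above: raw capacity is logical data times the replication factor, plus headroom for intermediate and temporary data. The function and the 25% overhead figure are illustrative assumptions, not an official formula.

```python
# Back-of-envelope HDFS storage sizing (rule of thumb, not an
# official formula): data x replication, plus scratch headroom.

def required_raw_storage_tb(data_tb, replication=3, temp_overhead=0.25):
    """Raw disk needed to hold `data_tb` of logical data."""
    return data_tb * replication * (1 + temp_overhead)

# 100 TB of data, 3x replication, 25% headroom for temp/shuffle data:
print(required_raw_storage_tb(100))  # 375.0 TB of raw disk
```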
Module 4: Monitoring & logging
- Understand the functions and features of Hadoop metric collection abilities
- Using the NameNode, YARN, History server Web UIs
- Monitoring cluster daemons, resource usages & applications
- Looking into log files
- Capturing information using Hadoop admin commands
- Understanding integrations with external options
Module 5: Security - Authentication & Authorization
- Understanding user authentication, permissions & ACLs
- Understanding Kerberos
- Recommendations & best practices
Apache Spark (3.5 days)
Note: some concepts in Spark overlap with topics covered in Hadoop
Module 1: Introduction
- Why MapReduce is not enough
- The role & need of Spark in big data processing
- Spark architecture, its components & APIs
- Understanding Apache Spark core distribution & its alternatives
- Options to work with Spark (cloud / on-premises / VMs / containers)
- Basic setup & working with local installation
Module 2: Spark internals & working – Deep Dive
- Spark installation – standalone cluster & Spark UI
- Spark installation – integration with Hadoop & YARN
- Working with Spark interactive shells – PySpark (Python)
- Knowing about other options
- spark-shell (Scala)
- spark-sql (SQL)
- spark-submit (packaged applications)
- Read & write path – interacting with storage
- Understanding Spark APIs (RDDs, DataFrames, Datasets) & working with them
- RDDs – unstructured & semi-structured data processing
- DataFrames & Datasets – structured data processing
- Spark internals: partitioning, caching, application hierarchy, lineage, DAGs, executors
- Parallelism, fault tolerance, shuffling etc
- Developing, Deploying & running packaged spark applications
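The lineage and lazy-evaluation internals above can be modeled in plain Python: transformations only record a plan, and nothing runs until an action is called. This is a conceptual toy, not the PySpark API.

```python
# Plain-Python model of Spark's lazy evaluation: transformations
# record lineage; an action replays it (illustrative only).

class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage       # recorded transformations (the "DAG")

    def map(self, fn):               # transformation: lazy, returns new RDD
        return ToyRDD(self._data, self.lineage + (("map", fn),))

    def filter(self, fn):            # transformation: also lazy
        return ToyRDD(self._data, self.lineage + (("filter", fn),))

    def collect(self):               # action: actually runs the lineage
        out = self._data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# At this point nothing has been computed - only the plan exists.
print(rdd.collect())  # [20, 30, 40]
```

Recomputing from lineage rather than checkpointing every step is also what gives Spark its fault tolerance: a lost partition can be rebuilt by replaying the recorded transformations.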
Module 3: Administering Spark Cluster & applications
- Monitoring Spark applications and cluster resources
- Spark standalone vs. Spark on Hadoop/YARN
- Tuning Spark/applications for optimal performance
- Identifying and resolving common issues
- Recommendations & best practices
Apache Kafka (3.5 days)
Module 1: Introduction
- Fundamentals of messaging systems. (brief)
- Use cases of distributed messaging systems (brief)
- What is Apache Kafka & why
- Apache Kafka vs. traditional messaging systems. (brief)
- Overview of Kafka Features and Ecosystem (brief)
- Apache Kafka On-premise vs. in the Cloud (brief)
- Apache Kafka variants & in Docker/Kubernetes (brief)
- Data formats & structures
- Stream processing vs. batch processing
Module 2: Understanding Kafka fundamentals
- Broker, Replicas, Controller, Leader, Follower, Topics, Partitions
- Messages, Offsets, Consumer Groups, log distribution, ZooKeeper and its need, KRaft, compression, compaction etc
- Kafka APIs: Producer, Consumer, Streams, Connect
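The relationship between keys and partitions above can be shown in a few lines. Kafka's default partitioner applies murmur2 to the serialized key; here a stdlib CRC32 stands in to show the idea (same key always maps to the same partition), so the hash choice is an assumption of this sketch.

```python
# Sketch of keyed-message-to-partition mapping. Kafka's default
# partitioner uses murmur2 on the key bytes; zlib.crc32 is a
# stand-in here to illustrate the principle.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# Messages with the same key land in the same partition, which is
# what preserves per-key ordering:
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
```

This is also why increasing the partition count of an existing topic breaks key-to-partition stability: the modulus changes.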
Module 3: Kafka Installation & Setup
- Installing a standalone Kafka Broker (single node Kafka cluster)
- Installing and Configuring Apache Kafka multi-node cluster.
- Setting up ZooKeeper to manage the Kafka cluster (external)
- The controller – in newer versions where ZooKeeper is not used (brief introduction to KRaft)
- Testing the Cluster with ZK
- Setting up the Development Environment (IDE integration) [optional]
- Understanding basic important configs & working with command line.
Module 4: Deep Dive into Kafka
- Understanding Kafka Internals & architecture
- Leader, follower role for Kafka brokers (Load balancing)
- ZooKeeper and its role (leader-follower, znodes)
- Understanding fault tolerance & messaging semantics
- Understanding reliability guarantees & consistency
- Physical storage & understanding underlying log segments/index files
- Log durability, retention & compaction
- Minimum in-sync replicas (min.insync.replicas)
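The retention and compaction behaviour above can be modeled in a few lines: under compaction, only the latest value per key survives, and a null value acts as a tombstone. This is a toy model; real compaction operates segment-by-segment on disk.

```python
# Toy model of Kafka log compaction: keep only the latest value
# per key; a value of None acts as a tombstone that deletes the
# key. Illustrative only, not the actual log-cleaner behaviour.

def compact(log):
    latest = {}
    for key, value in log:      # later records overwrite earlier ones
        latest[key] = value
    # Tombstones (None) remove the key entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
print(compact(log))  # [('a', 3)]
```

Compacted topics therefore behave like a changelog: replaying one from the beginning reconstructs the current state per key.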
Module 5: Understanding Kafka APIs
- Kafka Java Client APIs
- Kafka Producer Java API
- Kafka Consumer Java API
- Kafka AdminClient Java API
- Kafka Producers
- Constructing a Kafka producer, publishing messages, producer configurations, serializers, interceptors, headers, partitions, consistency, retries, compression, quotas/throttling etc.
- Related configurations for optimized performance
- Kafka Consumers
- Constructing a consumer, working with consumers, consumer Groups
- Subscribing & consuming from topics
- Understanding polling & the heartbeat thread, commits & offsets, fetch behaviour, auto offset reset, rebalance listeners, serializers & deserializers
- Related configurations for optimized performance
- Kafka Streams Java API (brief & optional)
- Kafka Connect Java API (brief & optional)
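The commit-and-offset mechanics covered above can be sketched with an in-memory consumer: committing after processing gives at-least-once delivery, because a crash before the commit causes the same records to be re-delivered. The class and method names are illustrative, not the Kafka client API.

```python
# Minimal model of consumer offsets and commits (illustrative
# only, not the Kafka Java/Python client API).

class ToyConsumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0          # last committed offset

    def poll(self, max_records=2):
        # Reads resume from the last committed offset.
        return self.log[self.committed:self.committed + max_records]

    def commit(self, n):
        self.committed += n

consumer = ToyConsumer(["m0", "m1", "m2", "m3"])
batch = consumer.poll()             # ['m0', 'm1']
# ... process the batch, then commit. If the consumer crashed
# before this commit, the next poll() would re-deliver the same
# records: that is at-least-once delivery.
consumer.commit(len(batch))
print(consumer.poll())  # ['m2', 'm3']
```

Committing before processing flips the trade-off to at-most-once: a crash after the commit loses the uncommitted work instead of replaying it.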
Module 6: Administration & operations - CLI
- Topic operations
- Balancing leadership
- Checking consumer position & consumer groups
- Mirroring data between clusters
- Scaling the cluster up or down
- Clean-up policy (compact/delete)
- Dynamic Configuration Changes
- Partition Management
Module 7: Managing Kafka Programmatically
- AdminClient Overview & Lifecycle: Creating, Configuring and Closing
- Configuration management
- Consumer group management
- Cluster Metadata
Module 8: Monitoring & logging
- Using tools/CLI to monitor Kafka Cluster
- Understanding emitted metrics from Kafka & ZooKeeper
- Client, performance & Lag Monitoring
- Kafka logs
- Known issues and optimizing Kafka & its components
- Troubleshooting
- Cluster planning considerations
- Best practices & recommendations
FAQs
Apache Airflow (3.5 days)
Module 1: Introduction
- Need for Airflow and its alternatives
- Understanding DAGs, automation of data engineering flows
- Overview of Apache Airflow Features and Architecture
- Deployment & setup options
Module 2: Setup & working
- Setting up Apache Airflow (single node vs. multi node architecture)
- Navigating the Apache Airflow UI & its views
- Using the CLI
Module 3: Airflow internals & working
- Understanding DAGs, operators, providers, connections, sensors, hooks etc
- Working with databases, executors, workers & queues
- Working with Tasks, TaskGroups, DAGs, SubDAGs, XComs etc
- Reading & writing data
- Integrating with Spark, Kafka or Hadoop
- Creating flows & customizing
- Monitoring, logging flows
- Recommendations & best practices
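The core idea behind the DAGs above is tasks plus dependency edges run in topological order; it can be sketched with the standard library alone. The task names and dependency graph here are invented for illustration and this is not the Airflow operator API.

```python
# Sketch of the idea behind an Airflow DAG: tasks with dependency
# edges, executed in topological order (plain Python, using the
# stdlib graphlib; not the Airflow API).
from graphlib import TopologicalSorter

# A hypothetical ETL flow: extract feeds both transform and a
# quality check, and load waits for both.
deps = {
    "transform": {"extract"},
    "quality_check": {"extract"},
    "load": {"transform", "quality_check"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. ['extract', 'transform', 'quality_check', 'load']
```

In Airflow itself the same shape would be declared with operators and `>>` dependencies; the scheduler then runs ready tasks (those whose upstreams have succeeded) in parallel rather than strictly one at a time.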