Big Data Architecture (bigdatarch | 21 hours)

Overview:

Big Data Architecture covers the design of distributed systems for storing, processing, and analyzing data at a scale beyond the reach of traditional databases, using technologies such as HDFS, YARN, Apache Spark, Apache Hive, and Apache Airflow.

This instructor-led, live training (online or onsite) is aimed at intermediate-level data engineers and architects who wish to use the Hadoop and Spark ecosystem on Cloudera to design, build, and optimize large-scale data storage and processing pipelines.

By the end of this training, participants will be able to:

  • Understand how distributed storage and resource management work in HDFS, Ozone, and YARN.
  • Process and transform large datasets using Spark RDDs, Data Frames, and HiveQL.
  • Optimize Spark workloads through persistence, partitioning, and skew handling.
  • Orchestrate and monitor data engineering pipelines with Apache Airflow and Workload XM.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request a customized training for this course, please contact us to arrange.
Course Outline:

Day 1: Understanding Distributed Systems and Data Storage with HDFS


Session 1: Introduction to Distributed Data Systems
● Overview of distributed data storage and processing in modern data architectures.
● Introduction to Cloudera Data Engineering tools and their role in managing big data.
Session 2: HDFS – Hadoop Distributed File System
● Key Concepts: Introduction to HDFS and its role in data storage across clusters.
● Components Overview: Explore the architecture of HDFS, including the NameNode, DataNodes, and their interactions.
● Hands-On Exercise: Interacting with HDFS for uploading, retrieving, and managing data.
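
Reference sketch for this exercise: a minimal Python wrapper around the hdfs dfs CLI showing typical upload, retrieval, and cleanup operations. The paths and file names are placeholders, not part of the official lab materials.

    # Illustrative only: basic HDFS file-system operations via the "hdfs dfs" CLI.
    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' subcommand on the lab cluster and return its output."""
        return subprocess.run(["hdfs", "dfs", *args],
                              check=True, capture_output=True, text=True).stdout

    hdfs("-mkdir", "-p", "/user/student/raw")                 # create a directory
    hdfs("-put", "local_data.csv", "/user/student/raw")       # upload a local file
    print(hdfs("-ls", "/user/student/raw"))                   # list directory contents
    print(hdfs("-cat", "/user/student/raw/local_data.csv"))   # read the file back
    hdfs("-rm", "/user/student/raw/local_data.csv")           # remove the file
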
Session 3: Advanced HDFS Features and Ozone
● Additional HDFS functionalities and best practices for data management.
● Introduction to Ozone as an alternative to HDFS for scalable, cloud-native object storage.
● Exercise: Working with HDFS to perform file system operations and explore Ozone.
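
Reference sketch for the Ozone part of this exercise: Ozone organizes data as volumes, buckets, and keys, managed here through the ozone sh CLI from Python. Volume, bucket, and key names are placeholders, and the exact service configuration depends on the lab cluster.

    # Illustrative only: Ozone's volume/bucket/key model via the "ozone sh" CLI.
    import subprocess

    def ozone(*args):
        return subprocess.run(["ozone", "sh", *args],
                              check=True, capture_output=True, text=True).stdout

    ozone("volume", "create", "/training")                              # create a volume
    ozone("bucket", "create", "/training/landing")                      # create a bucket inside it
    ozone("key", "put", "/training/landing/events.csv", "events.csv")   # upload a local file as a key
    print(ozone("key", "list", "/training/landing"))                    # list keys in the bucket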


Day 2: Data Processing with YARN, RDDs, and Data Frames
Session 1: YARN – Resource Management in Distributed Environments
● YARN Overview: Introduction to YARN (Yet Another Resource Negotiator) and its role in resource management.
● YARN Architecture: Key components such as the ResourceManager and NodeManagers, and how they interact with applications.
● Exercise: Managing applications and resources using YARN.
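
Reference sketch for this exercise: inspecting cluster resources and running applications with the yarn CLI from Python. The application ID shown is a placeholder.

    # Illustrative only: checking NodeManagers and application status via the "yarn" CLI.
    import subprocess

    def yarn(*args):
        return subprocess.run(["yarn", *args],
                              check=True, capture_output=True, text=True).stdout

    print(yarn("node", "-list"))          # NodeManagers registered with the ResourceManager
    print(yarn("application", "-list"))   # applications currently accepted or running
    # Status of a single application (the ID below is a placeholder):
    print(yarn("application", "-status", "application_1700000000000_0001"))
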
Session 2: Working with RDDs (Resilient Distributed Datasets)
● RDD Concepts: Introduction to RDDs for fault-tolerant data processing.
● Transforming and managing large-scale datasets using RDD operations.
● Hands-On Exercise: Implementing RDD transformations and actions.
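
Reference sketch for this exercise: a small PySpark word count that separates lazy transformations from actions. The sample data and application name are illustrative only.

    # Illustrative only: RDD transformations vs. actions in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["big data", "data engineering", "big clusters"])

    # Transformations are lazy: they only describe the computation.
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))     # sum the counts per word

    # Actions trigger execution and return results to the driver.
    print(counts.collect())   # e.g. [('big', 2), ('data', 2), ...]

    spark.stop()
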
Session 3: Data Frames for High-Level Data Operations
● Introduction to Data Frames: A higher-level, schema-aware abstraction over RDDs for efficient processing of structured data.
● Hands-On Exercises:
○ Reading and writing Data Frames from various data sources.
○ Manipulating Data Frames: Working with columns, complex types, and transformations.
○ Grouping, summarizing, and applying user-defined functions (UDFs) to Data Frames.
○ Working with window functions for advanced data analysis.
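
Reference sketch for these exercises: one PySpark snippet touching each of the listed operations: reading, column manipulation with a UDF, grouping, a window function, and writing. The CSV path and column names are placeholders.

    # Illustrative only: common Data Frame operations covered in this session.
    from pyspark.sql import SparkSession, functions as F, Window
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # Reading from a data source (placeholder path and schema)
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

    # Column manipulation with a simple user-defined function (UDF)
    label = F.udf(lambda amount: "high" if amount > 1000 else "low", StringType())
    sales = sales.withColumn("bucket", label(F.col("amount")))

    # Grouping and summarizing
    totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

    # Window function: rank rows by amount within each region
    w = Window.partitionBy("region").orderBy(F.col("amount").desc())
    ranked = sales.withColumn("rank_in_region", F.rank().over(w))

    # Writing results back out
    totals.write.mode("overwrite").parquet("hdfs:///data/sales_totals")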


Day 3: Advanced Data Processing with Apache Hive, Spark, and Data Engineering Tools
Session 1: Apache Hive – Structured Data Management
● Introduction to Hive: Managing large datasets with Hive's SQL-like query language (HiveQL).
● Data Transformation with Hive: Using HiveQL for data processing and transformation.
● Hands-On Exercises:
○ Partitioning and bucketing data in Hive for efficient querying.
○ Working with skewed data and using SerDes (serializers/deserializers) for text data ingestion.
○ Denormalizing data using complex types in Hive.
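
Reference sketch for these exercises: HiveQL for a partitioned, bucketed table, issued here through Spark SQL to stay consistent with the rest of the course environment. Database, table, and column names are placeholders.

    # Illustrative only: creating and loading a partitioned, bucketed Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-demo")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_by_day (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        CLUSTERED BY (customer_id) INTO 8 BUCKETS
        STORED AS ORC
    """)

    # Load one partition from a staging table (placeholder name).
    spark.sql("""
        INSERT OVERWRITE TABLE sales_by_day PARTITION (order_date = '2024-01-01')
        SELECT order_id, customer_id, amount
        FROM staging_sales
        WHERE order_date = '2024-01-01'
    """)
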
Session 2: Spark Integration and Distributed Processing
● Spark and Hive Integration: Leveraging Spark’s in-memory processing for Hive data.
● Spark Distributed Processing: Understanding how Spark handles data distribution and optimization.
● Exercise: Running Spark queries integrated with Hive to process and analyze large datasets.
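
Reference sketch for this exercise: querying Hive-managed data from Spark so the result arrives as a Data Frame and benefits from Spark's in-memory processing. The table name reuses the placeholder from the previous sketch.

    # Illustrative only: running a Spark query against a Hive table.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("spark-hive-demo")
             .enableHiveSupport()     # lets Spark see the Hive metastore
             .getOrCreate())

    # spark.sql returns a Data Frame, so Hive tables join the usual Spark pipeline.
    daily = spark.sql("SELECT order_date, amount FROM sales_by_day")

    top_days = (daily.groupBy("order_date")
                     .agg(F.sum("amount").alias("revenue"))
                     .orderBy(F.col("revenue").desc())
                     .limit(10))

    top_days.show()
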
Session 3: Data Persistence and Performance Optimization
● Data Frame and RDD Persistence: Choosing persistence storage levels for optimized processing.
● Exercise: Persisting Data Frames and viewing persisted RDDs to manage data efficiently.
● Optimizing Spark Jobs: Techniques for improving workload performance and handling challenges like shuffle and data skew.
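
Reference sketch for this exercise: persisting a reused Data Frame and repartitioning on a key, two of the techniques discussed for shuffle and skew. Paths, column names, and partition counts are placeholders.

    # Illustrative only: choosing a storage level and reducing shuffle pressure.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    events = spark.read.parquet("hdfs:///data/events")   # placeholder path

    # Persist a Data Frame that several downstream actions will reuse;
    # MEMORY_AND_DISK spills partitions to disk instead of recomputing them.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    events.count()                                    # materializes the cache
    events.groupBy("event_type").count().show()       # reuses the cached partitions

    # Repartitioning on the join/group key can reduce shuffle and ease skew.
    balanced = events.repartition(200, "event_type")

    events.unpersist()                                # release the cache when done
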
Session 4: Orchestrating and Managing Data Workflows
● Data Engineering with Cloudera: Managing complex workflows and job orchestration using Apache Airflow.
● Workload Optimization: Using Workload XM to identify and resolve performance bottlenecks in Spark jobs.
● Exercise: Automating data engineering pipelines, creating job workflows, and optimizing workloads using Airflow.
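
Reference sketch for this exercise: a minimal Airflow DAG with two dependent tasks, an ingestion step and a spark-submit step. The DAG id, schedule, commands, and paths are placeholders, and the parameter names assume a recent Airflow 2.x release rather than any Cloudera-specific operator.

    # Illustrative only: a two-task Airflow DAG that submits a Spark job daily.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_sales_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:

        ingest = BashOperator(
            task_id="ingest_raw_data",
            bash_command="hdfs dfs -put -f /staging/sales.csv /data/raw/",
        )

        transform = BashOperator(
            task_id="run_spark_job",
            bash_command="spark-submit --master yarn --deploy-mode cluster /jobs/transform_sales.py",
        )

        ingest >> transform   # run the Spark job only after ingestion succeeds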