Course Code: aha
Duration: 28 hours
Prerequisites:

Basic Hadoop Administration

Course Outline:

Unit 1: Introduction: What is Big Data & Why Big Data?

  • Most Common New Types of Data

  • What is Hadoop & Hadoop Ecosystem
  • Traditional Systems vs. Hadoop
  • Overview of a Hadoop Cluster 
  • Comparison of different distributions
  • Understanding the Cloudera Distribution of Hadoop (CDH)
  • Understanding the Hortonworks Data Platform (HDP)
  • Understanding the Apache Hadoop Distribution
  • Real-world Use Cases

Unit 2: Understanding Hadoop Services

  • Understanding HDFS Architecture
  • Understanding Cloudera Manager
  • Understanding Ambari
  • A brief on HDFS clients, edge nodes, and master & worker nodes
  • Understanding reads and writes on HDFS and related issues
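The HDFS behavior covered in this unit is governed by a handful of core properties. A minimal illustrative hdfs-site.xml fragment (the local paths are placeholders, not recommendations):

```xml
<!-- Illustrative hdfs-site.xml fragment; paths are placeholders -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replication factor for new files -->
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB HDFS block size -->
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hdfs/namenode</value> <!-- NameNode metadata directory -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdfs/datanode</value> <!-- DataNode block storage -->
</property>
```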

Unit 3: Installation Prerequisites and Planning

  • Minimum Hardware Requirements
  • Minimum Software Requirements
  • Understanding variables in the context of cluster setup & a brief on cluster planning
  • A quick comparison of features/commands/scripts in Hadoop 1.x/2.x

Unit 4: Configuring Hadoop – Setting Up a Hadoop Cluster (Apache & CDH/HDP)

  • Configuration considerations & important configs for Hadoop & its services

  • Understanding internals of a running cluster: Block Pool ID, Cluster ID, Namespace ID (NSID)
  • Simulating over-/under-replicated and corrupt block scenarios; understanding replication rules, different replication possibilities, and rack awareness
  • The Secondary NameNode's checkpointing process
  • Building simple monitoring scripts; automation possibilities, tasks, and capturing information
  • Working with a Hadoop cluster:
    Commissioning/Decommissioning nodes in the cluster
    Stopping, starting, adding & removing HDP/CDH services
    Using HDFS commands & services
    Accessing services through a UI or web interface such as Hue or Ambari Views
  • Analyzing and resolving problems: examples from live production environments
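The replication and health-check exercises above map onto a few standard HDFS commands. A sketch, assuming a running CDH/HDP cluster (the file path is a placeholder):

```shell
# Report overall cluster capacity, live/dead DataNodes, and block status
hdfs dfsadmin -report

# Check a path for missing, corrupt, or under-replicated blocks
hdfs fsck / -files -blocks -locations

# Deliberately change one file's replication factor to observe the
# NameNode repairing under-/over-replication (-w waits for completion)
hdfs dfs -setrep -w 2 /tmp/sample.txt
```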

Unit 5: Understanding Computation Frameworks

  • Understanding YARN and MapReduce

  • Lifecycle of a YARN application & its components
  • Understanding the in-memory computing framework: Spark
  • Spark internals and architecture
  • Configuring YARN
  • Setting up Apache Spark standalone and integrating it with YARN
  • Running & troubleshooting MapReduce jobs
  • Running jobs for cluster stress testing and performance testing
  • Running jobs through the terminal or a web-based UI like Hue
  • Running Spark applications and troubleshooting
  • Performance tuning considerations
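The stress-testing and Spark-on-YARN topics above can be exercised with the example jars that ship with Hadoop and Spark. A sketch, assuming a running cluster and standard installation paths (output directories and row counts are placeholders):

```shell
# Stress-test MapReduce and HDFS I/O with TeraGen (1M rows here)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 1000000 /tmp/teragen-out

# Submit a bundled Spark example on YARN in cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```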

Unit 6: Job Schedulers

  • Overview of Job Scheduling

  • The Built-in Schedulers
  • Understanding FIFO, the Fair Scheduler and the Capacity Scheduler
  • Configuring the schedulers, queues and resource allocation
  • Configuring capacity & user limits
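Queue capacities and user limits for the Capacity Scheduler live in capacity-scheduler.xml. An illustrative fragment with two hypothetical queues:

```xml
<!-- Illustrative capacity-scheduler.xml fragment; queue names are hypothetical -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value> <!-- two top-level queues under root -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- prod gets 70% of cluster resources -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
  <value>50</value> <!-- dev may elastically grow up to 50% -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
  <value>2</value> <!-- one user may take up to 2x the per-user share -->
</property>
```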

Unit 7: Ensuring Data Integrity

  • Limiting data volume and object counts using quotas

  • Limiting read/write access to the cluster
  • Understanding ACLs: for services and MapReduce jobs
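HDFS quotas are set with `hdfs dfsadmin`. A sketch, assuming superuser access on a running cluster (the directory is a placeholder):

```shell
# Cap the number of names (files + directories) under a path
hdfs dfsadmin -setQuota 100000 /user/projectx

# Cap the raw disk space consumed (counts all replicas)
hdfs dfsadmin -setSpaceQuota 1t /user/projectx

# Inspect current quotas and usage in human-readable units
hdfs dfs -count -q -h /user/projectx
```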

Unit 8: Data Ingestion & Available Tools

  • Distributed Copy (distcp)  

  • Ingesting Structured Data: Sqoop

  • Ingesting Streaming Data: Flume
  • Understanding different data formats
  • A brief on Apache Kafka and its internals
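The DistCp and Sqoop topics above reduce to short command lines. A sketch, assuming reachable clusters and database (hostnames, tables, and paths are placeholders):

```shell
# Copy a dataset between two clusters with DistCp
hadoop distcp \
  hdfs://nn1.example.com:8020/data/events \
  hdfs://nn2.example.com:8020/data/events

# Import a relational table into HDFS with Sqoop
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```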

Unit 9: Hive & Impala

  • Introduction to Hive & Impala

  • Comparing Hive with an RDBMS
  • Hive services: CLI, Metastore & HiveServer2
  • Setting up Hive in different ways and working with the CLI & clients
  • Working with HiveQL statements & usage of collection/primitive datatypes
  • Working with Hive structures, functions, joins, and various Hive properties
  • Understanding partitioning and bucketing
  • Hive on MR versus Hive on Tez
  • Understanding compression and the different formats supported
  • Hive security & tuning
  • Working with Impala
  • Some real-world issues
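Partitioning and storage formats come together in table DDL. An illustrative HiveQL sketch (table and column names are hypothetical; `INSERT ... VALUES` assumes Hive 0.14+):

```sql
-- Partitioned, ORC-backed table: partition pruning on dt avoids full scans
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Write one row into a single partition, then query with a pruning predicate
INSERT INTO sales PARTITION (dt = '2024-01-01') VALUES (1, 99.50);
SELECT SUM(amount) FROM sales WHERE dt = '2024-01-01';
```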

Unit 10: Workflow scheduling using Oozie

  • Oozie Overview & its Components

  • Jobs, Workflows, Coordinators, Bundles
  • Workflow Actions and Decisions
  • Oozie Actions & Job Submission
  • The Oozie Server & Workflow Coordinator
  • The Oozie Console & Interfaces to Oozie
  • Oozie Server Configuration
  • Oozie Scripts & Using the Oozie CLI
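A workflow is defined in a workflow.xml placed in HDFS. A minimal illustrative example with a single shell action (the app name, script, and `${...}` parameters are hypothetical and would be supplied via job.properties):

```xml
<!-- Minimal Oozie workflow: start -> one shell action -> end/kill -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="cleanup"/>
  <action name="cleanup">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanup.sh</exec>
      <file>${wfRoot}/cleanup.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Cleanup failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```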

Unit 11: NameNode High Availability & possibilities

  • HDFS HA components & how they work

  • Understanding setup and failovers
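Failover state can be inspected and driven from the command line. A sketch, assuming an HA-enabled cluster where the NameNode service IDs `nn1`/`nn2` are defined under `dfs.ha.namenodes.<nameservice>` in hdfs-site.xml:

```shell
# Check which NameNode is currently active vs standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Trigger a graceful manual failover from nn1 to nn2 (fencing applies)
hdfs haadmin -failover nn1 nn2
```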

Unit 12: HBase

  • What are NoSQL databases and why HBase?

  • HBase architecture and understanding its internals
  • Setting up HBase in different modes
  • Working with the HBase shell and table operations
  • Working with HBase's built-in tools & data
  • Setting up and working with the HBase Admin API using an IDE like Eclipse
  • Monitoring an HBase cluster, load balancing & tuning
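The table operations above follow a standard pattern in the HBase shell. A sketch, assuming a running HBase instance (table and column-family names are placeholders):

```shell
# Launch the interactive shell, then run the commands shown as comments:
hbase shell
# create 'users', 'info'                       -- table with one column family
# put 'users', 'row1', 'info:email', 'a@b.com' -- write a single cell
# get 'users', 'row1'                          -- read the row back
# scan 'users'                                 -- full-table scan
# disable 'users'                              -- required before drop
# drop 'users'                                 -- delete the table
```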

Unit 13: Monitoring HDP2 Services/CDH Services

  • Monitoring Architecture

  • Monitoring JVM Processes
  • Understanding JVM Memory

Unit 14: Understanding and working with Kerberos

  • KDC fundamentals
  • Setting up the Kerberos DB and clients, and enabling a Kerberized cluster
  • Administration issues in a Kerberized cluster
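Day-to-day work on a Kerberized cluster revolves around tickets and keytabs. A sketch (the realm, principals, and keytab path are placeholders):

```shell
# Obtain a ticket interactively for an admin principal
kinit hdfs-admin@EXAMPLE.COM

# Service processes authenticate non-interactively via a keytab
kinit -kt /etc/security/keytabs/nn.service.keytab \
  nn/host1.example.com@EXAMPLE.COM

# Inspect the current ticket cache and expiry times
klist
```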

Unit 15: Understanding and working with Knox

  • What is the Apache Knox API Gateway?
  • The Knox Gateway and its integration with the Hadoop ecosystem
  • Configuring new services and federation
  • Auditing via Knox

Unit 16: Understanding and working with Apache Ranger (HDP) or Sentry (CDH)

  • Understanding security administration and access controls

  • Working with the REST API and the central UI
  • Understanding Ranger KMS
  • A brief on auditing of user access and administration
  • Best practices and real-world scenarios

Unit 17: Summary | FAQs