Course Code:
aha
Duration:
28 hours
Prerequisites:
Basic Hadoop Administration
Course Outline:
Unit 1: Introduction: What is Big Data & Why Big Data?
- Most Common New Types of Data
- What is Hadoop & Hadoop Ecosystem
- Traditional Systems vs. Hadoop
- Overview of a Hadoop Cluster
- Comparison of different distributions
- Understanding Cloudera Distribution of Hadoop: CDH
- Understanding Hortonworks Data Platform: HDP
- Understanding Apache Hadoop Distribution
- Real-world Use Cases
Unit 2: Understanding Hadoop Services
- Understanding HDFS Architecture
- Understanding Cloudera Manager
- Understanding Ambari.
- A brief on HDFS clients, edge nodes, and master & worker nodes.
- Understanding reads and writes on HDFS and related issues (see the example below).
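A minimal command sequence illustrating the write and read path from a client (the directory and file name are placeholders):

    hdfs dfs -mkdir -p /user/demo                     # create a directory in HDFS
    hdfs dfs -put localfile.txt /user/demo/           # write: the client streams blocks to DataNodes
    hdfs dfs -cat /user/demo/localfile.txt            # read: the client fetches blocks back
    hdfs fsck /user/demo/localfile.txt -files -blocks -locations   # inspect block placement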
Unit 3: Installation Prerequisites and Planning
- Minimum Hardware Requirements
- Minimum Software Requirements
- Understanding variables in the context of cluster setup & a brief on cluster planning.
- A quick comparison of features/commands/scripts in Hadoop 1.x/2.x
Unit 4: Configuring Hadoop – Setting Up a Hadoop Cluster (Apache & CDH/HDP)
- Configuration considerations & important configs for Hadoop & its services.
- Understanding internals of a running cluster: Block Pool ID, Cluster ID, Namespace ID.
- Simulating scenarios of over-/under-replicated and corrupt blocks; understanding replication rules, different replication possibilities, and rack awareness.
- The SecondaryNameNode's checkpointing process.
- Building simple monitoring scripts, automation possibilities, tasks, and capturing information.
- Working with a Hadoop cluster:
- Commissioning/decommissioning of nodes in the cluster
- Stopping, starting, adding & removing HDP/CDH services
- Using HDFS commands & services (see the example below)
- Accessing services through a UI or web interface such as Hue/Ambari Views
- Analyzing problems and resolving them: some examples from live production environments
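A sketch of the day-to-day administration commands this unit exercises (the file path is a placeholder; for -refreshNodes to take effect, the include/exclude files referenced by the dfs.hosts / dfs.hosts.exclude properties must already be configured):

    hdfs fsck /                                       # health summary: under-/over-replicated and corrupt blocks
    hdfs dfs -setrep -w 2 /user/demo/localfile.txt    # change one file's replication factor and wait for it
    hdfs dfsadmin -report                             # live/dead DataNodes, capacity, usage
    hdfs dfsadmin -refreshNodes                       # apply include/exclude lists when (de)commissioning nodes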
Unit 5: Understanding Computation Frameworks
- Understanding YARN and MapReduce
- Lifecycle of a YARN application & its components.
- Understanding the in-memory computing framework: Spark
- Spark Internals and architecture.
- Configuring YARN
- Setup of Apache Spark standalone and its integration with YARN.
- Running & troubleshooting MapReduce jobs
- Running jobs for cluster stress testing and performance testing (see the example below)
- Running jobs through a terminal or a web-based UI such as Hue.
- Running Spark applications and troubleshooting
- Performance tuning considerations.
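One common stress test is TeraGen/TeraSort from the examples jar that ships with every distribution; the jar names and paths below are typical but vary by distribution and version:

    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000 /tmp/teragen
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /tmp/teragen /tmp/terasort
    # a sample Spark application submitted to YARN
    spark-submit --master yarn --deploy-mode cluster \
        --class org.apache.spark.examples.SparkPi \
        /usr/lib/spark/examples/jars/spark-examples.jar 100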
Unit 6: Job Schedulers
- Overview of Job Scheduling
- The Built-in Schedulers
- Understanding FIFO, Fair Scheduler and Capacity Scheduler.
- Configuring the schedulers, queues, and resource allocation.
- Configuring capacity & user limits (a sample queue configuration follows)
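A minimal capacity-scheduler.xml fragment splitting the root queue into two queues (queue names and percentages are illustrative); after editing, yarn rmadmin -refreshQueues applies the change without restarting the ResourceManager:

    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>default,analytics</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.default.capacity</name>
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.capacity</name>
      <value>40</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.user-limit-factor</name>
      <value>2</value>
    </property>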
Unit 7: Ensuring Data Integrity
- Limiting data volume and object counts using quotas (see the example below)
- Limiting read/write access to the cluster
- Understanding ACLs: for services and MapReduce jobs
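For example (the directory and user names are placeholders; HDFS ACLs require dfs.namenode.acls.enabled=true):

    hdfs dfsadmin -setQuota 100000 /user/projectA        # cap the number of file and directory names
    hdfs dfsadmin -setSpaceQuota 1t /user/projectA       # cap raw space; this counts all replicas
    hdfs dfs -count -q /user/projectA                    # show quota limits and remaining headroom
    hdfs dfs -setfacl -m user:analyst:r-x /user/projectA # grant read access via an HDFS ACL
    hdfs dfs -getfacl /user/projectA                     # list the ACL entries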
Unit 8: Data Ingestion & available tools
- Distributed Copy (distcp)
- Ingesting Structured Data: Sqoop (see the example below)
- Ingesting Streaming Data: Flume
- Understanding Different Data Formats.
- Brief on Apache Kafka and Internals.
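Typical ingestion one-liners (NameNode/database host names, credentials, and table names are placeholders):

    # copy a directory between clusters with a MapReduce job
    hadoop distcp hdfs://nn1:8020/data/events hdfs://nn2:8020/backup/events
    # pull an RDBMS table into HDFS with Sqoop (-P prompts for the password)
    sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
        --table orders --target-dir /user/etl/orders --num-mappers 4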
Unit 9: Hive & Impala
- Introduction to Hive & Impala
- Comparing Hive with RDBMS
- Hive Services: CLI, Metastore & HiveServer2
- Hive setup in different ways and working with the CLI & clients
- Working with Hive SQL statements & usage of collection/primitive datatypes
- Working with Hive structures, functions, joins, and various Hive properties.
- Understanding partitioning & bucketing (see the example below)
- Hive/MR versus Hive/Tez
- Understanding compression and different formats supported
- Hive Security & tuning.
- Working with Impala.
- Some real-world issues.
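A small HiveQL sketch of a partitioned, bucketed ORC table (table and column names are illustrative):

    CREATE TABLE sales (id BIGINT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC;

    -- load it with dynamic partitioning from a staging table
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE sales PARTITION (sale_date)
    SELECT id, amount, sale_date FROM staging_sales;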
Unit 10: Workflow scheduling using Oozie
- Oozie Overview & its Components
- Jobs, Workflows, Coordinators, Bundles
- Workflow Actions and Decisions
- Oozie Actions & Job Submission
- Oozie Server: Workflow & Coordinator
- Oozie Console & Interfaces to Oozie
- Oozie Server Configuration
- Oozie Scripts & Using the Oozie CLI
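Common Oozie CLI calls (the server URL and job id are placeholders; 11000 is the default Oozie server port):

    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run   # submit & start a workflow
    oozie job -oozie http://oozie-host:11000/oozie -info <job-id>                # check its status
    oozie jobs -oozie http://oozie-host:11000/oozie -jobtype coordinator         # list coordinator jobs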
Unit 11: NameNode High Availability & possibilities
- HDFS HA components & how they work
- Understanding setup and failovers.
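For a manual state check or failover (nn1/nn2 stand for the configured NameNode IDs; note that -failover is refused when automatic failover via ZKFC is enabled):

    hdfs haadmin -getServiceState nn1    # active or standby?
    hdfs haadmin -failover nn1 nn2       # make nn2 active and nn1 standby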
Unit 12: HBase
- What are NoSQL databases & why HBase
- HBase architecture and understanding its internals.
- Setup of HBase in different modes
- Working with the HBase shell and table operations (see the example below).
- Working with HBase built-in tools & data.
- Setup and working with the HBase Admin API using an IDE such as Eclipse.
- Monitoring an HBase cluster, load balancing & tuning.
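A minimal HBase shell session (table, column family, and row names are illustrative):

    hbase shell
    create 'users', 'info'                       # table with one column family
    put 'users', 'row1', 'info:name', 'alice'    # write one cell
    get 'users', 'row1'                          # read it back
    scan 'users'                                 # full-table scan
    disable 'users'; drop 'users'                # tables must be disabled before dropping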
Unit 13: Monitoring HDP2 Services/CDH Services
- Monitoring Architecture
- Monitoring JVM Processes
- Understanding JVM Memory
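Handy JDK tools for watching a Hadoop daemon's JVM (the pid is a placeholder; jmap -heap is a JDK 8 tool, replaced by jhsdb in later JDKs):

    jps -lv                    # list JVM processes (NameNode, DataNode, ResourceManager, ...)
    jstat -gcutil <pid> 5s     # sample GC and heap-space utilisation every 5 seconds
    jmap -heap <pid>           # heap configuration and current usage summary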
Unit 14: Understanding and working with Kerberos
- KDC fundamentals
- Setting up the Kerberos database and clients, and enabling a Kerberized cluster (see the example below)
- Administration issues in a Kerberized cluster.
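A minimal MIT Kerberos sequence for a service principal (realm, host name, and keytab name are placeholders):

    kadmin.local -q "addprinc -randkey nn/master1.example.com@EXAMPLE.COM"        # create the principal
    kadmin.local -q "xst -k nn.service.keytab nn/master1.example.com@EXAMPLE.COM" # export its keytab
    kinit -kt nn.service.keytab nn/master1.example.com@EXAMPLE.COM                # obtain a TGT from the keytab
    klist                                                                         # verify the ticket cache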
Unit 15: Understanding and working with Knox
- What is the Apache Knox API Gateway
- The Knox Gateway and its integration with the Hadoop ecosystem (see the example below).
- Configuring new services and Federation
- Auditing by Knox
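A sample REST call routed through the gateway (host, credentials, and the 'default' topology are placeholders; 8443 is the usual gateway port, and -k skips TLS verification for a self-signed certificate):

    curl -iku admin:admin-password \
        'https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'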
Unit 16: Understanding and working with Apache Ranger(HDP) or Sentry(CDH)
- Understanding security administration and access controls.
- Working with the REST API and the central UI (see the example below)
- Understanding Ranger KMS.
- A brief on auditing of user access and administration
- Best practices and real-world scenarios.
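Ranger policies can be inspected over its public REST API as well as in the admin UI (host and credentials are placeholders; 6080 is the default Ranger Admin port):

    curl -u admin:password 'http://ranger-host:6080/service/public/v2/api/policy'   # list all policies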
Unit 17: Summary | FAQs