Course Code: aha
Duration: 28 hours
Prerequisites:

Basic Hadoop Administration

Course Outline:

Unit 1: Introduction: What is Big Data & Why Big Data?

  • Most Common New Types of Data

  • What is Hadoop & Hadoop Ecosystem
  • Traditional Systems vs. Hadoop
  • Overview of a Hadoop Cluster 
  • Comparison of different distributions
  • Understanding the Cloudera Distribution of Hadoop (CDH)
  • Understanding the Hortonworks Data Platform (HDP)
  • Understanding the Apache Hadoop Distribution
  • Real-world Use Cases

Unit 2: Understanding Hadoop Services

  • Understanding HDFS Architecture
  • Understanding Cloudera Manager
  • Understanding Ambari
  • A brief on HDFS clients, edge nodes, and master & worker nodes
  • Understanding reads and writes on HDFS and related issues
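The HDFS behavior covered in this unit is governed by a handful of core properties. A minimal illustrative hdfs-site.xml fragment (the local paths are placeholders, not recommendations):

```xml
<!-- Illustrative hdfs-site.xml fragment; paths are placeholders -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replication factor for new files -->
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB HDFS block size -->
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hdfs/namenode</value> <!-- NameNode metadata directory -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdfs/datanode</value> <!-- DataNode block storage -->
</property>
```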

Unit 3: Installation Prerequisites and Planning

  • Minimum Hardware Requirements
  • Minimum Software Requirements
  • Understanding variables in the context of cluster setup & a brief on cluster planning
  • A quick comparison of features/commands/scripts in Hadoop 1.x/2.x

Unit 4: Configuring Hadoop – Setting Up a Hadoop Cluster (Apache & CDH/HDP)

  • Configuration considerations & important configs for Hadoop & its services

  • Understanding internals of a running cluster: Block Pool ID, Cluster ID, Namespace ID (NSID)
  • Simulating over-/under-replicated and corrupt block scenarios; understanding replication rules, different replication possibilities, and rack awareness
  • The Secondary NameNode's checkpointing process
  • Building simple monitoring scripts; automation possibilities, tasks, and capturing information
  • Working with a Hadoop cluster:
    Commissioning/Decommissioning nodes in the cluster
    Stopping, starting, adding & removing HDP/CDH services
    Using HDFS commands & services
    Accessing services through a UI or web interface such as Hue or Ambari Views
  • Analyzing and resolving problems: examples from live production environments
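The replication and health-check exercises above map onto a few standard HDFS commands. A sketch, assuming a running CDH/HDP cluster (the file path is a placeholder):

```shell
# Report overall cluster capacity, live/dead DataNodes, and block status
hdfs dfsadmin -report

# Check a path for missing, corrupt, or under-replicated blocks
hdfs fsck / -files -blocks -locations

# Deliberately change one file's replication factor to observe the
# NameNode repairing under-/over-replication (-w waits for completion)
hdfs dfs -setrep -w 2 /tmp/sample.txt
```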

Unit 5: Understanding Computation Frameworks

  • Understanding YARN and MapReduce

  • Lifecycle of a YARN application & its components
  • Understanding the in-memory computing framework: Spark
  • Spark internals and architecture
  • Configuring YARN
  • Setting up Apache Spark standalone and integrating it with YARN
  • Running & troubleshooting MapReduce jobs
  • Running jobs for cluster stress testing and performance testing
  • Running jobs through the terminal or a web-based UI like Hue
  • Running Spark applications and troubleshooting
  • Performance tuning considerations
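The stress-testing and Spark-on-YARN topics above can be exercised with the example jars that ship with Hadoop and Spark. A sketch, assuming a running cluster and standard installation paths (output directories and row counts are placeholders):

```shell
# Stress-test MapReduce and HDFS I/O with TeraGen (1M rows here)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 1000000 /tmp/teragen-out

# Submit a bundled Spark example on YARN in cluster mode
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```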

Unit 6: Job Schedulers

  • Overview of Job Scheduling

  • The Built-in Schedulers
  • Understanding FIFO, the Fair Scheduler and the Capacity Scheduler
  • Configuring the schedulers, queues and resource allocation
  • Configuring capacity & user limits
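Queue capacities and user limits for the Capacity Scheduler live in capacity-scheduler.xml. An illustrative fragment with two hypothetical queues:

```xml
<!-- Illustrative capacity-scheduler.xml fragment; queue names are hypothetical -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value> <!-- two top-level queues under root -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- prod gets 70% of cluster resources -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
  <value>50</value> <!-- dev may elastically grow up to 50% -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
  <value>2</value> <!-- one user may take up to 2x the per-user share -->
</property>
```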

Unit 7: Ensuring Data Integrity

  • Limiting data volume and object counts using quotas

  • Limiting read/write access to the cluster
  • Understanding ACLs: for services and MapReduce jobs
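HDFS quotas are set with `hdfs dfsadmin`. A sketch, assuming superuser access on a running cluster (the directory is a placeholder):

```shell
# Cap the number of names (files + directories) under a path
hdfs dfsadmin -setQuota 100000 /user/projectx

# Cap the raw disk space consumed (counts all replicas)
hdfs dfsadmin -setSpaceQuota 1t /user/projectx

# Inspect current quotas and usage in human-readable units
hdfs dfs -count -q -h /user/projectx
```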

Unit 8: Data Ingestion & Available Tools

  • Distributed Copy (distcp)  

  • Ingesting Structured Data: Sqoop

  • Ingesting Streaming Data: Flume
  • Understanding different data formats
  • A brief on Apache Kafka and its internals
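The DistCp and Sqoop topics above reduce to short command lines. A sketch, assuming reachable clusters and database (hostnames, tables, and paths are placeholders):

```shell
# Copy a dataset between two clusters with DistCp
hadoop distcp \
  hdfs://nn1.example.com:8020/data/events \
  hdfs://nn2.example.com:8020/data/events

# Import a relational table into HDFS with Sqoop
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```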

Unit 9: Hive & Impala

  • Introduction to Hive & Impala

  • Comparing Hive with an RDBMS
  • Hive services: CLI, Metastore & HiveServer2
  • Setting up Hive in different ways and working with the CLI & clients
  • Working with HiveQL statements & usage of collection/primitive datatypes
  • Working with Hive structures, functions, joins, and various Hive properties
  • Understanding partitioning and bucketing
  • Hive on MR versus Hive on Tez
  • Understanding compression and the different formats supported
  • Hive security & tuning
  • Working with Impala
  • Some real-world issues
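Partitioning and storage formats come together in table DDL. An illustrative HiveQL sketch (table and column names are hypothetical; `INSERT ... VALUES` assumes Hive 0.14+):

```sql
-- Partitioned, ORC-backed table: partition pruning on dt avoids full scans
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Write one row into a single partition, then query with a pruning predicate
INSERT INTO sales PARTITION (dt = '2024-01-01') VALUES (1, 99.50);
SELECT SUM(amount) FROM sales WHERE dt = '2024-01-01';
```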

Unit 10: Workflow scheduling using Oozie

  • Oozie Overview & its Components

  • Jobs, Workflows, Coordinators, Bundles
  • Workflow Actions and Decisions
  • Oozie Actions & Job Submission
  • The Oozie Server & Workflow Coordinator
  • The Oozie Console & Interfaces to Oozie
  • Oozie Server Configuration
  • Oozie Scripts & Using the Oozie CLI
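A workflow is defined in a workflow.xml placed in HDFS. A minimal illustrative example with a single shell action (the app name, script, and `${...}` parameters are hypothetical and would be supplied via job.properties):

```xml
<!-- Minimal Oozie workflow: start -> one shell action -> end/kill -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="cleanup"/>
  <action name="cleanup">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanup.sh</exec>
      <file>${wfRoot}/cleanup.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Cleanup failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```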

Unit 11: NameNode High Availability & possibilities

  • HDFS HA components & how they work

  • Understanding setup and failovers
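Failover state can be inspected and driven from the command line. A sketch, assuming an HA-enabled cluster where the NameNode service IDs `nn1`/`nn2` are defined under `dfs.ha.namenodes.<nameservice>` in hdfs-site.xml:

```shell
# Check which NameNode is currently active vs standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Trigger a graceful manual failover from nn1 to nn2 (fencing applies)
hdfs haadmin -failover nn1 nn2
```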

Unit 12: HBase

  • What are NoSQL databases and why HBase?

  • HBase architecture and understanding its internals
  • Setting up HBase in different modes
  • Working with the HBase shell and table operations
  • Working with HBase's built-in tools & data
  • Setting up and working with the HBase Admin API using an IDE like Eclipse
  • Monitoring an HBase cluster, load balancing & tuning
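The table operations above follow a standard pattern in the HBase shell. A sketch, assuming a running HBase instance (table and column-family names are placeholders):

```shell
# Launch the interactive shell, then run the commands shown as comments:
hbase shell
# create 'users', 'info'                       -- table with one column family
# put 'users', 'row1', 'info:email', 'a@b.com' -- write a single cell
# get 'users', 'row1'                          -- read the row back
# scan 'users'                                 -- full-table scan
# disable 'users'                              -- required before drop
# drop 'users'                                 -- delete the table
```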

Unit 13: Monitoring HDP2 Services/CDH Services

  • Monitoring Architecture

  • Monitoring JVM Processes
  • Understanding JVM Memory

Unit 14: Understanding and working with Kerberos

  • KDC fundamentals
  • Setting up the Kerberos DB and clients, and enabling a Kerberized cluster
  • Administration issues in a Kerberized cluster
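Day-to-day work on a Kerberized cluster revolves around tickets and keytabs. A sketch (the realm, principals, and keytab path are placeholders):

```shell
# Obtain a ticket interactively for an admin principal
kinit hdfs-admin@EXAMPLE.COM

# Service processes authenticate non-interactively via a keytab
kinit -kt /etc/security/keytabs/nn.service.keytab \
  nn/host1.example.com@EXAMPLE.COM

# Inspect the current ticket cache and expiry times
klist
```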

Unit 15: Understanding and working with Knox

  • What is the Apache Knox API Gateway?
  • The Knox Gateway and its integration with the Hadoop ecosystem
  • Configuring new services and federation
  • Auditing via Knox

Unit 16: Understanding and working with Apache Ranger (HDP) or Sentry (CDH)

  • Understanding security administration and access controls

  • Working with the REST API and the central UI
  • Understanding Ranger KMS
  • A brief on auditing of user access and administration
  • Best practices and real-world scenarios

Unit 17: Summary | FAQs