Course Code: hdpded
Duration: 35 hours
Prerequisites:
  • An understanding of Hadoop and big data
  • An understanding of Spark
  • Familiarity with the command line
  • System administration experience
Overview:

Hortonworks Data Platform (HDP) is an open-source Apache Hadoop support platform that provides a stable foundation for developing big data solutions on the Apache Hadoop ecosystem.

This instructor-led live training introduces Hortonworks and walks participants through the deployment of a Spark + Hadoop solution.

By the end of this training, participants will be able to:

  • Use Hortonworks to reliably run Hadoop at a large scale
  • Unify Hadoop's security, governance, and operations capabilities with Spark's agile analytic workflows
  • Use Hortonworks to investigate, validate, certify, and support each of the components in a Spark project
  • Process different types of data, including structured, unstructured, in-motion, and at-rest

Audience

  • Hadoop administrators

Format of the course

  • Part lecture, part discussion, exercises and heavy hands-on practice
Course Outline:

Day 1:

On day 1 we start by exploring participants' existing knowledge of Big Data, understanding Hadoop as a framework along with its distributions and versions, reviewing some use cases, and understanding Hadoop internals, components, and architecture.

Unit 1: What is Big Data & why Big Data?

The market for Big Data and current market trends.

Most Common New Types of Data

What is Hadoop & its components.  Comparison of different distributions.

Traditional Systems vs. Hadoop

Introduction to HDP, Ambari and Hadoop 2.x

(The HDP version covered will be decided as per client requirements.)

Overview of a Hadoop Cluster 

The Hortonworks Data Platform

Real time Use Cases

Unit 2: Hadoop ecosystem and services.

Understanding HDFS Architecture

Understanding Block Storage

Understanding Ambari & its architecture.

Understanding why Ambari is needed and how to use it.

A brief on HDFS clients, edge nodes, and master & worker nodes.

Understanding reads and writes on HDFS and related issues.
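
To make block storage and the read/write path concrete, here is a minimal sketch in Python that drives the standard HDFS CLI: it uploads a local file, reads it back, and uses fsck to show how the file was split into blocks and where the replicas landed. The paths and the sample file name are illustrative.

    import subprocess

    def run(cmd):
        # Echo the command, then run it; output goes straight to the console.
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Write a local file into HDFS, then read it back.
    run(["hdfs", "dfs", "-mkdir", "-p", "/user/student/demo"])
    run(["hdfs", "dfs", "-put", "-f", "sample.txt", "/user/student/demo/"])
    run(["hdfs", "dfs", "-cat", "/user/student/demo/sample.txt"])

    # Inspect how the file was split into blocks and where each replica is stored.
    run(["hdfs", "fsck", "/user/student/demo/sample.txt",
         "-files", "-blocks", "-locations"])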

We will also discuss various ways of setting up a Hadoop cluster, understand the prerequisites, discuss cluster planning and related variables, and learn how to set up a Hortonworks cluster on VMs or Amazon Web Services.

Unit 3: Installation Prerequisites and Planning

Minimum Hardware Requirements

Minimum Software Requirements

Understanding the variables in the context of cluster setup & a brief on cluster planning.

Lab 3.1: Setting up the Environment

Lab 3.2: Install HDP 2.x/HDP 3.x Cluster using Ambari

Learning what can be done from the management console and from the terminal.
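
For orientation before Lab 3.2, the sketch below shows the kind of commands used to bring up the Ambari server and register an agent on an already-prepared RHEL/CentOS host; the Ambari repository is assumed to be configured, and the agent's ini file must point at the server. The actual HDP install is then driven from the Ambari web wizard.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    # On the management node: install, configure, and start the Ambari server.
    run(["yum", "install", "-y", "ambari-server"])
    run(["ambari-server", "setup", "-s"])   # -s accepts the defaults (embedded DB, default JDK)
    run(["ambari-server", "start"])

    # On each cluster node: install and start the agent so it registers with the server
    # (the server hostname is set in /etc/ambari-agent/conf/ambari-agent.ini).
    run(["yum", "install", "-y", "ambari-agent"])
    run(["ambari-agent", "start"])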

Day 2:

On day 2 we continue with Hadoop cluster setup, deployment, and management, and work on understanding internals, problem scenarios, and how to resolve issues.

Unit 4: Configuring Hadoop

Configuration Considerations for Hadoop & services.

Understanding the internals of a running cluster: Block Pool ID, Cluster ID, Namespace ID

Simulating scenarios of over-replicated, under-replicated, and corrupt blocks

Understanding replication rules, different replication possibilities, and rack awareness

The Secondary NameNode's checkpointing process

Building simple monitoring scripts, automation possibilities, and capturing task information
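
As one example of the "simple monitoring scripts" mentioned above, the sketch below polls the NameNode's JMX endpoint over HTTP and prints capacity and block-health counters; the NameNode address (HDP 2.x default web port 50070) and an unsecured, non-HA setup are assumptions.

    import requests

    NAMENODE = "http://namenode.example.com:50070"   # assumed NameNode web UI address

    # The NameNode publishes metrics as JSON under /jmx; filter to the FSNamesystem bean.
    url = NAMENODE + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    fs = requests.get(url, timeout=10).json()["beans"][0]

    print("Capacity used (bytes):   ", fs["CapacityUsed"])
    print("Capacity remaining:      ", fs["CapacityRemaining"])
    print("Under-replicated blocks: ", fs["UnderReplicatedBlocks"])
    print("Missing blocks:          ", fs["MissingBlocks"])

A script like this can be run from cron and extended to raise an alert when a counter crosses a threshold.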

Configuration via the Ambari management & monitoring REST API
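
Below is a minimal sketch of reading cluster state through the Ambari REST API from Python; the Ambari address, the admin credentials, and the cluster name are placeholders.

    import requests

    AMBARI  = "http://ambari.example.com:8080"   # assumed Ambari server address
    AUTH    = ("admin", "admin")                 # default credentials; changed on any real cluster
    CLUSTER = "hdpcluster"                       # placeholder cluster name
    HEADERS = {"X-Requested-By": "ambari"}       # header Ambari requires on write calls

    # List each service and its state (STARTED, INSTALLED, ...).
    url = f"{AMBARI}/api/v1/clusters/{CLUSTER}/services?fields=ServiceInfo/state"
    for item in requests.get(url, auth=AUTH, headers=HEADERS).json()["items"]:
        info = item["ServiceInfo"]
        print(info["service_name"], info["state"])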

Working with the Hadoop cluster:

Lab 4.1: Commissioning/decommissioning nodes in the cluster (a decommissioning sketch follows this lab list)

Lab 4.2: Stopping, Starting, adding & removing HDP Services

Lab 4.3: Using HDFS Commands & services.

Lab 4.4: Analyzing problems and resolving them: some examples from live production environments
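
Relating to Lab 4.1: in an Ambari-managed cluster, decommissioning is normally triggered from the Ambari UI, but the underlying HDFS mechanism is an exclude file plus a NameNode refresh. A minimal sketch, where the exclude file path (whatever dfs.hosts.exclude points to) and the hostname are assumptions:

    import subprocess

    EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"   # path referenced by dfs.hosts.exclude (site-specific)
    NODE = "worker03.example.com"                   # placeholder hostname to decommission

    # Add the node to the exclude file, then tell the NameNode to re-read it.
    with open(EXCLUDE_FILE, "a") as f:
        f.write(NODE + "\n")
    subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)

    # Watch progress: the node moves to "Decommission in progress" and then
    # "Decommissioned" once its block replicas have been re-created elsewhere.
    subprocess.run(["hdfs", "dfsadmin", "-report"], check=True)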

A detailed discussion on cluster planning and the choice of services to be part of a running cluster, followed by understanding the processing framework and its comparison with the data processing services.

Unit 5: Discussion on cluster planning

Industry scenarios and a brief on hardware/software recommendations.

Deciding on the right services to be part of your cluster.

Unit 6: Understanding YARN Architecture and MapReduce

What is MapReduce?

What is YARN?

Beyond MapReduce & a brief on services like Apache Spark and Kafka for near-real-time (NRT) analytics

Understanding in-memory computing frameworks: Spark

YARN Use-case

Lifecycle of a YARN application & its components.
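
To connect the application lifecycle to something observable, the sketch below lists the applications known to the ResourceManager and pulls the status and aggregated logs of one of them; the application ID is a placeholder, and log aggregation is assumed to be enabled.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Applications currently tracked by the ResourceManager.
    run(["yarn", "application", "-list"])

    # State, progress, queue, and tracking URL of one application.
    app_id = "application_1700000000000_0001"   # placeholder ID
    run(["yarn", "application", "-status", app_id])

    # Aggregated container logs, available after the application finishes.
    run(["yarn", "logs", "-applicationId", app_id])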

Day 3:

On day 3, we deep-dive into understanding the working of processing frameworks, troubleshooting issues, cluster testing, and working with schedulers and monitoring.

Unit 7: Configuring YARN & MapReduce

Lab 7.1: Running & troubleshooting MapReduce jobs

Lab 7.2: Running jobs for cluster stress testing and performance testing

Lab 7.3: Running jobs through the terminal or a web-based UI like Hue

Job Schedulers

Overview of Job Scheduling

The Built-in Schedulers

Understanding FIFO, Fair Scheduler and Capacity Scheduler.

Configuring the Schedulers and considerations for performance optimization.

Defining Pools & Queues and working with them.

Lab 7.4: Configuring the Capacity Scheduler & the Fair Scheduler

Lab 7.5: Monitoring applications/jobs using the UI or other interfaces
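
As a concrete reference for the scheduler labs above, the properties below sketch a two-queue Capacity Scheduler layout; in HDP these are normally edited through Ambari (or capacity-scheduler.xml), and the queue names and percentages here are illustrative.

    import subprocess

    # Two queues under root sharing capacity 60/40, each with a ceiling it may
    # grow to when the other queue is idle.
    properties = {
        "yarn.scheduler.capacity.root.queues": "etl,adhoc",
        "yarn.scheduler.capacity.root.etl.capacity": "60",
        "yarn.scheduler.capacity.root.etl.maximum-capacity": "80",
        "yarn.scheduler.capacity.root.adhoc.capacity": "40",
        "yarn.scheduler.capacity.root.adhoc.maximum-capacity": "60",
    }
    for name, value in properties.items():
        print(f"{name}={value}")

    # Once the properties are saved (via Ambari or capacity-scheduler.xml),
    # reload the queues without restarting the ResourceManager.
    subprocess.run(["yarn", "rmadmin", "-refreshQueues"], check=True)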

We discuss security aspects, including user management at the HDFS level, working with ACLs, understanding Kerberos, and related use cases.

Unit 8: Ensuring Data Integrity

Ensuring data integrity: replication and block placement

Limiting data size and object counts using quotas

Limiting read/write access to the cluster

Understanding ACLs: for services and MapReduce jobs

Understanding Kerberos

Understanding user administration and controlling access to the cluster.
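
The commands below sketch the quota and ACL mechanics listed above against a placeholder /data/projectA directory with placeholder user and group names; the ACL commands require dfs.namenode.acls.enabled to be true.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    path = "/data/projectA"   # placeholder directory

    # Quotas: cap the number of names (files + directories) and the space usage.
    run(["hdfs", "dfsadmin", "-setQuota", "100000", path])
    run(["hdfs", "dfsadmin", "-setSpaceQuota", "10t", path])
    run(["hdfs", "dfs", "-count", "-q", path])               # show quota usage

    # ACLs: grant access beyond the owner/group/other permission bits.
    run(["hdfs", "dfs", "-setfacl", "-m", "user:analyst1:r-x", path])
    run(["hdfs", "dfs", "-setfacl", "-m", "group:etl:rwx", path])
    run(["hdfs", "dfs", "-getfacl", path])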

We will discuss data movement within and across clusters, data ingestion options and requirements, and safeguarding cluster data with backups and snapshots.

Unit 9: Enterprise Data Movement


Challenges with a Traditional ETL Platform

Data Ingestion & options

Distributed Copy (distcp) Command

Distcp options, using distcp & automation ideas

Lab 9.1: Using distcp to copy data from a remote cluster to one or many clusters
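
A minimal sketch of the distcp usage practiced in this lab; the NameNode addresses and paths are placeholders, and -update/-p/-m are commonly combined options.

    import subprocess

    SRC = "hdfs://nn-prod.example.com:8020/data/events"   # placeholder source cluster path
    DST = "hdfs://nn-dr.example.com:8020/backup/events"   # placeholder target cluster path

    # Copy only new or changed files (-update), preserving block size, replication
    # and permissions (-p), using 20 parallel map tasks (-m 20).
    subprocess.run(["hadoop", "distcp", "-update", "-p", "-m", "20", SRC, DST], check=True)

Scheduling such a command from cron or Oozie is one simple way of automating inter-cluster copies.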

Unit 10: Backup and Recovery

What should you backup?

HDFS snapshots & HDFS data backups

Lab 10.1: Using HDFS Snapshots
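
A minimal sketch of the snapshot workflow exercised in this lab: enable snapshots on a directory, take a named snapshot, and recover a deleted file from the read-only .snapshot path; the directory, snapshot, and file names are placeholders.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    path = "/data/projectA"

    run(["hdfs", "dfsadmin", "-allowSnapshot", path])                 # make the directory snapshottable
    run(["hdfs", "dfs", "-createSnapshot", path, "before-cleanup"])   # take a named snapshot

    # If a file is deleted later, copy it back out of the snapshot.
    run(["hdfs", "dfs", "-cp",
         path + "/.snapshot/before-cleanup/important.csv",
         path + "/important.csv"])

    run(["hdfs", "lsSnapshottableDir"])                               # list snapshottable directories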

Day 4:

On day 4 we learn about web services and interfaces and their usage, data transfer from external sources into HDFS, import/export tools like Sqoop, and data ingestion using Flume & Kafka.

Unit 11: HDFS Web Services, WebHDFS & HUE

Hadoop HDFS over HTTP
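
A minimal sketch of talking to HDFS over HTTP via WebHDFS from Python; the NameNode address (HDP 2.x default HTTP port 50070), the paths, and simple authentication with user.name are assumptions (a Kerberized cluster would use SPNEGO instead).

    import requests

    NAMENODE = "http://namenode.example.com:50070"   # assumed NameNode HTTP address
    USER = "hdfs"                                    # simple-auth user on an unsecured cluster

    # List a directory.
    r = requests.get(f"{NAMENODE}/webhdfs/v1/tmp",
                     params={"op": "LISTSTATUS", "user.name": USER})
    for entry in r.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])

    # Read a file; WebHDFS redirects to a DataNode and requests follows the redirect.
    r = requests.get(f"{NAMENODE}/webhdfs/v1/tmp/sample.txt",
                     params={"op": "OPEN", "user.name": USER})
    print(r.text)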

Unit 12: Transferring Data 

Overview of Sqoop & its capabilities

The Sqoop Import/Export tool

Importing a table, specific columns, or the results of a query & the export tool
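
A minimal sketch of a Sqoop table import into HDFS; the JDBC URL, credentials, table, and target directory are placeholders, and the matching JDBC driver is assumed to be on the Sqoop classpath.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost.example.com/sales",   # placeholder source database
        "--username", "sqoop_user",
        "--password-file", "/user/sqoop/db.password",           # keep the password off the command line
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",                                    # 4 parallel map tasks, split on the primary key
    ], check=True)

The reverse direction uses "sqoop export" with --export-dir pointing at the HDFS data to push back into the database.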

Flume

Flume Introduction & Installing Flume

Flume events, sources, sinks, channels & selectors
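
To make the source/sink/channel terminology concrete, here is a minimal single-agent Flume configuration (netcat source, memory channel, HDFS sink) together with the command that would start it; the agent name, port, and HDFS path are illustrative.

    import subprocess

    # A single agent "a1": netcat source -> memory channel -> HDFS sink.
    conf_lines = [
        "a1.sources  = r1",
        "a1.channels = c1",
        "a1.sinks    = k1",
        "a1.sources.r1.type = netcat",
        "a1.sources.r1.bind = 0.0.0.0",
        "a1.sources.r1.port = 44444",
        "a1.channels.c1.type = memory",
        "a1.channels.c1.capacity = 10000",
        "a1.sinks.k1.type = hdfs",
        "a1.sinks.k1.hdfs.path = /data/flume/events/%Y-%m-%d",
        "a1.sinks.k1.hdfs.fileType = DataStream",
        "a1.sinks.k1.hdfs.useLocalTimeStamp = true",
        "a1.sources.r1.channels = c1",
        "a1.sinks.k1.channel = c1",
    ]
    with open("example.conf", "w") as f:
        f.write("\n".join(conf_lines) + "\n")

    # Start agent a1 with this configuration.
    subprocess.run(["flume-ng", "agent", "--name", "a1",
                    "--conf", "/etc/flume/conf", "--conf-file", "example.conf"],
                   check=True)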

               

We start learning about the data warehousing package Hive, its features & capabilities, working with Hive data & administration, and understanding workflow scheduling using Oozie and alternatives.

Unit 13: Hive Administration

Introduction to Hive

Comparing Hive with RDBMS

Hive services: CLI, Metastore & HiveServer2

Setting up Hive in different ways and working with the CLI & clients

Working with Hive SQL Statements & usage of collection/primitive datatypes

Working with Hive structures: Hive tables, indexes & data

Day 5:

Working with functions, joins, various Hive properties, and optimizations

Understanding partitioning, bucketing

Hive/MR versus Hive/Tez

Understanding compression and different formats supported

Understanding how to work with Hive using IDEs
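
To tie the Hive topics together, the snippet below creates a partitioned, ORC-backed table, loads one partition from a staging table, and runs a query that prunes to that partition, all through beeline against HiveServer2; the JDBC URL, user, and table names are placeholders, and the staging table is assumed to exist.

    import subprocess

    HIVE_URL = "jdbc:hive2://hiveserver.example.com:10000/default"   # placeholder HiveServer2 URL

    hql = """
    CREATE TABLE IF NOT EXISTS web_logs (
      ip  STRING,
      url STRING,
      ts  TIMESTAMP
    )
    PARTITIONED BY (log_date STRING)
    STORED AS ORC;

    -- Static-partition load from a staging table (assumed to exist).
    INSERT INTO web_logs PARTITION (log_date = '2024-01-15')
    SELECT ip, url, ts FROM web_logs_staging WHERE to_date(ts) = '2024-01-15';

    -- Partition pruning: only the 2024-01-15 partition is scanned.
    SELECT count(*) FROM web_logs WHERE log_date = '2024-01-15';
    """

    subprocess.run(["beeline", "-u", HIVE_URL, "-n", "hive", "-e", hql], check=True)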

Unit 14: Workflow scheduling using Oozie

Oozie Overview & its Components

Jobs, Workflows, Coordinators, Bundles

Workflow Actions and Decisions

Oozie actions & job submission

Oozie server & workflow coordinator

Oozie Console & Interfaces to Oozie

Oozie Server Configuration

Oozie Scripts & Using the Oozie CLI

Overview of Azkaban
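
A minimal sketch of submitting and checking a workflow through the Oozie CLI; the Oozie URL is a placeholder, and job.properties is assumed to point at a workflow.xml already deployed to HDFS.

    import subprocess

    OOZIE_URL = "http://oozie.example.com:11000/oozie"   # assumed Oozie server URL

    def oozie(*args):
        cmd = ["oozie"] + list(args) + ["-oozie", OOZIE_URL]
        print("$ " + " ".join(cmd))
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Submit and start the workflow described by job.properties.
    out = oozie("job", "-config", "job.properties", "-run")
    job_id = out.strip().split(":", 1)[-1].strip()   # CLI prints "job: <workflow-job-id>"

    # Check its status and list recent workflow jobs.
    print(oozie("job", "-info", job_id))
    print(oozie("jobs", "-jobtype", "wf", "-len", "10"))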

We start exploring Hadoop 2.x features like High Availability & Federation, how they work and their benefits, and learn about NoSQL databases and HBase.

Unit 15: NameNode HA

NameNode architecture in HDP 1.x

NameNode High Availability & possibilities

HDFS HA components & how they work

Understanding setup and failovers.
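
A minimal sketch of checking and exercising NameNode HA from the command line; the NameNode service IDs nn1 and nn2 come from dfs.ha.namenodes and are placeholders here.

    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Which NameNode is active and which is standby?
    run(["hdfs", "haadmin", "-getServiceState", "nn1"])
    run(["hdfs", "haadmin", "-getServiceState", "nn2"])

    # Trigger a graceful failover from nn1 to nn2. With automatic failover enabled,
    # this is normally left to the ZKFailoverControllers.
    run(["hdfs", "haadmin", "-failover", "nn1", "nn2"])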

Unit 16: HBase

What are NoSQL databases and why HBase

HBase architecture and understanding its internals.

Setting up HBase in different modes

Working with the HBase shell and table operations.

Working with HBase built-in tools & data.

Setting up and working with the HBase Admin API using an IDE like Eclipse.

Monitoring the HBase cluster, load balancing & tuning.
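
A minimal sketch of basic HBase table operations from Python using the happybase client; it assumes the HBase Thrift server is running on the given host, and the table and column family names are illustrative (the same operations can be performed in the hbase shell or through the Java Admin API).

    import happybase

    # Connect through the HBase Thrift server (assumed to be running on this host).
    connection = happybase.Connection("hbase-thrift.example.com", port=9090)

    # Create a table with one column family, then write and read a row.
    if b"users" not in connection.tables():
        connection.create_table("users", {"info": dict()})

    table = connection.table("users")
    table.put(b"user001", {b"info:name": b"Alice", b"info:city": b"Pune"})

    row = table.row(b"user001")
    print(row[b"info:name"], row[b"info:city"])

    # Scan the first few rows.
    for key, data in table.scan(limit=5):
        print(key, data)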

Now that we have covered Hadoop administration and the working of its components, we will look at monitoring and the available tools and services.

Unit 17: Monitoring HDP 2.x Services using Ambari

Monitoring Architecture

Monitoring HDP 2.x clusters

The Ambari web interface

A brief on Ganglia, Nagios & the built-in monitoring services available.

Monitoring JVM

Unit 18: Summary | New features in HDP | FAQs