- An understanding of Hadoop and big data
- An understanding of Spark
- Familiarity with the command line
- System administration experience
Hortonworks Data Platform is an open-source Apache Hadoop support platform that provides a stable foundation for developing big data solutions on the Apache Hadoop ecosystem.
This instructor-led live training introduces Hortonworks and walks participants through the deployment of a Spark + Hadoop solution.
By the end of this training, participants will be able to:
- Use Hortonworks to reliably run Hadoop at a large scale
- Unify Hadoop's security, governance, and operations capabilities with Spark's agile analytic workflows
- Use Hortonworks to investigate, validate, certify, and support each of the components in a Spark project
- Process different types of data, including structured, unstructured, in-motion, and at-rest
Audience
- Hadoop administrators
Format of the course
- Part lecture, part discussion, exercises and heavy hands-on practice
Day 1:
On day 1 we start by exploring participants' existing knowledge of Big Data, then cover Hadoop as a framework, its distributions and versions, some use cases, and Hadoop internals, components, and architecture.
Unit 1: What is Big Data and why Big Data?
The market for Big Data and current market trends
Most Common New Types of Data
What is Hadoop & its components. Comparison of different distributions.
Traditional Systems vs. Hadoop
Introduction to HDP, Ambari and Hadoop 2.x
(The HDP version to be covered will be decided per client requirements)
Overview of a Hadoop Cluster
The Hortonworks Data Platform
Real time Use Cases
Unit 2: Hadoop ecosystem and services.
Understanding HDFS Architecture
Understanding Block Storage
Understanding Ambari & its architecture.
Understanding why Ambari and how to use it.
A brief on HDFS clients, edge nodes, and master & worker nodes.
Understanding reads and writes on HDFS and related issues.
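To make the HDFS write and read paths concrete during this unit, a minimal sketch such as the following can be run on any node with the Hadoop client configured. The local file and HDFS paths are placeholders for the lab environment, not part of the official course material.

```python
import subprocess

# Copy a local file into HDFS (write path), then ask the NameNode
# where its blocks ended up (metadata/read path).
# /tmp/sample.txt and /user/student/... are placeholder paths.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/student"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "/tmp/sample.txt",
                "/user/student/sample.txt"], check=True)

# fsck prints one line per block with its replica locations, which
# illustrates block storage and replication on a running cluster.
report = subprocess.run(
    ["hdfs", "fsck", "/user/student/sample.txt",
     "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)
```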
We will also discuss various ways of setting up a Hadoop cluster, understand the prerequisites, cover cluster planning and related variables, and learn how to set up a Hortonworks cluster on VM machines or Amazon Web Services.
Unit 3: Installation Prerequisites and Planning
Minimum Hardware Requirements
Minimum Software Requirements
Understanding variables in the context of cluster setup and a brief on cluster planning.
Lab 3.1: Setting up the Environment
Lab 3.2: Install HDP 2.x/HDP 3.x Cluster using Ambari
Learning what can be done from the management console and the terminal.
Day 2:
On day 2 we continue with Hadoop cluster setup, deployment, and management, understand cluster internals, and work through problem scenarios and issue resolution.
Unit 4: Configuring Hadoop
Configuration Considerations for Hadoop & services.
Understanding internals of a running cluster: Block Pool ID, Cluster ID, Namespace ID
Simulating scenarios of over-/under-replicated blocks and corrupt blocks; understanding replication rules, different replication possibilities, and rack awareness
The Secondary NameNode's checkpointing process
Building simple monitoring scripts, automating routine tasks, and capturing cluster information (a minimal monitoring-script sketch follows the labs below)
Configuration, management, and monitoring via the Ambari REST API
Working with the Hadoop cluster:
Lab 4.1: Commissioning/Decommissioning of nodes in the cluster
Lab 4.2: Stopping, Starting, adding & removing HDP Services
Lab 4.3: Using HDFS Commands & services.
Lab 4.4: Analyzing problems and resolving them: some examples from live, real-time environments
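As an example of the simple monitoring scripts mentioned in this unit, the sketch below parses the summary section of `hdfs fsck /` for block-health counters. It assumes a node with the HDFS client configured and leaves the alerting mechanism (mail, Nagios, Ambari alerts, etc.) as a plain exit code.

```python
import re
import subprocess

# Run a filesystem check from the NameNode's point of view and pull
# the health counters out of the fsck summary.
fsck = subprocess.run(["hdfs", "fsck", "/"],
                      capture_output=True, text=True, check=True)

counters = {}
for label in ("Under-replicated blocks", "Corrupt blocks", "Missing blocks"):
    match = re.search(rf"{label}:\s+(\d+)", fsck.stdout)
    if match:
        counters[label] = int(match.group(1))

print(counters)

# A real monitoring script would raise an alert here; we simply exit
# non-zero so a scheduler (cron, Oozie, etc.) can flag the failure.
if any(value > 0 for value in counters.values()):
    raise SystemExit(f"HDFS block health check failed: {counters}")
```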
A detailed discussion on cluster planning and the choice of services to include in a running cluster, followed by an understanding of the processing framework and how it compares with data processing services.
Unit 5: Discussion on cluster planning
Industry scenarios and a brief on hardware/software recommendations.
Deciding on the right services to be part of your cluster.
Unit 6: Understanding YARN Architecture and MapReduce
What is MapReduce?
What is YARN?
Beyond MapReduce: a brief on services like Apache Spark and Kafka for near-real-time (NRT) analytics
Understanding in-memory computing frameworks: Spark
YARN Use-case
Lifecycle of a YARN application & its components.
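One convenient way to observe the application lifecycle discussed here is the ResourceManager REST API. The sketch below lists applications and their states; the ResourceManager hostname and the default web port 8088 are assumptions to be replaced with the values from your own cluster.

```python
import requests

# Placeholder ResourceManager address; substitute your cluster's value
# (visible in Ambari under the YARN service).
RM = "http://resourcemanager.example.com:8088"

# Each application moves through NEW -> SUBMITTED -> ACCEPTED -> RUNNING
# -> FINISHED/FAILED/KILLED; the REST API exposes the current state.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"], app["finalStatus"])
```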
Day 3:
On day 3 we take a deep dive into how the processing frameworks work, troubleshooting issues, cluster testing, and working with schedulers and monitoring.
Unit 7: Configuring YARN & MapReduce
Lab 6.1: Running & Troubleshooting MapReduce Jobs
Lab 6.2: Running jobs for cluster stress testing and performance testing
Lab 6.3: Running jobs through the terminal or a web-based UI like Hue
Job Schedulers
Overview of Job Scheduling
The Built-in Schedulers
Understanding FIFO, Fair Scheduler and Capacity Scheduler.
Configuring the Schedulers and considerations for performance optimization.
Defining Pools & Queues and working with them.
Lab 7.1: Configuring the Capacity Scheduler & fair scheduler
Lab 7.2: Monitoring Applications/jobs using UI or other interfaces.
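Besides the web UI, queue configuration and usage can be inspected programmatically. The following sketch queries the ResourceManager's scheduler endpoint (address is a placeholder) and prints each queue's configured and used capacity; the field layout shown corresponds to the Capacity Scheduler.

```python
import requests

# Placeholder ResourceManager address; adjust for your cluster.
RM = "http://resourcemanager.example.com:8088"

# For the Capacity Scheduler, queues are nested under
# scheduler -> schedulerInfo -> queues -> queue.
info = requests.get(f"{RM}/ws/v1/cluster/scheduler", timeout=10).json()
scheduler = info["scheduler"]["schedulerInfo"]
print("Scheduler type:", scheduler.get("type"))

for queue in scheduler.get("queues", {}).get("queue", []):
    print(queue["queueName"],
          "capacity:", queue.get("capacity"),
          "used:", queue.get("usedCapacity"))
```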
We discuss security aspects, including user management at the HDFS level, working with ACLs, and understanding Kerberos and its use cases.
Unit 8: Ensuring Data Integrity
Replication and replica placement
Limiting data size and object counts using quotas
Limiting read/write access to the cluster
Understanding ACLs for services and MapReduce jobs
Understanding Kerberos
Understanding user administration and controlling access to the cluster.
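A minimal sketch of the quota and ACL commands covered in this unit is shown below, driven from Python for scripting purposes. The directory, group name, and limits are placeholders, and HDFS ACLs require dfs.namenode.acls.enabled to be set to true.

```python
import subprocess

def run(*cmd):
    # Small helper that shells out to the HDFS CLI.
    subprocess.run(list(cmd), check=True)

# Limit the number of names (files + directories) and the raw space
# a directory tree may consume; /data/projects is a placeholder path.
run("hdfs", "dfsadmin", "-setQuota", "100000", "/data/projects")
run("hdfs", "dfsadmin", "-setSpaceQuota", "500g", "/data/projects")

# Grant read/execute to an extra group via an HDFS ACL, then verify.
run("hdfs", "dfs", "-setfacl", "-m", "group:analysts:r-x", "/data/projects")
run("hdfs", "dfs", "-getfacl", "/data/projects")

# Quota usage can be reviewed with the count command.
run("hdfs", "dfs", "-count", "-q", "-h", "/data/projects")
```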
We will discuss data movement within and across clusters, data ingestion options and requirements, and safeguarding cluster data with backups and snapshots.
Unit 9: Enterprise Data Movement
Challenges with a Traditional ETL Platform
Data Ingestion & options
Distributed Copy (distcp) Command
Distcp options, using distcp & automation ideas.
Lab 9.1: Use distcp to copy data from a remote cluster to one or many clusters.
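A typical distcp invocation of the kind practiced in this lab is sketched below, wrapped in Python for automation. The NameNode addresses are placeholders; -update copies only missing or changed files, -p preserves attributes such as permissions and replication, and -m caps the number of map tasks used for the copy.

```python
import subprocess

# Placeholder source and target NameNode addresses.
src = "hdfs://source-nn.example.com:8020/data/events"
dst = "hdfs://target-nn.example.com:8020/data/events"

subprocess.run(
    ["hadoop", "distcp", "-update", "-p", "-m", "20", src, dst],
    check=True,
)
```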
Unit 10: Backup and Recovery
What should you back up?
HDFS snapshots and HDFS data backups
Lab 10.1: Using HDFS Snapshots
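The snapshot workflow used in this lab can be scripted as in the minimal sketch below. The directory, snapshot name, and restored file are placeholders; snapshots must first be allowed on a directory by an HDFS administrator.

```python
import subprocess

def run(*cmd):
    subprocess.run(list(cmd), check=True)

# /data/important is a placeholder directory.
run("hdfs", "dfsadmin", "-allowSnapshot", "/data/important")

# Take a named, read-only, point-in-time snapshot.
run("hdfs", "dfs", "-createSnapshot", "/data/important", "before-upgrade")

# Snapshots live under the hidden .snapshot directory and can be listed
# or copied from like any other path, e.g. to restore a deleted file
# (config.xml here is just an example name):
run("hdfs", "dfs", "-ls", "/data/important/.snapshot/before-upgrade")
run("hdfs", "dfs", "-cp",
    "/data/important/.snapshot/before-upgrade/config.xml",
    "/data/important/config.xml")
```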
Day 4:
On day 4 we learn about web services and interfaces and their usage, data transfer from external sources into HDFS, import/export tools like Sqoop, and data ingestion using Flume and Kafka.
Unit 11: HDFS Web Services, WebHDFS & HUE
Hadoop HDFS over HTTP
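The sketch below shows WebHDFS being exercised from Python: listing a directory and reading a file over plain HTTP. The NameNode address, port (50070 is the HDP 2.x default; Hadoop 3 / HDP 3.x clusters typically use 9870), user name, and file path are all assumptions for a simple-auth lab cluster.

```python
import requests

# Placeholder NameNode web address and user for simple authentication.
NN = "http://namenode.example.com:50070"
USER = "student"

# LISTSTATUS returns directory metadata as JSON.
listing = requests.get(
    f"{NN}/webhdfs/v1/user/{USER}",
    params={"op": "LISTSTATUS", "user.name": USER},
    timeout=10,
).json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])

# OPEN redirects the client to a DataNode that streams the file content;
# requests follows the redirect automatically.
data = requests.get(
    f"{NN}/webhdfs/v1/user/{USER}/sample.txt",
    params={"op": "OPEN", "user.name": USER},
    timeout=30,
)
print(data.text[:200])
```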
Unit 12: Transferring Data
Overview of Sqoop & its capabilities
The Sqoop Import/Export tool
Importing a table, specific columns, or the results of a query; the Export tool
Flume
Flume Introduction & Installing Flume
Flume events, sources, sinks, channels & selectors
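A representative Sqoop import of the kind covered in this unit is sketched below. The JDBC connection string, credentials file, table, and target directory are placeholders; using a password file (or -P) rather than a literal password on the command line is the recommended practice.

```python
import subprocess

# Placeholder database, credentials, and HDFS target for the lab.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.db_password",
    "--table", "orders",
    "--columns", "order_id,customer_id,amount",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
], check=True)
```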
We start learning about the data warehousing package Hive, its features and capabilities, working with Hive data and administration, and understanding workflow scheduling using Oozie and its alternatives.
Unit 13: Hive Administration
Introduction to Hive
Comparing Hive with RDBMS
Hive Services: CLI, Metastore & HiveServer2
Hive Setup in different ways and working with CLI & client
Working with Hive SQL Statements & usage of collection/primitive datatypes
Working with Hive Structures : Hive Tables /Indexes & data
Day 5:
Working with functions, joins, various hive properties and optimizations
Understanding partitioning, bucketing
Hive/MR versus Hive/Tez
Understanding compression and different formats supported
Understanding working with Hive using IDEs
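To tie the Hive topics together, the sketch below runs a small piece of DDL through beeline, the JDBC CLI that ships with Hive/HDP. The HiveServer2 URL, database, table, and columns are placeholders; the table is partitioned and ORC-backed, matching the partitioning and file-format discussion above.

```python
import subprocess

# Placeholder HiveServer2 JDBC URL.
JDBC_URL = "jdbc:hive2://hiveserver2.example.com:10000/default"

DDL = """
CREATE DATABASE IF NOT EXISTS training;
CREATE TABLE IF NOT EXISTS training.web_logs (
  ip STRING,
  url STRING,
  response_code INT
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;
"""

# beeline -e executes inline SQL against HiveServer2.
subprocess.run(["beeline", "-u", JDBC_URL, "-e", DDL], check=True)
```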
Unit 14: Workflow scheduling using Oozie
Oozie Overview & its Components
Jobs, Workflows, Coordinators, Bundles
Workflow Actions and Decisions
Oozie Actions, Job Submission
Oozie Server Workflow Coordinator
Oozie Console & Interfaces to Oozie
Oozie Server Configuration
Oozie Scripts & Using the Oozie CLI
Overview of Azkaban
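Submitting a workflow through the Oozie CLI, as practiced in this unit, can be scripted as below. The Oozie server URL is a placeholder, and job.properties is assumed to point oozie.wf.application.path at the HDFS directory containing workflow.xml.

```python
import subprocess

# Placeholder Oozie server URL (11000 is the default Oozie port).
OOZIE_URL = "http://oozie.example.com:11000/oozie"

subprocess.run([
    "oozie", "job",
    "-oozie", OOZIE_URL,
    "-config", "job.properties",
    "-run",
], check=True)

# The job id printed by -run can then be polled from the same CLI:
#   oozie job -oozie <url> -info <job-id>
```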
We start exploring key Hadoop 2.x features like High Availability and Federation, how they work and their benefits, and then learn about NoSQL databases and HBase.
Unit 15: NameNode HA
NameNode Architecture in HDP 1.x
NameNode High Availability & possibilities
HDFS HA components & how they work
Understanding setup and failovers.
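A quick way to check HA state and practice failovers is the haadmin tool, as in the sketch below. The logical NameNode ids nn1 and nn2 are assumptions taken from dfs.ha.namenodes.&lt;nameservice&gt; in hdfs-site.xml; adjust them to your cluster.

```python
import subprocess

# Placeholder NameNode ids; read the real ones from hdfs-site.xml.
for nn in ("nn1", "nn2"):
    state = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", nn],
        capture_output=True, text=True, check=True,
    )
    print(nn, state.stdout.strip())

# A manual, graceful failover between the two NameNodes looks like this
# (only when automatic failover via ZKFC is not enabled):
#   hdfs haadmin -failover nn1 nn2
```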
Unit 16: HBase
What are NoSQL databases and why HBase?
HBase architecture and understanding internals.
Setup of HBase in different modes
Working with the HBase shell and table operations.
Working with HBase built-in tools & data.
Setup and working with the HBase Admin API using an IDE like Eclipse.
Monitoring an HBase cluster, load balancing & tuning.
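The shell and table operations covered here can also be driven non-interactively, as in the minimal sketch below, which pipes a short script into the HBase shell. The table, column family, and row key are placeholders used only for this exercise.

```python
import subprocess

# Create a table with one column family, write a cell, read it back,
# and print a short cluster status summary.
HBASE_COMMANDS = """
create 'training_users', 'info'
put 'training_users', 'row1', 'info:name', 'Alice'
get 'training_users', 'row1'
status 'simple'
"""

# -n runs the HBase shell in non-interactive mode.
subprocess.run(["hbase", "shell", "-n"],
               input=HBASE_COMMANDS, text=True, check=True)
```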
Now that we have covered Hadoop administration and how its components work, we will look at monitoring and the related tools and services.
Unit 17: Monitoring HDP2 Services using Ambari
Monitoring Architecture
Monitoring HDP2 Clusters
Ambari Web Interface
A brief on Ganglia, Nagios & the built-in monitoring services available.
Monitoring JVM
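The same service-state information shown on the Ambari dashboard can be pulled from the Ambari REST API, which also ties back to the REST-based configuration and monitoring mentioned on day 2. In the sketch below, the Ambari address, credentials, and cluster name are placeholders for a lab setup.

```python
import requests

# Placeholder Ambari server address, credentials, and cluster name;
# Ambari listens on port 8080 by default and uses HTTP basic auth.
AMBARI = "http://ambari.example.com:8080"
AUTH = ("admin", "admin")          # lab default; change on real clusters
CLUSTER = "hdp_training"

session = requests.Session()
session.auth = AUTH
session.headers["X-Requested-By"] = "ambari"

# List every service in the cluster together with its current state.
resp = session.get(
    f"{AMBARI}/api/v1/clusters/{CLUSTER}/services",
    params={"fields": "ServiceInfo/state"},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json()["items"]:
    info = item["ServiceInfo"]
    print(info["service_name"], info["state"])
```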
Unit 19: Summary | New features in HDP | FAQs