Certified Hadoop Developer (CHD) (certhadoopdevbes | 21 hours)

Prerequisites:
  • Participants are expected to have a basic understanding of OOP concepts
  • Any previous development or administration experience is a plus
  • Experience with AWS or any other cloud-based environment is a plus

Audience:

  • Developers

Course Outline:

What is Hadoop? A basic introduction

  • Introduction
  • Getting Started
  • Use cases
  • Machine Learning and future directions

ISMAC, Lambda Architecture, AWS

  • Introduction
  • Lambda Architecture Overview and Details
  • Use cases
  • Introduction To Amazon Web Services (AWS)
  • Signup and Billing (Important - Pricing related)
  • Zones and Regions
  • Launch EC2 Instance
  • Simple Storage Service (S3)
  • Log in to an EC2 Instance using PuTTY
  • EC2 AMI (Amazon Machine Image)
  • EC2 Spot Instances
  • Relational Data Service (RDS)
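The console-driven steps above (launching EC2 instances, spot instances, AMIs) can also be scripted. A minimal sketch using boto3, the AWS SDK for Python, assuming it is installed and credentials are configured; the AMI ID and instance type below are placeholder assumptions, not values from the course:

```python
# Sketch: launching an EC2 instance programmatically with boto3.
# The AMI ID, instance type, and region are illustrative placeholders.

def build_run_request(ami_id, instance_type="t2.micro", count=1):
    """Build the keyword arguments for EC2's RunInstances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }

def launch_instance(ami_id, region="us-east-1"):
    """Launch an on-demand EC2 instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**build_run_request(ami_id))
```

Separating the request-building step from the API call keeps the configuration testable without touching AWS.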

Pandas, Python, Github and stats

  • Web Scraping, Regular Expressions, Data Reshaping, Data Cleanup, and Pandas
  • Exploratory Data Analysis
  • Scraping, Pandas, Python, and visualization
  • Pandas, SQL, and the Grammar of Data
  • Statistical Models
  • Probability, Distributions, and Frequentist Statistics
  • Bias and Regression
  • Regression, Logistic Regression: in sklearn and statsmodels
  • Classification, kNN, Cross Validation, Dimensionality Reduction, PCA, and MDS
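The cleanup-then-analyze workflow covered here can be sketched in a few lines of pandas and NumPy; the column names and data below are invented examples, not course material:

```python
# Minimal sketch: data cleanup, exploratory aggregation, and a simple
# least-squares regression. Columns and values are invented for illustration.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", None],
    "temp": ["21", "19", "25", "bad", "18"],
})

# Data cleanup: coerce strings to numbers, drop unusable rows
df = raw.assign(temp=pd.to_numeric(raw["temp"], errors="coerce")).dropna()

# Exploratory aggregation (the "grammar of data": group, then summarize)
means = df.groupby("city")["temp"].mean()

# Simple linear regression with NumPy: least-squares fit of y = a*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
a, b = np.polyfit(x, y, 1)
```

The same group-and-summarize pattern generalizes directly to the SQL-style operations discussed in "Pandas, SQL, and the Grammar of Data".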

HDFS Basics and Cloudera

  • Introduction to HDFS and YARN; MySQL database setup and installation
  • Prepare AWS AMI for Cloudera Installation
  • Cloudera Installation Phases and Paths
  • Cloudera Manager Introduction and Overview
  • Parcels and repository setup with Apache httpd
  • Cloudera Installation Path B with a local repository – AMI preparation
  • Add Cluster, Add Service and Delete Cluster life cycle

Spark, Hive and Pig Introduction

  • In-depth Spark concepts
  • In-depth Hive concepts
  • In-depth Pig concepts
  • In-depth Kafka and Flume concepts
  • In-depth coverage of other Hadoop ecosystem components
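Spark's core idea is a pipeline of map- and reduce-style transformations over distributed data. The classic word count illustrates the shape of such a computation; it is sketched here in plain Python so it runs anywhere (in PySpark the same steps would be expressed with `flatMap`, `map`, and `reduceByKey` on an RDD):

```python
# Word count expressed as map/reduce-style steps, mirroring the Spark model.
from collections import Counter
from itertools import chain

lines = ["hadoop spark hive", "spark kafka", "hive spark"]

# "flatMap": split every line into words, flattening into one stream
words = chain.from_iterable(line.split() for line in lines)

# "map + reduceByKey": count occurrences per key
counts = Counter(words)
```

The value of Spark is that each of these steps runs in parallel across a cluster instead of on one machine.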

Transform, Stage, and Store

  • Convert data stored in HDFS from one set of values or format into another, and write the results back into HDFS
  • Load data from HDFS for use in Spark applications
  • Write the results back into HDFS using Spark
  • Read and write files in a variety of file formats
  • Perform standard extract, transform, load (ETL) processes on data using the Spark API
  • Kafka Integration in Hadoop
  • Advanced Hadoop KPIs
  • Real-time Data Streaming with Analytics
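At its core, the transform step above is a record-level function that Spark applies in parallel. A pure-Python sketch of such a function (in a real job it would be mapped over an RDD from `spark.read`, or expressed as DataFrame column operations); the healthcare-style field layout is an invented example:

```python
# Convert a CSV record of patient vitals into a cleaned tuple; in Spark this
# function would run inside rdd.map(...) across the cluster.
def transform_record(line):
    """Parse 'id,name,temp_f' and convert the temperature to Celsius."""
    pid, name, temp_f = line.strip().split(",")
    temp_c = round((float(temp_f) - 32) * 5 / 9, 1)
    return (pid, name.title(), temp_c)

records = ["p1,alice,98.6", "p2,bob,100.4"]
cleaned = [transform_record(r) for r in records]
```

Keeping the transform a plain function makes the ETL logic unit-testable independently of the cluster.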

Data Analysis

  • Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
  • Use metastore tables as an input source or an output sink for Spark applications
  • Understand the fundamentals of querying datasets in Spark
  • Filter data using Spark
  • Write queries that calculate aggregate statistics
  • Join disparate datasets using Spark
  • Produce ranked or sorted data
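The query patterns above (filters, aggregates, joins, ranked output) are plain SQL, which Spark SQL executes against metastore tables. They are sketched here with Python's built-in sqlite3 so the example runs without a cluster; table and column names are invented, and in PySpark the same query string would be passed to `spark.sql(...)`:

```python
# Aggregate + join + sort in SQL, the same shape a Spark SQL query takes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (patient_id TEXT, cost REAL);
    CREATE TABLE patients (patient_id TEXT, name TEXT);
    INSERT INTO visits VALUES ('p1', 100), ('p1', 50), ('p2', 200);
    INSERT INTO patients VALUES ('p1', 'Alice'), ('p2', 'Bob');
""")

# Total cost per patient, highest first: join, aggregate, then rank by sorting.
rows = conn.execute("""
    SELECT p.name, SUM(v.cost) AS total
    FROM visits v JOIN patients p ON v.patient_id = p.patient_id
    GROUP BY p.name
    ORDER BY total DESC
""").fetchall()
```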

Machine Learning using Python and Spark

  • Learning a Model
  • SVM, Evaluation
  • Decision Trees and Random Forests
  • Classification Redux, Comparing Models
  • Ensemble Methods
  • Best Practices
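The intuition behind ensemble methods is that many weak models voting together can beat any one of them alone. A toy from-scratch sketch of majority voting over one-feature threshold "stumps" (the data and thresholds are invented; libraries such as scikit-learn provide production implementations like `RandomForestClassifier`):

```python
# Toy ensemble: majority voting over weak threshold classifiers ("stumps").
from collections import Counter

def stump(feature_index, threshold):
    """Return a classifier predicting 1 when x[feature_index] > threshold."""
    return lambda x: 1 if x[feature_index] > threshold else 0

def majority_vote(classifiers, x):
    """Predict the class chosen by the most ensemble members."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [stump(0, 0.5), stump(1, 0.5), stump(0, 2.0)]
pred = majority_vote(ensemble, (1.0, 1.0))  # two of three stumps vote 1
```

Random forests follow the same voting idea, but train each tree on a bootstrap sample with random feature subsets to decorrelate the voters.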

Configuration

  • This practical example will familiarize you with every aspect of producing a result, not just writing code
  • Supply command-line options to change your application configuration, such as increasing available memory
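Supplying configuration on the command line can be sketched with Python's argparse; the flag names below are invented examples for a custom application (Spark itself takes tuning options through `spark-submit`, e.g. `--executor-memory`):

```python
# Sketch: reading tuning options such as memory limits from the command line.
# The flags shown are illustrative, not a real Spark interface.
import argparse

def parse_config(argv):
    parser = argparse.ArgumentParser(description="Example job configuration")
    parser.add_argument("--executor-memory", default="2g",
                        help="memory per executor, e.g. 4g")
    parser.add_argument("--partitions", type=int, default=8,
                        help="number of data partitions")
    return parser.parse_args(argv)

cfg = parse_config(["--executor-memory", "4g"])
```

Passing `argv` explicitly (instead of letting argparse read `sys.argv`) keeps the configuration parsing testable.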

Note: The practical example will be healthcare-based, and Spark, Pig, Hive, Kafka, and Flume will be covered in great detail. Please note that we will not dive deep into statistics.