Certified Hadoop Developer (CHD) (certhadoopdevbes | 21 hours)

Prerequisites:
  • Participants are expected to have a basic understanding of OOP concepts
  • Any previous development or administration experience is a plus
  • Experience with AWS or any other cloud-based environment is a plus

Audience:

  • Developers

Course Outline:

What is Hadoop? A basic introduction

  • Introduction
  • Getting Started
  • Use cases
  • Machine Learning and future directions

ISMAC, Lambda Architecture, AWS

  • Introduction
  • Lambda Architecture Overview and Details
  • Use cases
  • Introduction To Amazon Web Services (AWS)
  • Signup and Billing (Important - Pricing related)
  • Zones and Regions
  • Launch EC2 Instance
  • Simple Storage Service (S3)
  • Log in to an EC2 Instance using PuTTY
  • EC2 AMI (Amazon Machine Image)
  • EC2 Spot Instances
  • Relational Data Service (RDS)
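The console-driven steps above (launching EC2 instances, spot instances, AMIs) can also be scripted. A minimal sketch using boto3, the AWS SDK for Python, assuming it is installed and credentials are configured; the AMI ID and instance type below are placeholder assumptions, not values from the course:

```python
# Sketch: launching an EC2 instance programmatically with boto3.
# The AMI ID, instance type, and region are illustrative placeholders.

def build_run_request(ami_id, instance_type="t2.micro", count=1):
    """Build the keyword arguments for EC2's RunInstances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }

def launch_instance(ami_id, region="us-east-1"):
    """Launch an on-demand EC2 instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**build_run_request(ami_id))
```

Separating the request-building step from the API call keeps the configuration testable without touching AWS.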

Pandas, Python, Github and stats

  • Web Scraping, Regular Expressions, Data Reshaping, Data Cleanup, and Pandas
  • Exploratory Data Analysis
  • Scraping, Pandas, Python, and visualization
  • Pandas, SQL, and the Grammar of Data
  • Statistical Models
  • Probability, Distributions, and Frequentist Statistics
  • Bias and Regression
  • Regression, Logistic Regression: in sklearn and statsmodels
  • Classification, kNN, Cross Validation, Dimensionality Reduction, PCA, and MDS
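The cleanup-then-analyze workflow covered here can be sketched in a few lines of pandas and NumPy; the column names and data below are invented examples, not course material:

```python
# Minimal sketch: data cleanup, exploratory aggregation, and a simple
# least-squares regression. Columns and values are invented for illustration.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", None],
    "temp": ["21", "19", "25", "bad", "18"],
})

# Data cleanup: coerce strings to numbers, drop unusable rows
df = raw.assign(temp=pd.to_numeric(raw["temp"], errors="coerce")).dropna()

# Exploratory aggregation (the "grammar of data": group, then summarize)
means = df.groupby("city")["temp"].mean()

# Simple linear regression with NumPy: least-squares fit of y = a*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
a, b = np.polyfit(x, y, 1)
```

The same group-and-summarize pattern generalizes directly to the SQL-style operations discussed in "Pandas, SQL, and the Grammar of Data".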

HDFS Basics and Cloudera

  • Introduction to HDFS and YARN; MySQL database setup and installation
  • Prepare AWS AMI for Cloudera Installation
  • Cloudera Installation Phases and Paths
  • Cloudera Manager Introduction and Overview
  • Parcels and repository setup with Apache httpd
  • Cloudera Installation Path B with a local repository – AMI preparation
  • Add Cluster, Add Service and Delete Cluster life cycle

Spark, Hive and Pig Introduction

  • In-depth Spark concepts
  • In-depth Hive concepts
  • In-depth Pig concepts
  • In-depth Kafka and Flume concepts
  • In-depth coverage of other Hadoop ecosystem components
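Spark's core idea is a pipeline of map- and reduce-style transformations over distributed data. The classic word count illustrates the shape of such a computation; it is sketched here in plain Python so it runs anywhere (in PySpark the same steps would be expressed with `flatMap`, `map`, and `reduceByKey` on an RDD):

```python
# Word count expressed as map/reduce-style steps, mirroring the Spark model.
from collections import Counter
from itertools import chain

lines = ["hadoop spark hive", "spark kafka", "hive spark"]

# "flatMap": split every line into words, flattening into one stream
words = chain.from_iterable(line.split() for line in lines)

# "map + reduceByKey": count occurrences per key
counts = Counter(words)
```

The value of Spark is that each of these steps runs in parallel across a cluster instead of on one machine.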

Transform, Stage, and Store

  • Convert data stored in HDFS from one set of values or format into another, and write the results back into HDFS
  • Load data from HDFS for use in Spark applications
  • Write the results back into HDFS using Spark
  • Read and write files in a variety of file formats
  • Perform standard extract, transform, load (ETL) processes on data using the Spark API
  • Kafka Integration in Hadoop
  • Advanced Hadoop KPIs
  • Real-time Data Streaming with Analytics
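At its core, the transform step above is a record-level function that Spark applies in parallel. A pure-Python sketch of such a function (in a real job it would be mapped over an RDD from `spark.read`, or expressed as DataFrame column operations); the healthcare-style field layout is an invented example:

```python
# Convert a CSV record of patient vitals into a cleaned tuple; in Spark this
# function would run inside rdd.map(...) across the cluster.
def transform_record(line):
    """Parse 'id,name,temp_f' and convert the temperature to Celsius."""
    pid, name, temp_f = line.strip().split(",")
    temp_c = round((float(temp_f) - 32) * 5 / 9, 1)
    return (pid, name.title(), temp_c)

records = ["p1,alice,98.6", "p2,bob,100.4"]
cleaned = [transform_record(r) for r in records]
```

Keeping the transform a plain function makes the ETL logic unit-testable independently of the cluster.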

Data Analysis

  • Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
  • Use metastore tables as an input source or an output sink for Spark applications
  • Understand the fundamentals of querying datasets in Spark
  • Filter data using Spark
  • Write queries that calculate aggregate statistics
  • Join disparate datasets using Spark
  • Produce ranked or sorted data
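The query patterns above (filters, aggregates, joins, ranked output) are plain SQL, which Spark SQL executes against metastore tables. They are sketched here with Python's built-in sqlite3 so the example runs without a cluster; table and column names are invented, and in PySpark the same query string would be passed to `spark.sql(...)`:

```python
# Aggregate + join + sort in SQL, the same shape a Spark SQL query takes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (patient_id TEXT, cost REAL);
    CREATE TABLE patients (patient_id TEXT, name TEXT);
    INSERT INTO visits VALUES ('p1', 100), ('p1', 50), ('p2', 200);
    INSERT INTO patients VALUES ('p1', 'Alice'), ('p2', 'Bob');
""")

# Total cost per patient, highest first: join, aggregate, then rank by sorting.
rows = conn.execute("""
    SELECT p.name, SUM(v.cost) AS total
    FROM visits v JOIN patients p ON v.patient_id = p.patient_id
    GROUP BY p.name
    ORDER BY total DESC
""").fetchall()
```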

Machine Learning using Python and Spark

  • Learning a Model
  • SVM, Evaluation
  • Decision Trees and Random Forests
  • Classification Redux, Comparing Models
  • Ensemble Methods
  • Best Practices
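The intuition behind ensemble methods is that many weak models voting together can beat any one of them alone. A toy from-scratch sketch of majority voting over one-feature threshold "stumps" (the data and thresholds are invented; libraries such as scikit-learn provide production implementations like `RandomForestClassifier`):

```python
# Toy ensemble: majority voting over weak threshold classifiers ("stumps").
from collections import Counter

def stump(feature_index, threshold):
    """Return a classifier predicting 1 when x[feature_index] > threshold."""
    return lambda x: 1 if x[feature_index] > threshold else 0

def majority_vote(classifiers, x):
    """Predict the class chosen by the most ensemble members."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [stump(0, 0.5), stump(1, 0.5), stump(0, 2.0)]
pred = majority_vote(ensemble, (1.0, 1.0))  # two of three stumps vote 1
```

Random forests follow the same voting idea, but train each tree on a bootstrap sample with random feature subsets to decorrelate the voters.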

Configuration

  • This practical example will familiarize you with every aspect of producing a result, not just writing code
  • Supply command-line options to change your application configuration, such as increasing available memory
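Supplying configuration on the command line can be sketched with Python's argparse; the flag names below are invented examples for a custom application (Spark itself takes tuning options through `spark-submit`, e.g. `--executor-memory`):

```python
# Sketch: reading tuning options such as memory limits from the command line.
# The flags shown are illustrative, not a real Spark interface.
import argparse

def parse_config(argv):
    parser = argparse.ArgumentParser(description="Example job configuration")
    parser.add_argument("--executor-memory", default="2g",
                        help="memory per executor, e.g. 4g")
    parser.add_argument("--partitions", type=int, default=8,
                        help="number of data partitions")
    return parser.parse_args(argv)

cfg = parse_config(["--executor-memory", "4g"])
```

Passing `argv` explicitly (instead of letting argparse read `sys.argv`) keeps the configuration parsing testable.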

Note: The practical example will be healthcare-based, and Spark, Pig, Hive, Kafka, and Flume will be covered in great detail. Please note that we will not dive deep into statistics.