Certified Hadoop Developer (CHD) (certhadoopdevbes | 21 hours)
Prerequisites:
- Participants are expected to have a basic understanding of OOP concepts
- Any previous development or administration experience is a plus
- Experience with AWS or another cloud-based environment is a plus
Audience:
- Developers
Course Outline:
What is Hadoop? A Basic Introduction
- Introduction
- Getting Started
- Use cases
- Machine Learning and the Future
ISMAC, Lambda Architecture, AWS
- Introduction
- Lambda Architecture Overview and Details
- Use cases
- Introduction To Amazon Web Services (AWS)
- Signup and Billing (important: pricing-related)
- Zones and Regions
- Launch an EC2 Instance (see the sketch after this section)
- Simple Storage Service (S3)
- Log in to an EC2 Instance using PuTTY
- EC2 AMI (Amazon Machine Image)
- EC2 Spot Instances
- Relational Database Service (RDS)
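A minimal sketch of launching an EC2 instance and uploading a file to S3 with boto3, the AWS SDK for Python. The AMI ID, instance type, bucket, and file names below are placeholder assumptions; the course itself walks through these steps interactively.

```python
import boto3

# Assumes AWS credentials are already configured (e.g., via `aws configure`).
ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single instance; the AMI ID here is a hypothetical placeholder.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print("Launched:", response["Instances"][0]["InstanceId"])

# Upload a local file to S3; bucket and key names are placeholders.
s3 = boto3.client("s3")
s3.upload_file("data.csv", "my-example-bucket", "raw/data.csv")
```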
Pandas, Python, GitHub, and Statistics
- Web scraping, regular expressions, data reshaping, data cleanup, and Pandas
- Exploratory Data Analysis
- Scraping, Pandas, Python, and visualization
- Pandas, SQL, and the Grammar of Data
- Statistical Models
- Probability, Distributions, and Frequentist Statistics
- Bias and Regression
- Regression and logistic regression in sklearn and statsmodels
- Classification, kNN, cross-validation, dimensionality reduction, PCA, and MDS (see the sketch below)
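A minimal sketch of the kind of sklearn workflow this module covers: a PCA-plus-logistic-regression pipeline evaluated with cross-validation. The dataset and parameter choices are illustrative assumptions, not course materials.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features, reduce dimensionality with PCA, then classify.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),        # illustrative component count
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation, as covered in the module.
scores = cross_val_score(model, X, y, cv=5)
print("Mean accuracy: %.3f" % scores.mean())
```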
HDFS Basics and Cloudera
- Introduction to HDFS and YARN; MySQL database setup and installation
- Prepare an AWS AMI for Cloudera installation
- Cloudera Installation Phases and Paths
- Cloudera Manager Introduction and Overview
- Parcels and repository setup with Apache httpd
- Cloudera Installation Path B with a local repository: AMI preparation
- Add Cluster, Add Service, and Delete Cluster lifecycle
Spark, Hive and Pig Introduction
- In-depth Spark concepts
- In-depth Hive concepts
- In-depth Pig concepts
- In-depth Kafka and Flume concepts (see the sketch after this list)
- In-depth coverage of other Hadoop ecosystem components
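As a first taste of the Kafka material, here is a minimal producer/consumer sketch using the kafka-python package; the broker address and topic name are placeholder assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # placeholder broker address
TOPIC = "patient-events"    # placeholder topic name

# Send a few messages to the topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=("event %d" % i).encode("utf-8"))
producer.flush()

# Read them back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.value.decode("utf-8"))
```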
Transform, Stage, and Store
- Convert a set of data values in a given format stored in HDFS into new data values or a new data format and write them back into HDFS (see the ETL sketch after this list)
- Load data from HDFS for use in Spark applications
- Write the results back into HDFS using Spark
- Read and write files in a variety of file formats
- Perform standard extract, transform, load (ETL) processes on data using the Spark API
- Kafka Integration in Hadoop
- Advanced Hadoop KPI
- Real-time Data Streaming with Analytics
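A minimal PySpark sketch of the ETL pattern this module practices: read a CSV from HDFS, transform it, and write the result back in a different format (Parquet). The HDFS paths and column names are placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read raw CSV data from HDFS; the path is a placeholder.
raw = spark.read.csv("hdfs:///data/raw/patients.csv",
                     header=True, inferSchema=True)

# Standard transform step: clean a column and derive a new one.
cleaned = (raw
           .withColumn("name", F.trim(F.col("name")))
           .withColumn("age_group",
                       F.when(F.col("age") < 18, "minor").otherwise("adult")))

# Write the result back to HDFS in a new format (Parquet).
cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/patients")
```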
Data Analysis
- Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
- Use metastore tables as an input source or an output sink for Spark applications
- Understand the fundamentals of querying datasets in Spark
- Filter data using Spark
- Write queries that calculate aggregate statistics
- Join disparate datasets using Spark
- Produce ranked or sorted data (see the sketch after this list)
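A minimal Spark SQL sketch covering the query patterns listed above: filtering, aggregation, a join, and ordering. Table names, column names, and the inline data are placeholder assumptions; with enableHiveSupport(), the same queries run against Hive metastore tables.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL use the Hive metastore.
spark = (SparkSession.builder
         .appName("analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder data; in the course these would be metastore tables.
visits = spark.createDataFrame(
    [(1, "cardiology", 250.0), (2, "oncology", 900.0), (1, "cardiology", 300.0)],
    ["patient_id", "department", "cost"])
visits.createOrReplaceTempView("visits")

patients = spark.createDataFrame([(1, "Ann"), (2, "Raj")],
                                 ["patient_id", "name"])
patients.createOrReplaceTempView("patients")

# Join, filter, aggregate, and rank in a single query.
report = spark.sql("""
    SELECT p.name,
           COUNT(*)    AS visit_count,
           AVG(v.cost) AS avg_cost
    FROM visits v
    JOIN patients p ON v.patient_id = p.patient_id
    WHERE v.cost > 100
    GROUP BY p.name
    ORDER BY avg_cost DESC
""")
report.show()
```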
Machine Learning Using Python and Spark
- Learning a Model
- SVM, Evaluation
- Decision Trees and Random Forests (see the sketch after this list)
- Classification Redux, Comparing Models
- Ensemble Methods
- Best Practices
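A minimal Spark MLlib sketch of training and evaluating a random forest classifier, matching the topics above. The feature columns and inline data are illustrative assumptions; for brevity it evaluates on the training data, where a real workflow would hold out a test split.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Tiny illustrative dataset: two features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.1, 0), (1.5, 0.3, 1), (0.2, 0.9, 0), (1.7, 0.1, 1)],
    ["f1", "f2", "label"])

# Assemble raw columns into the single feature vector MLlib expects.
assembled = VectorAssembler(
    inputCols=["f1", "f2"], outputCol="features").transform(df)

model = RandomForestClassifier(
    featuresCol="features", labelCol="label", numTrees=10).fit(assembled)

# Evaluate accuracy (on training data here, purely for illustration).
predictions = model.transform(assembled)
accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(predictions)
print("Accuracy:", accuracy)
```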
Configuration
- This practical example will familiarize you with all aspects of producing a result, not just writing code.
- Supply command-line options to change your application configuration, such as increasing available memory (see the sketch below)
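A minimal sketch of the two common ways to adjust Spark memory settings: flags passed to spark-submit, or the equivalent properties set in code. The memory values and file name are illustrative.

```python
# Passing configuration on the command line (illustrative values):
#   spark-submit --executor-memory 4G --driver-memory 2G app.py
#
# The same properties can be set programmatically:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("config-sketch")
         .config("spark.executor.memory", "4g")
         # driver memory only takes effect if set before the JVM starts,
         # so in practice it is usually passed via spark-submit instead:
         .config("spark.driver.memory", "2g")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.memory"))
```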
Note: The practical example will be healthcare-based, and Spark, Pig, Hive, Kafka, and Flume will be covered in great detail. Please note that we will not dive deep into statistics.