Basic to Advanced Data Science and ML with AWS SageMaker

Course Code: dsbdabes

Duration: 35 hours

Prerequisites:

Participants are expected to have a basic understanding of OOP concepts
Any previous experience with development or administration will be a plus
Experience with AWS or any other cloud-based environment will be a plus

Audience:

Anyone who would like to learn data science

Course Outline:

Discovery

Data preparation
Model planning
Model building
Presentation/Communication of results
Operationalization
Exercise: Case study

What is Hadoop? A basic introduction

Introduction
Getting Started
Use cases
Machine Learning and future
Data Science Overview
Big Data Overview
Data Structures
Drivers and complexities of Big Data
Big Data ecosystem and a new approach to analytics
Key technologies in Big Data
Data Mining process and problems
Association Pattern Mining
Data Clustering
Outlier Detection
Data Classification
Introduction to Data Analytics lifecycle

Getting started with Python

Installing Jupyter notebook
Features of Python language
Python review and core libraries

Installing Hadoop

Understanding Hadoop
HDFS
MapReduce architecture
Hadoop related projects overview
Writing programs in Hadoop MapReduce
Exercises
Using Python and Hadoop
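
The MapReduce flow covered in this module can be sketched in plain Python before moving to a real cluster. This is a minimal, single-process illustration only: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are hypothetical, and no Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, standing in for Hadoop's shuffle and sort."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Chain the three phases in one process (illustrative, not distributed)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, not course data.
sample_lines = ["big data big ideas", "data science"]
counts = word_count(sample_lines)
print(counts)
```

On a cluster the mapper and reducer would run as separate tasks over HDFS blocks; here everything runs in memory so the data flow is easy to follow.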

HDFS Basics and Cloudera

Introduction to HDFS, YARN and MySQL database setup and installation
Prepare AWS AMI for Cloudera Installation
Cloudera Installation Phases and Paths
Cloudera Manager Introduction and Overview
Parcels and Repository setup with Apache httpd
Cloudera Installation Path B with local repository – AMI preparation
Add Cluster, Add Service and Delete Cluster life cycle

Installing and using Hortonworks Framework

Installing a single node cluster of Hortonworks framework
Configure a local HDP repository
Install HDP using the Ambari install wizard
Decommission a node
Add a new node to an existing cluster
Add an HDP service to a cluster using Ambari
Change the configuration of a service using Ambari
Configure the location of log files for services

Data preparation steps

Feature extraction
Data cleaning
Data integration and transformation
Data reduction – sampling, feature subset selection, dimensionality reduction
Discretization and binning
Exercises and Case study
Exploratory data analytic methods
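
Two of the preparation steps listed above, transformation and discretization, can be illustrated with a short sketch. It assumes min-max scaling as the transformation and equal-width binning as the discretization method; the function names and the sample ages are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample data.
ages = [18, 22, 35, 47, 60]
scaled = min_max_scale(ages)
bins = equal_width_bins(ages, 3)
print(scaled)
print(bins)
```

Equal-width binning is only one option; equal-frequency (quantile) binning is often preferred when the data is skewed.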

Data Analysis

Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
Use metastore tables as an input source or an output sink for Spark applications
Understand the fundamentals of querying datasets in Spark
Filter data using Spark
Write queries that calculate aggregate statistics
Join disparate datasets using Spark
Produce ranked or sorted data

Descriptive statistics

Exploratory data analysis
Visualization – preliminary steps
Visualizing a single variable
Examining multiple variables
Statistical methods for evaluation
Hypothesis testing
Exercises and Case study
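
As a small illustration of the descriptive statistics and hypothesis testing topics above, the sketch below summarizes a sample and computes Welch's t statistic for two independent samples. It uses only the Python standard library; the function names and the two samples are hypothetical, and the p-value lookup is left out.

```python
import math
import statistics


def describe(sample):
    """Basic descriptive statistics for one numeric sample."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1)
    }


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical measurements for two groups.
control = [10, 12, 14, 16, 18]
treated = [20, 22, 24, 26, 28]
summary = describe(control)
t_stat = welch_t(control, treated)
print(summary, t_stat)
```

A large absolute t value suggests the group means differ; turning it into a p-value requires the t distribution, for example through a statistics library.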

Data Visualizations

Basic visualizations in Python
Packages for data visualization
Advanced graphs
Exercises
An introduction to data visualization tools like Tableau
Regression (Estimating future values)

AWS SageMaker

Introduction
How Amazon SageMaker Works
Set Up Amazon SageMaker
Get Started with Amazon SageMaker
Amazon SageMaker Studio
Use Machine Learning Frameworks, Python, and R with Amazon SageMaker
Use Amazon SageMaker Autopilot to automate model development
Create and Manage Workforces
Use Amazon SageMaker Ground Truth for Data Labeling
Process Data and Evaluate Models
Build Models
Train Models
Deploy Models
Monitor Amazon SageMaker
Using Amazon Augmented AI for Human Review
Security on Amazon SageMaker
Supported Regions and Quotas
Crowd HTML Elements Reference

Stats and Machine Learning

Linear regression
Use cases
Model description
Diagnostics
Problems with linear regression
Shrinkage methods, ridge regression, the lasso
Generalizations and nonlinearity
Regression splines
Local polynomial regression
Generalized additive models
Regression with RHadoop
Exercises and Case study
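
The linear regression and ridge topics above can be sketched for the single-predictor case, where the closed-form solution is short enough to write out. This is an illustrative Python sketch, not the RHadoop workflow named in the outline; fit_line and ridge_lambda are hypothetical names, and the penalty is applied to the slope only.

```python
def fit_line(xs, ys, ridge_lambda=0.0):
    """Fit y = intercept + slope * x by least squares.

    With ridge_lambda > 0 the slope is shrunk toward zero (ridge penalty
    on the slope; the intercept is left unpenalized).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_lambda)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
ols_slope, ols_intercept = fit_line(xs, ys)
ridge_slope, ridge_intercept = fit_line(xs, ys, ridge_lambda=10.0)
print(ols_slope, ols_intercept, ridge_slope, ridge_intercept)
```

Comparing the two fits shows the shrinkage effect directly: the ridge slope is smaller in magnitude than the ordinary least-squares slope.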

Classification

Classification-related problems
Bayesian refresher
Naïve Bayes
Logistic regression
K-nearest neighbors
Decision trees algorithm
Neural networks
Support vector machines
Diagnostics of classifiers
Comparison of classification methods
Scalable classification algorithms
Exercises and Case study
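
Of the classifiers listed above, K-nearest neighbors is the simplest to sketch from scratch. The example below assumes Euclidean distance and majority voting; knn_predict and the toy points are hypothetical, and Python 3.8 or later is assumed for math.dist.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-class toy data.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]
prediction_low = knn_predict(train_points, train_labels, (2, 2))
prediction_high = knn_predict(train_points, train_labels, (7, 7))
print(prediction_low, prediction_high)
```

This brute-force version scans every training point per query, which is fine for teaching but is exactly what the scalable classification topic aims to improve on.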

Assessing model performance and selection

Bias, Variance and model complexity
Accuracy vs Interpretability
Evaluating classifiers
Measures of model/algorithm performance
Hold-out method of validation
Cross-validation
Tuning machine learning algorithms with caret package
Visualizing model performance with Profit, ROC and Lift curves
Ensemble Methods
Bagging
Random Forests
Boosting
Gradient boosting
Exercises and Case study
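
The evaluation and cross-validation topics above can be illustrated with a standard-library sketch. It is not the caret workflow named in the outline; classification_metrics and k_fold_indices are hypothetical helper names, and the folds are contiguous and unshuffled for simplicity.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
folds = list(k_fold_indices(5, 2))
print(metrics)
print(folds)
```

In practice the indices are usually shuffled (and often stratified by class) before splitting, so each fold is representative of the whole data set.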

Support vector machines for classification and regression

Maximal Margin classifiers
Support vector classifiers
Support vector machines
SVMs for classification problems
SVMs for regression problems
Exercises and Case study
Identifying unknown groupings within a data set

Machine Learning using Python and Spark

Learning a Model
SVM, Evaluation
Decision Trees and Random Forests
Classification Redux, Comparing Models
Ensemble Methods
Best Practices

Feature Selection for Clustering

Representative-based algorithms: k-means, k-medoids
Hierarchical algorithms: agglomerative and divisive methods
Probabilistic model-based algorithms: EM
Density-based algorithms: DBSCAN, DENCLUE
Cluster validation
Advanced clustering concepts
Clustering with RHadoop
Exercises and Case study
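
To make the representative-based family above concrete, here is a minimal k-means sketch (Lloyd's algorithm). It assumes Euclidean distance and caller-supplied starting centroids so the result is deterministic; k_means and the sample points are hypothetical, and Python 3.8 or later is assumed for math.dist.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids, assignments)
```

The outcome depends on the starting centroids, which is why practical implementations run several random restarts or use a seeding scheme such as k-means++.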

Discovering connections with Link Analysis

Link analysis concepts
Metrics for analyzing networks
The PageRank algorithm
Hyperlink-Induced Topic Search
Link Prediction
Exercises and Case study
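
The PageRank topic above can be sketched with power iteration on a small directed graph. The sketch assumes a damping factor of 0.85, a fixed number of iterations, and that every link target also appears as a key in the graph; the pagerank function and the four-node graph are hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * ranks[node] / len(out_links)
                for target in out_links:
                    new_ranks[target] += share
            else:
                # A dangling node spreads its rank evenly over all nodes.
                for target in nodes:
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks


# Hypothetical link graph: "c" receives links from three other pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top_node = max(ranks, key=ranks.get)
print(ranks, top_node)
```

The ranks sum to 1, so they can be read as the long-run share of visits a random surfer would make to each page.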

Association Pattern Mining

Frequent Pattern Mining Model
Scalability issues in frequent pattern mining
Brute Force algorithms
Apriori algorithm
The FP growth approach
Evaluation of Candidate Rules
Applications of Association Rules
Validation and Testing
Diagnostics
Association rules with R and Hadoop
Exercises and Case study
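
The Apriori algorithm listed above can be sketched in a few lines of Python rather than the R and Hadoop tooling named in the outline. The sketch assumes support is given as an absolute transaction count; the apriori function and the small basket data set are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        size += 1
        # Join step: union pairs of frequent itemsets into candidates one item larger.
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical market-basket data.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]
frequent = apriori(baskets, min_support=3)
print(frequent)
```

The prune step is what separates Apriori from the brute-force approach: any candidate with an infrequent subset is discarded before its support is counted.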

Constructing recommendation engines

Understanding recommender systems
Data mining techniques used in recommender systems
Recommender systems with recommenderlab package
Evaluating the recommender systems
Recommendations with RHadoop
Exercise: Building recommendation engine

Text analysis

Text analysis steps
Collecting raw text
Bag of words
Term Frequency–Inverse Document Frequency (TF-IDF)
Determining Sentiments
Exercises and Case study
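
The bag-of-words and TF-IDF steps above can be sketched with the standard library. This version assumes whitespace tokenization, relative term frequency, and an unsmoothed idf of log(N / df); other weighting variants are common. The tf_idf function and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    document_frequency = Counter(
        term for tokens in tokenized for term in set(tokens)
    )
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # bag-of-words counts for this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical mini corpus.
documents = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(documents)
print(weights)
```

A term that occurs in every document, such as "the" here, gets a weight of zero, while a term unique to one document gets the highest weight in that document.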

Note: We will use a set of tools, mainly Jupyter Notebook, for these exercises. AWS SageMaker will be used for most of our computations.