Basic to Advanced Data Science and ML with AWS SageMaker (dsbdabes | 35 hours)

Prerequisites:
  • Participants are expected to have a basic understanding of OOP concepts
  • Any previous experience with development or administration will be a plus
  • Experience with AWS or any other cloud-based environment will be a plus

Audience:

  • Anyone who would like to learn Data Science

Course Outline:

Discovery

  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

What is Hadoop? A basic introduction

  • Introduction
  • Getting Started
  • Use cases
  • Machine Learning and future
  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
  • Association Pattern Mining
  • Data Clustering
  • Outlier Detection
  • Data Classification
  • Introduction to Data Analytics lifecycle

Getting started with Python

  • Installing Jupyter notebook
  • Features of Python language
  • Python review and core libraries

Installing Hadoop

  • Understanding Hadoop
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises
  • Using Python and Hadoop
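
As a taste of the MapReduce topics above, here is a minimal sketch of the map, shuffle and reduce phases for a word count, written in plain Python. It runs in a single process and does not use Hadoop itself; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data science"])
print(counts)
```

On a real cluster the mapper and reducer run as separate tasks on different nodes, and the shuffle is handled by the framework; the sketch only shows the data flow between the phases.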

HDFS Basics and Cloudera

  • Introduction to HDFS, YARN and MySQL database setup and installation
  • Prepare AWS AMI for Cloudera Installation
  • Cloudera Installation Phases and Paths
  • Cloudera Manager Introduction and Overview
  • Parcels and Repository setup with Apache httpd
  • Cloudera Installation Path B with local repository – AMI and prepare
  • Add Cluster, Add Service and Delete Cluster life cycle

Installing and using Hortonworks Framework

  • Installing a single node cluster of Hortonworks framework
  • Configure a local HDP repository
  • Install HDP using the Ambari install wizard
  • Decommission a node
  • Add a new node to an existing cluster
  • Add an HDP service to a cluster using Ambari
  • Change the configuration of a service using Ambari
  • Configure the location of log files for services

Data preparation steps

  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection, dimensionality reduction
  • Discretization and binning
  • Exercises and Case study
  • Exploratory data analytic methods

Data Analysis

  • Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
  • Use metastore tables as an input source or an output sink for Spark applications
  • Understand the fundamentals of querying datasets in Spark
  • Filter data using Spark
  • Write queries that calculate aggregate statistics
  • Join disparate datasets using Spark
  • Produce ranked or sorted data

Descriptive statistics

  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study
  • Data Visualizations

Basic visualizations in Python

  • Packages for data visualization
  • Advanced graphs
  • Exercises
  • An introduction to data visualization tools like Tableau
  • Regression (Estimating future values)

AWS SageMaker

  • Introduction
  • How Amazon SageMaker Works
  • Set Up Amazon SageMaker
  • Get Started with Amazon SageMaker
  • Amazon SageMaker Studio
  • Use Machine Learning Frameworks, Python, and R with Amazon SageMaker
  • Use Amazon SageMaker Autopilot to automate model development
  • Create and Manage Workforces
  • Use Amazon SageMaker Ground Truth for Data Labeling
  • Process Data and Evaluate Models
  • Build Models
  • Train Models
  • Deploy Models
  • Monitor Amazon SageMaker
  • Using Amazon Augmented AI for Human Review
  • Security on Amazon SageMaker
  • Supported Regions and Quotas
  • Crowd HTML Elements Reference

Stats and Machine Learning

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study
  • Classification
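
To make the linear regression and ridge regression topics above concrete, the following is a minimal sketch for a single predictor, in plain Python. It assumes the ridge penalty is applied to the slope only (not the intercept); the function name fit_line and the parameter name ridge_lambda are illustrative assumptions.

```python
def fit_line(xs, ys, ridge_lambda=0.0):
    """Fit y = intercept + slope * x by least squares.

    With ridge_lambda > 0 the slope is shrunk towards zero (ridge
    regression with an unpenalized intercept). With ridge_lambda == 0
    this is ordinary least squares.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums of squares and cross-products.
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = s_xy / (s_xx + ridge_lambda)
    intercept = y_mean - slope * x_mean
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
print(fit_line(xs, ys))                    # ordinary least squares
print(fit_line(xs, ys, ridge_lambda=5.0))  # shrunken slope
```

Increasing ridge_lambda trades a little bias for lower variance, which is the idea behind the shrinkage methods listed above.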

The classification related problems

  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision trees algorithm
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study
  • Assessing model performance and selection

Bias, Variance and model complexity

  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC and Lift curves
  • Ensemble Methods
  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study
  • Support vector machines for classification and regression
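
The hold-out and cross-validation topics above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming contiguous folds on data that has already been shuffled; k_fold_indices, cross_validate and the fit_and_score callback are hypothetical names introduced for the example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score):
    """Average the score of fit_and_score(train_idx, test_idx) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k


print(k_fold_indices(10, 3))
```

Each sample is used for testing exactly once, so the averaged score is a less noisy estimate than a single hold-out split.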

Maximal Margin classifiers

  • Support vector classifiers
  • Support vector machines
  • SVMs for classification problems
  • SVMs for regression problems
  • Exercises and Case study
  • Identifying unknown groupings within a data set

Machine Learning using Python and Spark

  • Learning a Model
  • SVM, Evaluation
  • Decision Trees and Random Forests
  • Classification Redux, Comparing Models
  • Ensemble Methods
  • Best Practices

Feature Selection for Clustering

  • Representative-based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic model-based algorithms: EM
  • Density-based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study
  • Discovering connections with Link Analysis
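
As an illustration of the representative-based clustering topic above, here is a minimal k-means sketch in plain Python. It assumes the first k points are used as the initial centroids (real implementations usually pick them at random or with k-means++); all function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def assign(points, centroids):
    """Assignment step: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points
    clusters = assign(points, centroids)
    for _ in range(max_iterations):
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
        clusters = assign(points, centroids)
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

The result depends on the initial centroids, which is one reason cluster validation appears as a separate topic in the list.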

Link analysis concepts

  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search
  • Link Prediction
  • Exercises and Case study
  • Association Pattern Mining
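
The PageRank topic above can be sketched as a simple power iteration in plain Python. This is a minimal illustration, assuming the graph is a dict that maps every node to its list of outgoing links, that every link target is also a key of the dict, and that a damping factor of 0.85 is used; the function name pagerank is illustrative.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate PageRank scores by repeated redistribution of rank."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            out_links = graph[node]
            for target in out_links:
                # Each node shares its rank equally among its out-links.
                new_rank[target] += damping * rank[node] / len(out_links)
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for node, score in sorted(pagerank(graph).items()):
    print(node, round(score, 3))
```

In this small graph, C receives links from both A and B, so it ends up with the highest score.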

Frequent Pattern Mining Model

  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP-growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study
  • Constructing recommendation engines
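
To illustrate the Apriori algorithm and the evaluation of candidate rules listed above, here is a minimal level-wise sketch in plain Python. It assumes transactions are given as sets of item names and that support is measured as a fraction of transactions; the function names are illustrative, and the code favors readability over the scalability concerns covered in this section.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, built one itemset size at a time."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent, size = {}, 1
    while current:
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, size - 1))]
    return frequent


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
for itemset, s in apriori(baskets, min_support=0.5).items():
    print(sorted(itemset), s)
```

The prune step is what distinguishes Apriori from a brute-force search: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.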

Understanding recommender systems

  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building recommendation engine
  • Text analysis

Text analysis steps

  • Collecting raw text
  • Bag of words
  • Term Frequency–Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study
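
The bag-of-words and TF-IDF steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming whitespace tokenization, term frequency as a share of the document length, and the unsmoothed inverse document frequency log(N / df); libraries often use smoothed variants, so their numbers will differ. The function name tf_idf is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    # Bag of words: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["good movie", "bad movie", "good good plot"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that occur in every document get a weight of zero, while terms concentrated in a few documents are weighted up.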

Note: We will use a set of tools, mainly Jupyter Notebook, for these exercises. AWS SageMaker will be used for most of our computations.