Basic to Advanced Data Science and ML with AWS SageMaker (dsbdabes | 35 hours)
Prerequisites:
- Participants are expected to have a basic understanding of OOP concepts
- Any previous experience with development or administration will be a plus
- Experience with AWS or any other cloud-based environment will be a plus
Audience:
- Anyone who would like to learn data science
Course Outline:
Discovery
- Data preparation
- Model planning
- Model building
- Presentation/Communication of results
- Operationalization
- Exercise: Case study
What is Hadoop? A basic introduction
- Introduction
- Getting Started
- Use cases
- Machine Learning and future
- Data Science Overview
- Big Data Overview
- Data Structures
- Drivers and complexities of Big Data
- Big Data ecosystem and a new approach to analytics
- Key technologies in Big Data
- Data Mining process and problems
- Association Pattern Mining
- Data Clustering
- Outlier Detection
- Data Classification
- Introduction to Data Analytics lifecycle
Getting started with Python
- Installing Jupyter notebook
- Features of Python language
- Python review and core libraries
Installing Hadoop
- Understanding Hadoop
- HDFS
- MapReduce architecture
- Hadoop related projects overview
- Writing programs in Hadoop MapReduce
- Exercises
- Using Python and Hadoop
HDFS Basics and Cloudera
- Introduction to HDFS, YARN, and MySQL database setup and installation
- Prepare AWS AMI for Cloudera Installation
- Cloudera Installation Phases and Paths
- Cloudera Manager Introduction and Overview
- Parcels and Repository setup with Apache httpd
- Cloudera Installation Path B with local repository – AMI and prepare
- Add Cluster, Add Service and Delete Cluster life cycle
Installing and using Hortonworks Framework
- Installing a single node cluster of Hortonworks framework
- Configure a local HDP repository
- Install HDP using the Ambari install wizard
- Decommission a node
- Add a new node to an existing cluster
- Add an HDP service to a cluster using Ambari
- Change the configuration of a service using Ambari
- Configure the location of log files for services
Data preparation steps
- Feature extraction
- Data cleaning
- Data integration and transformation
- Data reduction – sampling, feature subset selection
- Dimensionality reduction
- Discretization and binning
- Exercises and Case study
- Exploratory data analytic methods
Data Analysis
- Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.
- Use metastore tables as an input source or an output sink for Spark applications
- Understand the fundamentals of querying datasets in Spark
- Filter data using Spark
- Write queries that calculate aggregate statistics
- Join disparate datasets using Spark
- Produce ranked or sorted data
Descriptive statistics
- Exploratory data analysis
- Visualization – preliminary steps
- Visualizing single variable
- Examining multiple variables
- Statistical methods for evaluation
- Hypothesis testing
- Exercises and Case study
- Data Visualizations
Basic visualizations in Python
- Packages for data visualization
- Advanced graphs
- Exercises
- An introduction to data visualization tools like Tableau
- Regression (Estimating future values)
AWS SageMaker
- Introduction
- How Amazon SageMaker Works
- Set Up Amazon SageMaker
- Get Started with Amazon SageMaker
- Amazon SageMaker Studio
- Use Machine Learning Frameworks, Python, and R with Amazon SageMaker
- Use Amazon SageMaker Autopilot to automate model development
- Create and Manage Workforces
- Use Amazon SageMaker Ground Truth for Data Labeling
- Process Data and Evaluate Models
- Build Models
- Train Models
- Deploy Models
- Monitor Amazon SageMaker
- Using Amazon Augmented AI for Human Review
- Security on Amazon SageMaker
- Supported Regions and Quotas
- Crowd HTML Elements Reference
Stats and Machine Learning
- Linear regression
- Use cases
- Model description
- Diagnostics
- Problems with linear regression
- Shrinkage methods, ridge regression, the lasso
- Generalizations and nonlinearity
- Regression splines
- Local polynomial regression
- Generalized additive models
- Regression with RHadoop
- Exercises and Case study
- Classification
Classification-related problems
- Bayesian refresher
- Naïve Bayes
- Logistic regression
- K-nearest neighbors
- Decision trees algorithm
- Neural networks
- Support vector machines
- Diagnostics of classifiers
- Comparison of classification methods
- Scalable classification algorithms
- Exercises and Case study
- Assessing model performance and selection
Bias, Variance and model complexity
- Accuracy vs Interpretability
- Evaluating classifiers
- Measures of model/algorithm performance
- Hold-out method of validation
- Cross-validation
- Tuning machine learning algorithms with caret package
- Visualizing model performance with Profit, ROC, and Lift curves
- Ensemble Methods
- Bagging
- Random Forests
- Boosting
- Gradient boosting
- Exercises and Case study
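As a small illustration of the cross-validation topic listed above, the following is a minimal sketch of how k-fold splits can be generated in plain Python. The function name `k_fold_indices` is hypothetical and chosen for this example only; the sketch does not shuffle the data, which a real workflow would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: 10 samples split into 3 folds (illustrative sizes only).
folds = list(k_fold_indices(10, 3))
```

Each sample lands in exactly one test fold, so averaging a metric over the folds uses every observation for evaluation once.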
- Support vector machines for classification and regression
Maximal Margin classifiers
- Support vector classifiers
- Support vector machines
- SVMs for classification problems
- SVMs for regression problems
- Exercises and Case study
- Identifying unknown groupings within a data set
Machine Learning using Python and Spark
- Learning a Model
- SVM, Evaluation
- Decision Trees and Random Forests
- Classification Redux, Comparing Models
- Ensemble Methods
- Best Practices
Feature Selection for Clustering
- Representative-based algorithms: k-means, k-medoids
- Hierarchical algorithms: agglomerative and divisive methods
- Probabilistic algorithms: EM
- Density-based algorithms: DBSCAN, DENCLUE
- Cluster validation
- Advanced clustering concepts
- Clustering with RHadoop
- Exercises and Case study
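To make the k-means item above concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. It assumes the caller supplies the initial centroids (real implementations usually pick them randomly or with k-means++); the function name `kmeans` and the sample points are illustrative only.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    centroids = list(initial_centroids)
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters


# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, clusters = kmeans(points, initial_centroids=[points[0], points[3]])
```

The result depends on the starting centroids, which is one reason cluster validation appears as a separate topic in this module.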
- Discovering connections with Link Analysis
Link analysis concepts
- Metrics for analyzing networks
- The PageRank algorithm
- Hyperlink-Induced Topic Search
- Link Prediction
- Exercises and Case study
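The PageRank item above can be sketched with a short power iteration. This is a simplified illustration, assuming the graph is given as a dictionary mapping every node to its list of outgoing links; the function name `pagerank`, the damping value of 0.85, and the tiny example graph are assumptions made for this sketch.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on {node: [outgoing neighbours]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes an equal share of its rank to every node it links to.
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives links from every other page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this example the page with the most incoming links ends up with the highest rank, and the page nobody links to ends up with the lowest.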
- Association Pattern Mining
Frequent Pattern Mining Model
- Scalability issues in frequent pattern mining
- Brute Force algorithms
- Apriori algorithm
- The FP-growth approach
- Evaluation of Candidate Rules
- Applications of Association Rules
- Validation and Testing
- Diagnostics
- Association rules with R and Hadoop
- Exercises and Case study
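As an illustration of the Apriori item above, the following is a minimal sketch of level-wise frequent itemset mining in plain Python. It covers only the frequent-itemset step, not rule generation; the function name `apriori` and the shopping baskets are made up for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        # Keep only candidates that meet the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_support=0.6)
```

The prune step is what separates Apriori from the brute-force approach: a candidate is counted against the data only if all of its smaller subsets were already found to be frequent.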
- Constructing recommendation engines
Understanding recommender systems
- Data mining techniques used in recommender systems
- Recommender systems with recommenderlab package
- Evaluating the recommender systems
- Recommendations with RHadoop
- Exercise: Building recommendation engine
- Text analysis
Text analysis steps
- Collecting raw text
- Bag of words
- Term Frequency – Inverse Document Frequency (TF-IDF)
- Determining Sentiments
- Exercises and Case study
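The bag-of-words and TF-IDF items above can be sketched in a few lines of plain Python. This version assumes whitespace tokenization and the unsmoothed weighting tf × log(N / df); libraries often use smoothed variants, so their numbers will differ. The function name `tf_idf` and the sample sentences are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag-of-words representation
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
```

A word that occurs in every document, such as "the" here, receives a weight of zero, while words unique to one document receive the highest weights.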
Note: We will use a set of tools, mainly Jupyter Notebook, for these calculations. AWS SageMaker will be used for most of our computations.