Course Code: dsbda
Duration: 35 hours
Overview:

Big data refers to data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big data challenges include data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

Course Outline:

Introduction to Data Science for Big Data Analytics

  • Data Science Overview
  • Big Data Overview
  • Data Structures
  • Drivers and complexities of Big Data
  • Big Data ecosystem and a new approach to analytics
  • Key technologies in Big Data
  • Data Mining process and problems
    • Association Pattern Mining
    • Data Clustering
    • Outlier Detection
    • Data Classification

Introduction to Data Analytics lifecycle

  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Presentation/Communication of results
  • Operationalization
  • Exercise: Case study

From this point on, most of the training time (80%) will be spent on examples and exercises in R and related big data technologies.

Getting started with R

  • Installing R and RStudio
  • Features of R language
  • Objects in R
  • Data in R
  • Data manipulation
  • Big data issues
  • Exercises

Getting started with Hadoop

  • Installing Hadoop
  • Understanding Hadoop modes
  • HDFS
  • MapReduce architecture
  • Hadoop related projects overview
  • Writing programs in Hadoop MapReduce
  • Exercises

Integrating R and Hadoop with RHadoop

  • Components of RHadoop
  • Installing RHadoop and connecting with Hadoop
  • The architecture of RHadoop
  • Hadoop streaming with R
  • Data analytics problem solving with RHadoop
  • Exercises

Pre-processing and preparing data

  • Data preparation steps
  • Feature extraction
  • Data cleaning
  • Data integration and transformation
  • Data reduction – sampling, feature subset selection
  • Dimensionality reduction
  • Discretization and binning
  • Exercises and Case study

Exploratory data analytic methods in R

  • Descriptive statistics
  • Exploratory data analysis
  • Visualization – preliminary steps
  • Visualizing a single variable
  • Examining multiple variables
  • Statistical methods for evaluation
  • Hypothesis testing
  • Exercises and Case study

Data Visualizations

  • Basic visualizations in R
  • Packages for data visualization: ggplot2, lattice, plotly
  • Formatting plots in R
  • Advanced graphs
  • Exercises

Regression (Estimating future values)

  • Linear regression
  • Use cases
  • Model description
  • Diagnostics
  • Problems with linear regression
  • Shrinkage methods, ridge regression, the lasso
  • Generalizations and nonlinearity
  • Regression splines
  • Local polynomial regression
  • Generalized additive models
  • Regression with RHadoop
  • Exercises and Case study

Classification

  • Classification-related problems
  • Bayesian refresher
  • Naïve Bayes
  • Logistic regression
  • K-nearest neighbors
  • Decision tree algorithms
  • Neural networks
  • Support vector machines
  • Diagnostics of classifiers
  • Comparison of classification methods
  • Scalable classification algorithms
  • Exercises and Case study

Assessing model performance and selection

  • Bias, Variance and model complexity
  • Accuracy vs Interpretability
  • Evaluating classifiers
  • Measures of model/algorithm performance
  • Hold-out method of validation
  • Cross-validation
  • Tuning machine learning algorithms with caret package
  • Visualizing model performance with Profit, ROC and Lift curves

Ensemble Methods

  • Bagging
  • Random Forests
  • Boosting
  • Gradient boosting
  • Exercises and Case study

Support vector machines for classification and regression

  • Maximal Margin classifiers
    • Support vector classifiers
    • Support vector machines
    • SVMs for classification problems
    • SVMs for regression problems
  • Exercises and Case study

Identifying unknown groupings within a data set

  • Feature Selection for Clustering
  • Representative-based algorithms: k-means, k-medoids
  • Hierarchical algorithms: agglomerative and divisive methods
  • Probabilistic algorithms: EM
  • Density-based algorithms: DBSCAN, DENCLUE
  • Cluster validation
  • Advanced clustering concepts
  • Clustering with RHadoop
  • Exercises and Case study

Discovering connections with Link Analysis

  • Link analysis concepts
  • Metrics for analyzing networks
  • The PageRank algorithm
  • Hyperlink-Induced Topic Search (HITS)
  • Link Prediction
  • Exercises and Case study

Association Pattern Mining

  • Frequent Pattern Mining Model
  • Scalability issues in frequent pattern mining
  • Brute Force algorithms
  • Apriori algorithm
  • The FP-growth approach
  • Evaluation of Candidate Rules
  • Applications of Association Rules
  • Validation and Testing
  • Diagnostics
  • Association rules with R and Hadoop
  • Exercises and Case study

Constructing recommendation engines

  • Understanding recommender systems
  • Data mining techniques used in recommender systems
  • Recommender systems with recommenderlab package
  • Evaluating the recommender systems
  • Recommendations with RHadoop
  • Exercise: Building a recommendation engine

Text analysis

  • Text analysis steps
  • Collecting raw text
  • Bag of words
  • Term Frequency–Inverse Document Frequency (TF-IDF)
  • Determining Sentiments
  • Exercises and Case study