Course Code: bdanash1
Duration: 21 hours
Prerequisites:

  • Understanding of traditional data management and analysis methods such as SQL, data warehouses, business intelligence, and OLAP
  • Understanding of basic statistics and probability (mean, variance, conditional probability, etc.)

Overview:

Audience

If you are trying to make sense of the data you have access to, or want to analyse unstructured data available on the web (such as Twitter, LinkedIn, etc.), this course is for you.

It is mostly aimed at people who need to decide which data is worth collecting and which is worth analysing.

It is not aimed at people configuring the solution, although they will still benefit from the big picture.

Delivery Mode

During the course, delegates will be presented with working examples of mostly open-source technologies.

Short lectures are followed by presentations and simple exercises for the participants.

Content and Software used

All software used is updated each time the course is run, so we work with the newest available versions.

The course covers the whole process, from obtaining, formatting, processing, and analysing data to automating decision-making with machine learning.

Course Outline:

Day 1: Big Data Analytics (8.5 hours)

Quick Overview

  • Data Sources
  • Mining Data
  • Recommender systems
  • Data types
    • Structured vs unstructured
    • Static vs streamed
    • Data-driven vs user-driven analytics
    • Data validity

Models and Classification

  • Statistical Models
  • Classification
  • Clustering: k-groups, k-means, nearest neighbours
  • Ant colonies, birds flocking
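Of the clustering algorithms listed above, k-means is the easiest to show end to end: pick k starting centroids, assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal plain-Python sketch (the 2-D toy points are illustrative, not course data):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, clusters

# Two visually obvious groups: one near (1.5, 1.3), one near (8.5, 8.3).
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

With well-separated groups like these, the algorithm settles on the two cluster means after a handful of iterations regardless of which points the seeding picks.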

Predictive Models

  • Decision trees
  • Support vector machines
  • Naive Bayes classification
  • Markov models
  • Regression
  • Ensemble methods
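Naive Bayes is a good candidate for a from-scratch illustration, since it is just counting plus Bayes' rule. The sketch below is a toy word-count classifier with add-one (Laplace) smoothing; the tiny spam/ham corpus is made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Collect class priors,
    per-class word counts, and the overall vocabulary."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """Pick the class with the highest log-posterior, using
    add-one smoothing so unseen words never zero out a class."""
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, prior in priors.items():
        n = sum(word_counts[label].values())
        lp = math.log(prior / total)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
model = train(docs)
label = predict(["free", "money"], *model)  # "spam" for this toy corpus
```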

Building Models

  • Data Preparation (MapReduce)
  • Data cleansing
  • Developing and testing a model
  • Model evaluation, deployment and integration
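The MapReduce pattern behind large-scale data preparation can be mimicked in a few lines of plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a conceptual sketch of the pattern, not Hadoop code:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) for every word in every record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the
    # framework would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Big data needs big tools", "data beats opinion"]
counts = reduce_phase(shuffle(map_phase(lines)))  # e.g. counts["big"] == 2
```

In a real cluster, each phase runs in parallel across many machines; the logic per phase stays this simple.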

Overview of Open Source and commercial software

  • Selection of R-project package
  • Python libraries
  • Hadoop and Mahout
  • Selected Apache projects related to Big Data and Analytics
  • Selected commercial solutions
  • Integration with existing software and data sources

Day 2: Mahout and Spark (8.5 hours)

Implementing Recommendation Systems with Mahout

  • Introduction to recommender systems
  • Representing recommender data
  • Making recommendations
  • Optimizing recommendations
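Under the hood, collaborative filtering of the kind Mahout implements rests on user-to-user (or item-to-item) similarity. The sketch below shows the idea with cosine similarity over a made-up ratings dictionary; it is a conceptual illustration, not the Mahout API:

```python
import math

# Hypothetical user -> {item: rating} data for illustration only.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 4, "inception": 5, "avatar": 4},
    "carol": {"titanic": 5, "notebook": 4, "matrix": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user, k=1):
    """Score items the user has not rated by similarity-weighted
    ratings from other users, and return the top k."""
    sims = {other: cosine(ratings[user], ratings[other])
            for other in ratings if other != user}
    scores = {}
    for other, sim in sims.items():
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here alice's tastes align with bob's, so bob's unseen items score highest for her; Mahout applies the same principle to millions of users.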

Spark basics

  • Spark and Hadoop
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Spark Streaming)
  • Labs: Installing and running Spark
  • Running Spark in local mode
  • Spark web UI
  • Spark shell
  • Inspecting RDDs
  • Labs: Spark shell exploration
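A key Spark concept worth previewing is lazy evaluation: transformations only describe a pipeline, and nothing executes until an action asks for results. Plain-Python generators exhibit the same behaviour, so the following sketch (not PySpark code) illustrates it:

```python
# Transformations build a lazy pipeline; nothing runs until an
# "action" pulls results through. Generators behave the same way.

log = []  # records when each source element is actually processed

def rdd_map(data, fn):
    for x in data:          # lazy: elements are pulled one at a time
        log.append(x)
        yield fn(x)

def rdd_filter(data, pred):
    for x in data:
        if pred(x):
            yield x

numbers = range(1, 6)
pipeline = rdd_filter(rdd_map(numbers, lambda x: x * x),
                      lambda x: x % 2 == 1)

assert log == []            # defining the pipeline did no work yet
result = list(pipeline)     # the "action" forces evaluation: [1, 9, 25]
```

Spark adds partitioning, fault tolerance, and distribution on top, but the lazy pipeline model is exactly this.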

Spark API programming

  • Introduction to Spark API / RDD API
  • Submitting the first program to Spark
  • Debugging / logging
  • Configuration properties

Spark and Hadoop

  • Hadoop Intro (HDFS / YARN)
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark

Spark Operations

  • Deploying Spark in production
  • Sample deployment templates
  • Configurations
  • Monitoring
  • Troubleshooting

Day 3: Google Cloud Platform Big Data & Machine Learning Fundamentals (4 hours)

Data Analytics on the Cloud

  • What is the Google Cloud Platform?
  • GCP Big Data Products
  • CloudSQL: your SQL database on the cloud
    • A no-ops database
    • Lab: importing data into CloudSQL and running queries on rentals data
  • Dataproc
    • Managed Hadoop + Pig + Spark on the cloud
    • Lab: Machine Learning with SparkML

Scaling data analysis

  • Fast random access
    • Datastore: Key-Entity
    • BigTable: wide-column
  • Datalab
    • Why Datalab? (interactive, iterative)
    • Demo: Sample notebook in Datalab
  • BigQuery
    • Interactive queries on petabytes
    • Lab: Build machine learning dataset
  • Machine Learning with TensorFlow
    • TensorFlow
    • Lab: Train and use neural network
  • Fully built models for common needs
    • Vision API
    • Translate API
    • Lab: Translate
  • Genomics API (optional)
    • What is linkage disequilibrium?
    • Finding LD using Dataflow and BigQuery
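The neural-network lab above builds on the idea of a trainable artificial neuron. As a warm-up, the sketch below trains a single perceptron on the AND function with the classic perceptron update rule; frameworks like TensorFlow scale the same idea to many layers trained by gradient descent. (Pure illustration; the learning rate and epoch count are arbitrary.)

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """One artificial neuron with a step activation, trained by the
    perceptron rule: nudge the weights by (target - output) * input."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Truth table for logical AND: output 1 only when both inputs are 1.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this training loop finds a correct weight setting.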

Data processing architectures

  • Asynchronous processing with TaskQueues
  • Message-oriented architectures with Pub/Sub
  • Creating pipelines with Dataflow
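The decoupling that Pub/Sub provides can be previewed in miniature with a thread-safe queue: a producer publishes messages while a consumer processes them independently, and neither needs to know about the other. A plain-Python sketch (the sentinel-based shutdown is our own convention, not part of Pub/Sub):

```python
import queue
import threading

# The queue stands in for a Pub/Sub topic: the producer publishes
# to it and the consumer subscribes, each running independently.
topic = queue.Queue()
results = []

def producer():
    for i in range(5):
        topic.put(i)
    topic.put(None)  # sentinel: signal that no more messages are coming

def consumer():
    while True:
        msg = topic.get()
        if msg is None:
            break
        results.append(msg * 10)  # stand-in for real message processing

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

At cloud scale, Pub/Sub adds durable storage, fan-out to many subscribers, and at-least-once delivery, but the producer/consumer decoupling is the same.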

Summary

  • Where to go from here
  • Resources