Course Code: bdanash1
Duration: 21 hours
Prerequisites:

  • Understanding of traditional data management and analysis methods such as SQL, data warehouses, business intelligence, and OLAP
  • Understanding of basic statistics and probability (mean, variance, conditional probability, etc.)

Overview:

Audience

If you are trying to make sense of the data you have access to, or want to analyse unstructured data available on the web (such as Twitter, LinkedIn, etc.), this course is for you.

It is mostly aimed at people who need to decide which data is worth collecting and which is worth analysing.

It is not aimed at people configuring the solution, although they will still benefit from the big picture.

Delivery Mode

During the course, delegates will be presented with working examples of mostly open-source technologies.

Short lectures are followed by presentations and simple exercises for the participants.

Content and Software used

All software used is updated each time the course is run, so we work with the newest available versions.

The course covers the whole process, from obtaining, formatting, processing, and analysing data to automating decision-making with machine learning.

Course Outline:

Day 1: Big Data Analytics (8.5 hours)

Quick Overview

  • Data Sources
  • Mining Data
  • Recommender systems
  • Data types
    • Structured vs unstructured
    • Static vs streamed
    • Data-driven vs user-driven analytics
    • Data validity

Models and Classification

  • Statistical Models
  • Classification
  • Clustering: k-groups, k-means, nearest neighbours
  • Ant colonies, birds flocking
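Of the clustering algorithms listed above, k-means is the easiest to show end to end: pick k starting centroids, assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal plain-Python sketch (the 2-D toy points are illustrative, not course data):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, clusters

# Two visually obvious groups: one near (1.5, 1.3), one near (8.5, 8.3).
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

With well-separated groups like these, the algorithm settles on the two cluster means after a handful of iterations regardless of which points the seeding picks.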

Predictive Models

  • Decision trees
  • Support vector machines
  • Naive Bayes classification
  • Markov models
  • Regression
  • Ensemble methods
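Naive Bayes is a good candidate for a from-scratch illustration, since it is just counting plus Bayes' rule. The sketch below is a toy word-count classifier with add-one (Laplace) smoothing; the tiny spam/ham corpus is made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Collect class priors,
    per-class word counts, and the overall vocabulary."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """Pick the class with the highest log-posterior, using
    add-one smoothing so unseen words never zero out a class."""
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, prior in priors.items():
        n = sum(word_counts[label].values())
        lp = math.log(prior / total)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
model = train(docs)
label = predict(["free", "money"], *model)  # "spam" for this toy corpus
```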

Building Models

  • Data Preparation (MapReduce)
  • Data cleansing
  • Developing and testing a model
  • Model evaluation, deployment and integration
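The MapReduce pattern behind large-scale data preparation can be mimicked in a few lines of plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a conceptual sketch of the pattern, not Hadoop code:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) for every word in every record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the
    # framework would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Big data needs big tools", "data beats opinion"]
counts = reduce_phase(shuffle(map_phase(lines)))  # e.g. counts["big"] == 2
```

In a real cluster, each phase runs in parallel across many machines; the logic per phase stays this simple.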

Overview of Open Source and commercial software

  • Selection of R-project package
  • Python libraries
  • Hadoop and Mahout
  • Selected Apache projects related to Big Data and Analytics
  • Selected commercial solutions
  • Integration with existing software and data sources

Day 2: Mahout and Spark (8.5 hours)

Implementing Recommendation Systems with Mahout

  • Introduction to recommender systems
  • Representing recommender data
  • Making recommendations
  • Optimizing recommendations
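Under the hood, collaborative filtering of the kind Mahout implements rests on user-to-user (or item-to-item) similarity. The sketch below shows the idea with cosine similarity over a made-up ratings dictionary; it is a conceptual illustration, not the Mahout API:

```python
import math

# Hypothetical user -> {item: rating} data for illustration only.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 4, "inception": 5, "avatar": 4},
    "carol": {"titanic": 5, "notebook": 4, "matrix": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user, k=1):
    """Score items the user has not rated by similarity-weighted
    ratings from other users, and return the top k."""
    sims = {other: cosine(ratings[user], ratings[other])
            for other in ratings if other != user}
    scores = {}
    for other, sim in sims.items():
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here alice's tastes align with bob's, so bob's unseen items score highest for her; Mahout applies the same principle to millions of users.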

Spark basics

  • Spark and Hadoop
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Spark Streaming)
  • Labs: Installing and running Spark
  • Running Spark in local mode
  • Spark web UI
  • Spark shell
  • Inspecting RDDs
  • Labs: Spark shell exploration
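A key Spark concept worth previewing is lazy evaluation: transformations only describe a pipeline, and nothing executes until an action asks for results. Plain-Python generators exhibit the same behaviour, so the following sketch (not PySpark code) illustrates it:

```python
# Transformations build a lazy pipeline; nothing runs until an
# "action" pulls results through. Generators behave the same way.

log = []  # records when each source element is actually processed

def rdd_map(data, fn):
    for x in data:          # lazy: elements are pulled one at a time
        log.append(x)
        yield fn(x)

def rdd_filter(data, pred):
    for x in data:
        if pred(x):
            yield x

numbers = range(1, 6)
pipeline = rdd_filter(rdd_map(numbers, lambda x: x * x),
                      lambda x: x % 2 == 1)

assert log == []            # defining the pipeline did no work yet
result = list(pipeline)     # the "action" forces evaluation: [1, 9, 25]
```

Spark adds partitioning, fault tolerance, and distribution on top, but the lazy pipeline model is exactly this.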

Spark API programming

  • Introduction to Spark API / RDD API
  • Submitting the first program to Spark
  • Debugging / logging
  • Configuration properties

Spark and Hadoop

  • Hadoop Intro (HDFS / YARN)
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark

Spark Operations

  • Deploying Spark in production
  • Sample deployment templates
  • Configurations
  • Monitoring
  • Troubleshooting

Day 3: Google Cloud Platform Big Data & Machine Learning Fundamentals (4 hours)

Data Analytics on the Cloud

  • What is the Google Cloud Platform?
  • GCP Big Data Products
  • CloudSQL: your SQL database on the cloud
    • A no-ops database
    • Lab: importing data into CloudSQL and running queries on rentals data
  • Dataproc
    • Managed Hadoop + Pig + Spark on the cloud
    • Lab: Machine Learning with SparkML

Scaling data analysis

  • Fast random access
    • Datastore: Key-Entity
    • BigTable: wide-column
  • Datalab
    • Why Datalab? (interactive, iterative)
    • Demo: Sample notebook in Datalab
  • BigQuery
    • Interactive queries on petabytes
    • Lab: Build machine learning dataset
  • Machine Learning with TensorFlow
    • TensorFlow
    • Lab: Train and use neural network
  • Fully built models for common needs
    • Vision API
    • Translate API
    • Lab: Translate
  • Genomics API (optional)
    • What is linkage disequilibrium?
    • Finding LD using Dataflow and BigQuery
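The neural-network lab above builds on the idea of a trainable artificial neuron. As a warm-up, the sketch below trains a single perceptron on the AND function with the classic perceptron update rule; frameworks like TensorFlow scale the same idea to many layers trained by gradient descent. (Pure illustration; the learning rate and epoch count are arbitrary.)

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """One artificial neuron with a step activation, trained by the
    perceptron rule: nudge the weights by (target - output) * input."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Truth table for logical AND: output 1 only when both inputs are 1.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this training loop finds a correct weight setting.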

Data processing architectures

  • Asynchronous processing with TaskQueues
  • Message-oriented architectures with Pub/Sub
  • Creating pipelines with Dataflow
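The decoupling that Pub/Sub provides can be previewed in miniature with a thread-safe queue: a producer publishes messages while a consumer processes them independently, and neither needs to know about the other. A plain-Python sketch (the sentinel-based shutdown is our own convention, not part of Pub/Sub):

```python
import queue
import threading

# The queue stands in for a Pub/Sub topic: the producer publishes
# to it and the consumer subscribes, each running independently.
topic = queue.Queue()
results = []

def producer():
    for i in range(5):
        topic.put(i)
    topic.put(None)  # sentinel: signal that no more messages are coming

def consumer():
    while True:
        msg = topic.get()
        if msg is None:
            break
        results.append(msg * 10)  # stand-in for real message processing

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

At cloud scale, Pub/Sub adds durable storage, fan-out to many subscribers, and at-least-once delivery, but the producer/consumer decoupling is the same.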

Summary

  • Where to go from here
  • Resources