Understanding of traditional data management and analysis methods like SQL, data warehouses, business intelligence, OLAP, etc. Understanding of basic statistics and probability (mean, variance, probability, conditional probability, etc.).
Audience
If you are trying to make sense of the data you have access to, or want to analyze unstructured data available on the web (such as Twitter, LinkedIn, etc.), this course is for you.
It is mostly aimed at people who need to choose what data is worth collecting and what is worth analyzing.
It is not aimed at people configuring the solution, though those people will still benefit from the big picture.
Delivery Mode
During the course delegates will be presented with working examples of mostly open source technologies.
Short lectures will be followed by presentations and simple exercises completed by the participants.
Content and Software used
All software used is updated each time the course is run, so the newest available versions are covered.
The course covers the whole process: from obtaining, formatting, processing, and analyzing data to automating the decision-making process with machine learning.
Day 1: Big Data Analytics (8.5 hours)
Quick Overview
- Data Sources
- Mining Data
- Recommender systems
- Data types
- Structured vs unstructured
- Static vs streamed
- Data-driven vs user-driven analytics
- Data validity
Models and Classification
- Statistical Models
- Classification
- Clustering: k-groups, k-means, nearest neighbours (a k-means sketch follows this list)
- Ant colonies, bird flocking
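To make the clustering topic concrete, here is a minimal k-means sketch using scikit-learn; the toy data, the cluster count, and the library choice are illustrative, not prescribed by the course materials.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset: two loose groups of points
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # group around (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # group around (5, 5)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids
```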
Predictive Models
- Decision trees
- Support vector machines
- Naive Bayes classification (sketched after this list)
- Markov models
- Regression
- Ensemble methods
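As a taste of the classifiers listed above, a minimal Naive Bayes sketch with scikit-learn; the Iris dataset and the 70/30 split are illustrative choices, not part of the course materials.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative dataset: the classic Iris flowers, bundled with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a Gaussian Naive Bayes classifier and report held-out accuracy
model = GaussianNB().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```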
Building Models
- Data Preparation (MapReduce; a conceptual sketch follows this list)
- Data cleansing
- Developing and testing a model
- Model evaluation, deployment and integration
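The MapReduce preparation step can be illustrated in plain Python; the word count below is conceptual only, since a real job would run on a cluster framework such as Hadoop or Spark.

```python
from collections import defaultdict

# Toy input; in practice these would be files distributed across a cluster
documents = ["big data big models", "data cleansing and data preparation"]

# Map phase: emit (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 3, ...}
```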
Overview of Open Source and commercial software
- Selected R packages
- Python libraries (a pandas sketch follows this list)
- Hadoop and Mahout
- Selected Apache projects related to Big Data and Analytics
- Selected commercial solutions
- Integration with existing software and data sources
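As a taste of the Python side of that list, a small pandas sketch; the data is made up for illustration.

```python
import pandas as pd

# Made-up rental prices; real data would come from a file or database
df = pd.DataFrame({
    "city":  ["London", "London", "Krakow", "Krakow"],
    "price": [120.0, None, 80.0, 95.0],
})

df = df.dropna(subset=["price"])           # basic cleansing: drop missing prices
print(df.groupby("city")["price"].mean())  # average price per city
```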
Day 2: Mahout and Spark (8.5 hours)
Implementing Recommendation Systems with Mahout
- Introduction to recommender systems
- Representing recommender data
- Making recommendations
- Optimizing recommendations (a collaborative-filtering sketch follows this list)
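Mahout's recommender API is Java, but the idea behind user-based collaborative filtering can be sketched in a few lines of Python; the ratings below are made up and cosine similarity is just one common choice of measure.

```python
import math

# Made-up user -> item -> rating data
ratings = {
    "alice": {"item1": 5.0, "item2": 3.0, "item3": 4.0},
    "bob":   {"item1": 4.0, "item2": 3.5, "item4": 5.0},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Recommend to alice the items bob rated that she has not seen,
# weighted by how similar the two users' tastes are
sim = cosine_similarity(ratings["alice"], ratings["bob"])
candidates = {item: rating * sim
              for item, rating in ratings["bob"].items()
              if item not in ratings["alice"]}
print(sorted(candidates.items(), key=lambda kv: -kv[1]))  # [('item4', ...)]
```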
Spark basics
- Spark and Hadoop
- Spark concepts and architecture
- Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
- Labs: Installing and running Spark
- Running Spark in local mode
- Spark web UI
- Spark shell
- Inspecting RDDs
- Labs: Spark shell exploration
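A shell exploration along the lines of the lab might look like this in pyspark, where the SparkContext is predefined as `sc`; the numbers are arbitrary.

```python
# Typed into the pyspark shell, where `sc` already exists
rdd = sc.parallelize(range(1, 101))       # distribute the numbers 1..100
evens = rdd.filter(lambda n: n % 2 == 0)  # lazy transformation, nothing runs yet
print(evens.count())                      # action: triggers computation -> 50
print(evens.take(5))                      # first elements -> [2, 4, 6, 8, 10]
```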
Spark API programming
- Introduction to Spark API / RDD API
- Submitting the first program to Spark (see the word-count sketch after this list)
- Debugging / logging
- Configuration properties
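A first program to submit could be a classic word count; the input file name is hypothetical, everything else uses the standard RDD API.

```python
# wordcount.py -- a minimal standalone Spark job
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount")
sc = SparkContext(conf=conf)

counts = (sc.textFile("input.txt")               # hypothetical input file
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # sum counts per word

for word, count in counts.collect():
    print(word, count)

sc.stop()
```

It can be run in local mode with, for example, `spark-submit --master local[2] wordcount.py`.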
Spark and Hadoop
- Hadoop Intro (HDFS / YARN)
- Hadoop + Spark architecture
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
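Reading HDFS files from Spark is essentially a change of the input URI; the namenode host, port, and path below are hypothetical.

```python
# Inside a Spark job or the pyspark shell (where `sc` is predefined)
logs = sc.textFile("hdfs://namenode:8020/data/logs/access.log")  # hypothetical path
errors = logs.filter(lambda line: "ERROR" in line)
print(errors.count())
```

The same job is sent to a YARN cluster by submitting with `spark-submit --master yarn` instead of a local master.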
Spark Operations
- Deploying Spark in production
- Sample deployment templates
- Configurations (a programmatic example follows this list)
- Monitoring
- Troubleshooting
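Configuration properties can be set in spark-defaults.conf, on the spark-submit command line, or programmatically; a sketch of the programmatic route, with illustrative values:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("configured-job")
        .set("spark.executor.memory", "2g")      # memory per executor (illustrative)
        .set("spark.eventLog.enabled", "true"))  # keep event logs for the web UI
sc = SparkContext(conf=conf)

print(sc.getConf().get("spark.executor.memory"))  # -> 2g
sc.stop()
```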
Day 3: Google Cloud Platform Big Data & Machine Learning Fundamentals (4 hours)
Data Analytics on the Cloud
- What is the Google Cloud Platform?
- GCP Big Data Products
- CloudSQL: your SQL database on the cloud
- A no-ops database
- Lab: importing data into CloudSQL and running queries on rentals data
- Dataproc
- Managed Hadoop + Pig + Spark on the cloud
- Lab: Machine Learning with SparkML
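The SparkML lab builds a model with Spark's DataFrame-based ML API; a minimal sketch of that style, with inline made-up data rather than the lab's dataset:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("sparkml-sketch").getOrCreate()

# Two made-up labelled examples; the lab uses a real dataset instead
train = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)),
     (0.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
spark.stop()
```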
Scaling data analysis
- Fast random access
- Datastore: Key-Entity
- BigTable: wide-column
- Datalab
- Why Datalab? (interactive, iterative)
- Demo: Sample notebook in Datalab
- BigQuery
- Interactive queries on petabytes (a query sketch follows this section)
- Lab: Build machine learning dataset
- Machine Learning with TensorFlow
- TensorFlow
- Lab: Train and use a neural network
- Fully built models for common needs
- Vision API
- Translate API
- Lab: Translate
- Genomics API (optional)
- What is linkage disequilibrium?
- Finding LD using Dataflow and BigQuery
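An interactive BigQuery query can also be issued from Python; the sketch below uses the google-cloud-bigquery client against a real public dataset and assumes application default credentials are already configured.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Runs the query and streams back the result rows
for row in client.query(query).result():
    print(row.name, row.total)
```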
Data processing architectures
- Asynchronous processing with TaskQueues
- Message-oriented architectures with Pub/Sub
- Creating pipelines with Dataflow
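Dataflow pipelines are written with the Apache Beam SDK; the sketch below runs a word count locally with the DirectRunner, and the file names are hypothetical.

```python
import apache_beam as beam

# With no runner specified, Beam uses the local DirectRunner
with beam.Pipeline() as pipeline:
    (pipeline
     | "Read"  >> beam.io.ReadFromText("input.txt")   # hypothetical input file
     | "Words" >> beam.FlatMap(lambda line: line.split())
     | "Pairs" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Write" >> beam.io.WriteToText("counts"))      # hypothetical output prefix
```

The same pipeline runs on Google Cloud by switching the runner to DataflowRunner and supplying project and staging options.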
Summary
- Where to go from here
- Resources