Attendees must be familiar with Java for the section on writing Hadoop queries in Java.
Part 1
This course introduces and then deepens two areas: predictive analytics, and big data tools for data administration and querying. An introduction to machine learning and an exploration of the probability concepts behind it provide both an overview and a set of tools that can be reused across all types of models. Linear and logistic regression models are explored in full, and neural networks are introduced. On a Cloudera installation with Hadoop and Spark, attendees explore the fundamental concepts of Hadoop (HDFS, MapReduce, ...) and Spark (RDDs, ...) as well as the tools for querying and administering data.
Part 2
This course covers the installation of Hadoop and Spark, both manually and using the Cloudera platform. R and Python are integrated with Spark and with one another. Further predictive analytics models are explored in depth.
Part 1 (7-9 days)
Predictive Analytics
- Introduction to machine learning
- Introduction to probability
- Linear Regression
- Logistic Regression
- Neural Networks
Hadoop
- What is Hadoop
- Structured and Unstructured Data
- HDFS
- Clusters
- The Cloudera GUI
- Moving Data into Hadoop
- Moving Data out of Hadoop
- Hive
- Impala
- Introduction to Oozie (Time Permitting)
- Pig
- MapReduce
- Streamlining HDFS
- Performance Diagnosis and Optimization Techniques
Spark
- Spark Ecosystem
- Spark Programming Model
- Running Applications
Part 2 (5 days)
Python / R basic integration
- Integration options
- Calling Python from R
- Calling R from Python
Installations
- Installation of Hadoop and Spark using Cloudera
- Hadoop manual installation
- Standalone Spark manual installation
- Spark integration with Hadoop
Spark for R and Python
- Spark and Data
- Spark SQL
- Spark Streaming
- SparkR
- PySpark
Predictive Analytics 2
- Support Vector Machines
- Tree-Based Methods
- Ensemble Methods
- Time Series