Course Code: preddata
Duration: 98 hours
Prerequisites:

Attendees must be familiar with Java for the part of the course that covers writing Hadoop queries in Java.

Overview:

Part 1

This course introduces and then deepens two areas: predictive analytics on the one hand, and big data tools for data administration and querying on the other. An introduction to machine learning and an exploration of probability concepts for machine learning provide both an overview and a set of tools that can be reused across all types of models. Linear and logistic regression models are explored in depth, and an introduction to neural networks is given. On a Cloudera installation with Hadoop and Spark, attendees explore the fundamental concepts of Hadoop (HDFS, MapReduce, ...) and Spark (RDDs, ...) as well as the tools for querying and administering data.

Part 2

This course covers the installation of Hadoop and Spark, both in the standard (manual) way and using the Cloudera platform. R and Python are integrated with Spark and with one another. Further predictive analytics models are explored.

Course Outline:

Part 1 (7-9 days)

Predictive Analytics

  • Introduction to machine learning
  • Introduction to probability
  • Linear Regression
  • Logistic Regression
  • Neural Networks

Hadoop

  • What is Hadoop
  • Structured and Unstructured Data
  • HDFS
  • Clusters
  • The Cloudera GUI
  • Moving Data into Hadoop
  • Moving Data out of Hadoop
  • Hive
  • Impala
  • Introduction to Oozie (Time Permitting)
  • Pig
  • MapReduce
  • Streamlining HDFS
  • Performance Diagnosis and Optimization Techniques

Spark

  • Spark Ecosystem
  • Spark Programming Model
  • Running Applications

Part 2 (5 days)

Python / R basic integration

  • Integration options
  • Calling Python from R
  • Calling R from Python

Installations

  • Installation of Hadoop and Spark using Cloudera
  • Hadoop manual installation
  • Standalone Spark manual installation
  • Spark integration with Hadoop

Spark for R and Python

  • Spark and Data
  • Spark SQL
  • Spark Streaming
  • SparkR
  • PySpark

Predictive Analytics 2

  • Support Vector Machines
  • Tree-Based Methods
  • Ensemble Methods
  • Time Series