Spark for Developers and TensorFlow

Course Code: spdevtensor

Duration: 21 hours

Prerequisites:

Familiarity with Java, Scala, or Python (labs are in Scala and Python)
Basic understanding of the Linux development environment (command-line navigation, editing files with vi or nano)
Familiarity with statistics

Overview:

This course covers both Apache Spark and TensorFlow.

Apache Spark:

Students will learn how Spark fits into the Big Data ecosystem and how to use Spark for data analysis. The course covers the Spark shell for interactive data analysis, Spark internals, Spark APIs, Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.

TensorFlow:

TensorFlow is the second-generation API of Google's open source software library for deep learning. The system is designed to facilitate machine learning research and to make the transition from research prototype to production system quick and easy. This course is intended for those who want to use TensorFlow in their deep learning projects.

Course Outline:

Spark

Spark Basics

Background and history
Spark and Hadoop
Spark concepts and architecture
Spark ecosystem (Core, Spark SQL, MLlib, Streaming)

First Look at Spark

Running Spark in local mode
Spark web UI
Spark shell
Analyzing dataset – part 1
Inspecting RDDs

RDDs

RDD concepts
Partitions
RDD Operations / transformations
RDD types
Key-Value pair RDDs
MapReduce on RDD
Caching and persistence
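
As a rough illustration of the "Key-Value pair RDDs" and "MapReduce on RDD" topics above, the sketch below mimics a word count in plain Python. It does not use Spark; the helper names `map_phase` and `reduce_by_key` and the sample lines are made up for illustration. In PySpark the corresponding chain would roughly be `flatMap`, `map`, and `reduceByKey` on an RDD.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    """Emit one (word, 1) pair per word (hypothetical helper, not a Spark API)."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Made-up sample data standing in for the lines of a text file.
lines = ["spark makes big data simple", "big data needs spark"]

pairs = map_phase(lines)                            # key-value pairs
counts = reduce_by_key(pairs, lambda a, b: a + b)   # sum the 1s per word
print(counts)
```

On a real cluster the pairs would be spread across partitions and combined per key after a shuffle; the sketch only shows what is computed, not how Spark distributes the work.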

Spark API programming

Introduction to Spark API / RDD API
Submitting the first program to Spark
Configuration properties

Spark SQL

SQL support in Spark
DataFrames
Defining tables and importing datasets
Querying DataFrames using SQL
Storage formats: JSON / Parquet

MLlib

MLlib introduction
MLlib algorithms

GraphX

GraphX library overview
GraphX APIs

Spark Streaming

Streaming overview
Evaluating Streaming platforms
Streaming operations
Sliding window operations
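
To make the "Sliding window operations" topic above more concrete, here is a minimal plain-Python sketch of the idea, assuming a stream that arrives as small batches. It does not use the Spark Streaming API; the function `sliding_windows` and the sample batches are hypothetical. The two parameters correspond to what Spark Streaming calls the window length and the sliding interval, both expressed here as a number of batches.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Every slide_interval batches, yield the items of the last window_length batches.

    Hypothetical helper for illustration only, not a Spark API.
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = [item for batch in batches[start:end] for item in batch]
        yield end, window

# Made-up micro-batches, one list per batch interval.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]

# Sum over the last 3 batches, recomputed every 2 batches.
sums = [(end, sum(window))
        for end, window in sliding_windows(batches, window_length=3, slide_interval=2)]
print(sums)
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap and some items are counted in more than one result.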

Spark and Hadoop (optional and time permitting)

Hadoop Intro (HDFS / YARN)
Hadoop + Spark architecture
Running Spark on Hadoop YARN
Processing HDFS files using Spark

Spark Performance and Tuning (optional and time permitting)

Broadcast variables
Accumulators
Memory management & caching

TensorFlow

Machine Learning and Recurrent Neural Networks (RNN) basics

NN and RNN
Backpropagation
Long short-term memory (LSTM)

TensorFlow Basics

Creating, Initializing, Saving, and Restoring TensorFlow variables
Feeding, Reading and Preloading TensorFlow Data
How to use TensorFlow infrastructure to train models at scale
Visualizing and Evaluating models with TensorBoard

TensorFlow Mechanics

Prepare the Data
- Download
- Inputs and Placeholders
Build the Graph
- Inference
- Loss
- Training
Train the Model
- The Graph
- The Session
- Train Loop
Evaluate the Model
- Build the Eval Graph
- Eval Output
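
The steps listed above (prepare the data, build inference, loss, and training, run a train loop, evaluate) can be sketched without any library as a tiny linear regression. This is only an illustration under stated assumptions: it does not use the TensorFlow API, the data is made up, and the function names simply mirror the outline. In TensorFlow itself these pieces would be expressed as graph operations and run in a session.

```python
# Prepare the data: made-up points on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def inference(w, b, x):
    """Model prediction for one input."""
    return w * x + b

def loss(w, b):
    """Mean squared error over the whole dataset."""
    return sum((inference(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def training_step(w, b, learning_rate):
    """One gradient descent update using the analytic gradient of the loss."""
    n = len(xs)
    grad_w = sum(2 * (inference(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (inference(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Train loop: start from zeros and repeat the update.
w, b = 0.0, 0.0
initial_loss = loss(w, b)
for step in range(2000):
    w, b = training_step(w, b, learning_rate=0.05)

# Evaluate: compare the loss before and after training.
final_loss = loss(w, b)
print(w, b, initial_loss, final_loss)
```

The learning rate and step count were chosen for this toy dataset; a larger learning rate can make the updates diverge.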

Advanced Usage (optional and time permitting)

Threading and Queues
Distributed TensorFlow
Writing Documentation and Sharing your Model
Customizing Data Readers
Using GPUs
Manipulating TensorFlow Model Files

TensorFlow Serving (optional and time permitting)

Introduction
Basic Serving Tutorial
Advanced Serving Tutorial
Serving Inception Model Tutorial