Course Code: spdevtensor
Duration: 21 hours
Prerequisites:
  • Familiarity with Java, Scala, or Python (labs are in Scala and Python)
  • Basic understanding of the Linux development environment (command-line navigation, editing files with vi or nano)
  • Familiarity with statistics
Overview:

This course combines Apache Spark & TensorFlow.

Apache Spark:

Students will learn how Spark fits into the Big Data ecosystem and how to use it for data analysis. The course covers the Spark shell for interactive data analysis, Spark internals, Spark APIs, Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.

TensorFlow:

TensorFlow is the second-generation API of Google's open-source software library for deep learning. The system is designed to facilitate machine learning research and to make the transition from research prototype to production system quick and easy. This course is intended for those who want to use TensorFlow in their deep learning projects.

Course Outline:

Spark

Spark Basics

  • Background and history
  • Spark and Hadoop
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)

First Look at Spark

  • Running Spark in local mode
  • Spark web UI
  • Spark shell
  • Analyzing dataset – part 1
  • Inspecting RDDs

RDDs

  • RDD concepts
  • Partitions
  • RDD Operations / transformations
  • RDD types
  • Key-Value pair RDDs
  • MapReduce on RDD
  • Caching and persistence
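
As a rough illustration of the key-value pair and MapReduce topics above, the following plain-Python sketch mimics a word count. It does not use Spark: the helper names `map_phase` and `reduce_by_key` are hypothetical stand-ins for the roles that the RDD methods `flatMap`/`map` and `reduceByKey` play in a Spark job, and the input lines are made up.

```python
def map_phase(lines):
    """Emit a (word, 1) pair for every word, like flatMap followed by map."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, func):
    """Combine values that share a key, like reduceByKey on a pair RDD."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Made-up input standing in for the lines of a text file.
lines = ["spark makes big data simple", "big data needs spark"]
counts = reduce_by_key(map_phase(lines), lambda a, b: a + b)
print(counts)
```

In Spark the same two phases would run in parallel across partitions; this sketch only shows the data flow on a single machine.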

Spark API programming

  • Introduction to Spark API / RDD API
  • Submitting the first program to Spark
  • Configuration properties

Spark SQL

  • SQL support in Spark
  • DataFrames
  • Defining tables and importing datasets
  • Querying DataFrames using SQL
  • Storage formats: JSON / Parquet

MLlib

  • MLlib intro
  • MLlib algorithms

GraphX

  • GraphX library overview
  • GraphX APIs

Spark Streaming

  • Streaming overview
  • Evaluating Streaming platforms
  • Streaming operations
  • Sliding window operations
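
The sliding window idea listed above can be sketched in plain Python, without Spark. The function name `sliding_windows` and the sample batches are hypothetical. In Spark Streaming the window length and slide interval are given as durations that must be multiples of the batch interval; here they are simply counted in batches.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Yield the events covered by each window.

    batches: list of micro-batches (each a list of events), oldest first.
    window_length: number of batches covered by each window.
    slide_interval: number of batches between consecutive windows.
    """
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        yield [event for batch in window for event in batch]


# Five made-up micro-batches of numeric events.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
window_sums = [
    sum(events)
    for events in sliding_windows(batches, window_length=3, slide_interval=2)
]
print(window_sums)  # [15, 30]
```

The two windows overlap on the third batch, which is why its events contribute to both sums.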

Spark and Hadoop (optional and time permitting)

  • Hadoop Intro (HDFS / YARN)
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark

Spark Performance and Tuning (optional and time permitting)

  • Broadcast variables
  • Accumulators
  • Memory management & caching

TensorFlow

Machine Learning and Recurrent Neural Networks (RNN) basics

  • NN and RNN
  • Backpropagation
  • Long short-term memory (LSTM)
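
To give a flavor of the backpropagation topic above, here is a minimal plain-Python sketch for a single sigmoid neuron with a squared-error loss. It is not TensorFlow code, and the weights, input, and target are arbitrary values chosen for illustration. The gradients come from the chain rule and are compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(w * x + b)


def loss(w, b, x, y):
    """Squared error between the prediction and the target."""
    return 0.5 * (forward(w, b, x) - y) ** 2


def backward(w, b, x, y):
    """Gradients of the loss with respect to w and b via the chain rule."""
    a = forward(w, b, x)
    dloss_da = a - y          # derivative of 0.5 * (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    dz_dw, dz_db = x, 1.0     # z = w * x + b
    return dloss_da * da_dz * dz_dw, dloss_da * da_dz * dz_db


# Arbitrary illustrative values.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = backward(w, b, x, y)

# Central finite differences should agree with the analytic gradients.
eps = 1e-6
numeric_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
numeric_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print(grad_w, numeric_w)
print(grad_b, numeric_b)
```

A multi-layer network applies the same chain-rule step layer by layer, passing the upstream gradient backwards.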

TensorFlow Basics

  • Creating, Initializing, Saving, and Restoring TensorFlow variables
  • Feeding, Reading and Preloading TensorFlow Data
  • How to use TensorFlow infrastructure to train models at scale
  • Visualizing and Evaluating models with TensorBoard

TensorFlow Mechanics

  • Prepare the Data
    • Download
    • Inputs and Placeholders
  • Build the Graph
    • Inference
    • Loss
    • Training
  • Train the Model
    • The Graph
    • The Session
    • Train Loop
  • Evaluate the Model
    • Build the Eval Graph
    • Eval Output
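
The workflow above (prepare the data, define inference, loss, and training, run a train loop, then evaluate) can be sketched in plain Python. This is not the TensorFlow API: the function names mirror the outline headings only by analogy, the model is a simple linear fit, and the data points are made up from the line y = 2x + 1.

```python
def inference(params, x):
    """Predict y from x with a linear model (the "inference" step)."""
    w, b = params
    return w * x + b


def loss(params, data):
    """Mean squared error over a data set."""
    return sum((inference(params, x) - y) ** 2 for x, y in data) / len(data)


def training_step(params, data, learning_rate):
    """One gradient-descent update using the analytic MSE gradients."""
    w, b = params
    n = len(data)
    grad_w = sum(2 * (inference(params, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (inference(params, x) - y) for x, y in data) / n
    return (w - learning_rate * grad_w, b - learning_rate * grad_b)


# Prepare the data: toy points from y = 2x + 1 (illustrative only).
train_data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
eval_data = [(x, 2.0 * x + 1.0) for x in [4.0, 5.0]]

# Train loop: start from zero parameters and apply repeated updates.
params = (0.0, 0.0)
for step in range(2000):
    params = training_step(params, train_data, learning_rate=0.05)

# Evaluate the model on held-out points.
eval_loss = loss(params, eval_data)
print(params, eval_loss)
```

In TensorFlow the gradients would be derived automatically from the graph rather than written by hand, but the separation into inference, loss, training, and evaluation is the same.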

Advanced Usage (optional and time permitting)

  • Threading and Queues
  • Distributed TensorFlow
  • Writing Documentation and Sharing your Model
  • Customizing Data Readers
  • Using GPUs
  • Manipulating TensorFlow Model Files

TensorFlow Serving (optional and time permitting)

  • Introduction
  • Basic Serving Tutorial
  • Advanced Serving Tutorial
  • Serving Inception Model Tutorial