Course Code: sparkhadoop
Duration: 21 hours
Course Outline:

Understanding Hadoop

Introduction

  • Hadoop history, concepts
  • Ecosystem
  • Distributions
  • High level architecture
  • Hadoop myths
  • Hadoop challenges (hardware / software)

Planning and installation

  • Selecting software, Hadoop distributions
  • Sizing the cluster, planning for growth
  • Selecting hardware and network
  • Rack topology
  • Installation
  • Multi-tenancy
  • Directory structure, logs
  • Benchmarking

HDFS operations

  • Concepts (horizontal scaling, replication, data locality, rack awareness)
  • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
  • Health monitoring
  • Command-line and browser-based administration (see the sketch after this list)
  • Adding storage, replacing defective drives
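
A minimal sketch of the programmatic side of these operations, using Hadoop's FileSystem API from Scala. The /data path is hypothetical; on a live cluster the same checks are usually run from the shell with hdfs dfs -ls and hdfs fsck.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsCheck {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs   = FileSystem.get(conf)

        // Report each file's replication factor (the knob behind HDFS
        // fault tolerance) along with its size.
        fs.listStatus(new Path("/data")).foreach { st =>
          println(f"${st.getPath}%-50s repl=${st.getReplication} size=${st.getLen}")
        }

        // Aggregate usage, useful for capacity planning.
        val summary = fs.getContentSummary(new Path("/data"))
        println(s"files=${summary.getFileCount} bytes=${summary.getLength}")

        fs.close()
      }
    }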

Data ingestion

  • Flume for logs and other data ingestion into HDFS
  • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  • Hadoop data warehousing with Hive
  • Copying data between clusters (distcp; illustrated after this list)
  • Using S3 as a complement to HDFS
  • Data ingestion best practices and architectures
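
To make the distcp idea concrete, here is a single-process sketch in Scala using Hadoop's FileUtil; the real distcp command parallelizes the same copy as a MapReduce job. Cluster URIs and paths are hypothetical.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    object CrossClusterCopy {
      def main(args: Array[String]): Unit = {
        val conf  = new Configuration()
        // Open handles to the source and destination clusters.
        val srcFs = FileSystem.get(new URI("hdfs://clusterA:8020"), conf)
        val dstFs = FileSystem.get(new URI("hdfs://clusterB:8020"), conf)

        // Equivalent in spirit to:
        //   hadoop distcp hdfs://clusterA/logs/... hdfs://clusterB/backup/...
        FileUtil.copy(
          srcFs, new Path("/logs/2024-01-01"),
          dstFs, new Path("/backup/logs/2024-01-01"),
          false,  // deleteSource: keep the original
          conf)
      }
    }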

MapReduce operations and administration

  • Parallel computing before MapReduce: comparing HPC and Hadoop administration
  • MapReduce cluster loads
  • Nodes and Daemons (JobTracker, TaskTracker)
  • MapReduce UI walkthrough
  • MapReduce configuration
  • Job configuration (see the WordCount sketch after this list)
  • Optimizing MapReduce
  • Fool-proofing MapReduce: what to tell your programmers
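
The job-configuration topics above come together in this minimal WordCount, written in Scala against the org.apache.hadoop.mapreduce API; class names, tuning values, and paths are illustrative only.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Mapper: emit (word, 1) for every token in the input split.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Reducer: sum counts per word; reused as a combiner below.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it  = values.iterator()
        while (it.hasNext) sum += it.next().get
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "wordcount")
        job.setJarByClass(WordCount.getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setCombinerClass(classOf[SumReducer])  // combiner cuts shuffle traffic
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        job.setNumReduceTasks(4)                   // a typical tuning knob
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }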

YARN: new architecture and new capabilities

  • YARN design goals and implementation architecture
  • New actors: ResourceManager, NodeManager, ApplicationMaster
  • Installing YARN
  • Job scheduling under YARN (sample configuration after this list)
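
As a taste of what scheduler configuration looks like, a fragment of yarn-site.xml; the values shown are assumptions for illustration, not sizing recommendations.

    <!-- Illustrative yarn-site.xml fragment; values are assumptions. -->
    <configuration>
      <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      </property>
      <property>
        <!-- Memory a NodeManager may hand out to containers on this node. -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
    </configuration>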

Apache Spark

Why Spark?

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell (sample session after this list)
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
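
A minimal spark-shell session tying these topics together; the input path is hypothetical, and sc is the SparkContext the shell creates for you.

    scala> val lines = sc.textFile("hdfs:///data/sample.txt")
    scala> val errors = lines.filter(_.contains("ERROR"))  // lazy transformation
    scala> errors.count()                                  // action: runs the job
    scala> errors.take(3).foreach(println)                 // peek at a few results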

Working with RDDs

  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations (example after this list)
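
For instance, a word-count sketch in the shell showing pair-RDD operations and their MapReduce shape (input path hypothetical):

    // flatMap/map play the mapper role, reduceByKey the combiner + reducer.
    val counts = sc.textFile("hdfs:///data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))   // key-value pair RDD
      .reduceByKey(_ + _)       // per-key aggregation, shuffled like MapReduce
    counts.sortBy(_._2, ascending = false).take(10).foreach(println)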

Running Spark on a Cluster

  • Overview
  • A Spark Standalone Cluster
  • The Spark Standalone Web UI

Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality
  • Working With Partitions (sketch after this list)
  • Executing Parallel Operations
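
A shell sketch of inspecting and controlling partitioning; the path and partition counts are illustrative values, not recommendations.

    val rdd = sc.textFile("hdfs:///data/big", minPartitions = 8)
    println(rdd.getNumPartitions)      // typically one partition per HDFS block
    val wider = rdd.repartition(16)    // full shuffle to change parallelism
    // mapPartitions runs once per partition, amortizing per-record setup cost:
    val parsed = rdd.mapPartitions(it => it.map(_.toUpperCase))
    // foreachPartition runs on the executors; output lands in executor stdout.
    parsed.foreachPartition(it => println(s"records in partition: ${it.size}"))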

Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence (sketch after this list)
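
A short shell sketch of lineage and persistence (input path hypothetical):

    import org.apache.spark.storage.StorageLevel

    val base    = sc.textFile("hdfs:///data/sample.txt")
    val cleaned = base.filter(_.nonEmpty).map(_.toLowerCase)
    println(cleaned.toDebugString)  // the lineage Spark replays to rebuild lost partitions
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory is short
    cleaned.count()                 // first action materializes the cache
    cleaned.count()                 // now served from the cache
    cleaned.unpersist()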

Writing Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application (skeleton after this list)
  • Logging
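
A skeleton of a standalone application, assuming it is packaged into a jar and launched with spark-submit; the app name and memory setting are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Unlike the shell, a standalone app creates its own SparkContext.
        val conf = new SparkConf()
          .setAppName("MyApp")
          .set("spark.executor.memory", "2g")  // illustrative property
        val sc = new SparkContext(conf)
        try {
          val n = sc.textFile(args(0)).count()
          println(s"line count: $n")
        } finally {
          sc.stop()  // always release cluster resources
        }
      }
    }

It would be submitted with something like: spark-submit --class MyApp --master spark://master:7077 myapp.jar hdfs:///data/sample.txt (master URL and jar name hypothetical).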

Spark, Hadoop, and the Enterprise Data Center

  • Overview
  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce