Course Code: sparkhadoop
Duration: 21 hours
Course Outline:

Understanding Hadoop

Introduction

  • Hadoop history, concepts
  • Ecosystem
  • Distributions
  • High level architecture
  • Hadoop myths
  • Hadoop challenges (hardware / software)

Planning and installation

  • Selecting software, Hadoop distributions
  • Sizing the cluster, planning for growth
  • Selecting hardware and network
  • Rack topology
  • Installation
  • Multi-tenancy
  • Directory structure, logs
  • Benchmarking

HDFS operations

  • Concepts (horizontal scaling, replication, data locality, rack awareness)
  • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
  • Health monitoring
  • Command-line and browser-based administration (see the sketch after this list)
  • Adding storage, replacing defective drives
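
A minimal sketch of the programmatic side of these operations, using Hadoop's FileSystem API from Scala. The /data path is hypothetical; on a live cluster the same checks are usually run from the shell with hdfs dfs -ls and hdfs fsck.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsCheck {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs   = FileSystem.get(conf)

        // Report each file's replication factor (the knob behind HDFS
        // fault tolerance) along with its size.
        fs.listStatus(new Path("/data")).foreach { st =>
          println(f"${st.getPath}%-50s repl=${st.getReplication} size=${st.getLen}")
        }

        // Aggregate usage, useful for capacity planning.
        val summary = fs.getContentSummary(new Path("/data"))
        println(s"files=${summary.getFileCount} bytes=${summary.getLength}")

        fs.close()
      }
    }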

Data ingestion

  • Flume for logs and other data ingestion into HDFS
  • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  • Hadoop data warehousing with Hive
  • Copying data between clusters (distcp; illustrated after this list)
  • Using S3 as a complement to HDFS
  • Data ingestion best practices and architectures
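
To make the distcp idea concrete, here is a single-process sketch in Scala using Hadoop's FileUtil; the real distcp command parallelizes the same copy as a MapReduce job. Cluster URIs and paths are hypothetical.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    object CrossClusterCopy {
      def main(args: Array[String]): Unit = {
        val conf  = new Configuration()
        // Open handles to the source and destination clusters.
        val srcFs = FileSystem.get(new URI("hdfs://clusterA:8020"), conf)
        val dstFs = FileSystem.get(new URI("hdfs://clusterB:8020"), conf)

        // Equivalent in spirit to:
        //   hadoop distcp hdfs://clusterA/logs/... hdfs://clusterB/backup/...
        FileUtil.copy(
          srcFs, new Path("/logs/2024-01-01"),
          dstFs, new Path("/backup/logs/2024-01-01"),
          false,  // deleteSource: keep the original
          conf)
      }
    }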

MapReduce operations and administration

  • Parallel computing before MapReduce: comparing HPC and Hadoop administration
  • MapReduce cluster loads
  • Nodes and Daemons (JobTracker, TaskTracker)
  • MapReduce UI walkthrough
  • MapReduce configuration
  • Job configuration (see the WordCount sketch after this list)
  • Optimizing MapReduce
  • Fool-proofing MapReduce: what to tell your programmers
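
The job-configuration topics above come together in this minimal WordCount, written in Scala against the org.apache.hadoop.mapreduce API; class names, tuning values, and paths are illustrative only.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Mapper: emit (word, 1) for every token in the input split.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Reducer: sum counts per word; reused as a combiner below.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it  = values.iterator()
        while (it.hasNext) sum += it.next().get
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "wordcount")
        job.setJarByClass(WordCount.getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setCombinerClass(classOf[SumReducer])  // combiner cuts shuffle traffic
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        job.setNumReduceTasks(4)                   // a typical tuning knob
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }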

YARN: new architecture and new capabilities

  • YARN design goals and implementation architecture
  • New actors: ResourceManager, NodeManager, ApplicationMaster
  • Installing YARN
  • Job scheduling under YARN (sample configuration after this list)
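
As a taste of what scheduler configuration looks like, a fragment of yarn-site.xml; the values shown are assumptions for illustration, not sizing recommendations.

    <!-- Illustrative yarn-site.xml fragment; values are assumptions. -->
    <configuration>
      <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      </property>
      <property>
        <!-- Memory a NodeManager may hand out to containers on this node. -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
    </configuration>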

Apache Spark

Why Spark?

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell (sample session after this list)
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
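
A minimal spark-shell session tying these topics together; the input path is hypothetical, and sc is the SparkContext the shell creates for you.

    scala> val lines = sc.textFile("hdfs:///data/sample.txt")
    scala> val errors = lines.filter(_.contains("ERROR"))  // lazy transformation
    scala> errors.count()                                  // action: runs the job
    scala> errors.take(3).foreach(println)                 // peek at a few results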

Working with RDDs

  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations (example after this list)
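
For instance, a word-count sketch in the shell showing pair-RDD operations and their MapReduce shape (input path hypothetical):

    // flatMap/map play the mapper role, reduceByKey the combiner + reducer.
    val counts = sc.textFile("hdfs:///data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))   // key-value pair RDD
      .reduceByKey(_ + _)       // per-key aggregation, shuffled like MapReduce
    counts.sortBy(_._2, ascending = false).take(10).foreach(println)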

Running Spark on a Cluster

  • Overview
  • A Spark Standalone Cluster
  • The Spark Standalone Web UI

Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality
  • Working With Partitions (sketch after this list)
  • Executing Parallel Operations
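
A shell sketch of inspecting and controlling partitioning; the path and partition counts are illustrative values, not recommendations.

    val rdd = sc.textFile("hdfs:///data/big", minPartitions = 8)
    println(rdd.getNumPartitions)      // typically one partition per HDFS block
    val wider = rdd.repartition(16)    // full shuffle to change parallelism
    // mapPartitions runs once per partition, amortizing per-record setup cost:
    val parsed = rdd.mapPartitions(it => it.map(_.toUpperCase))
    // foreachPartition runs on the executors; output lands in executor stdout.
    parsed.foreachPartition(it => println(s"records in partition: ${it.size}"))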

Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence (sketch after this list)
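
A short shell sketch of lineage and persistence (input path hypothetical):

    import org.apache.spark.storage.StorageLevel

    val base    = sc.textFile("hdfs:///data/sample.txt")
    val cleaned = base.filter(_.nonEmpty).map(_.toLowerCase)
    println(cleaned.toDebugString)  // the lineage Spark replays to rebuild lost partitions
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory is short
    cleaned.count()                 // first action materializes the cache
    cleaned.count()                 // now served from the cache
    cleaned.unpersist()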

Writing Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application (skeleton after this list)
  • Logging
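
A skeleton of a standalone application, assuming it is packaged into a jar and launched with spark-submit; the app name and memory setting are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // Unlike the shell, a standalone app creates its own SparkContext.
        val conf = new SparkConf()
          .setAppName("MyApp")
          .set("spark.executor.memory", "2g")  // illustrative property
        val sc = new SparkContext(conf)
        try {
          val n = sc.textFile(args(0)).count()
          println(s"line count: $n")
        } finally {
          sc.stop()  // always release cluster resources
        }
      }
    }

It would be submitted with something like: spark-submit --class MyApp --master spark://master:7077 myapp.jar hdfs:///data/sample.txt (master URL and jar name hypothetical).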

Spark, Hadoop, and the Enterprise Data Center

  • Overview
  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce