Course Code: sparkhadoop
Duration: 21 hours
Course Outline:
Understanding Hadoop
Introduction
- Hadoop history, concepts
- Ecosystem
- Distributions
- High level architecture
- Hadoop myths
- Hadoop challenges (hardware / software)
Planning and installation
- Selecting software, Hadoop distributions
- Sizing the cluster, planning for growth
- Selecting hardware and network
- Rack topology
- Installation
- Multi-tenancy
- Directory structure, logs
- Benchmarking
HDFS operations
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring
- Command-line and browser-based administration (a health-check sketch follows this list)
- Adding storage, replacing defective drives
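A minimal health-check sketch for the command-line administration item above. It assumes the hdfs CLI is on the PATH and that the script runs as a user with HDFS superuser rights; Python is used here only as a convenient wrapper around the shell commands the course actually covers.

```python
# Hedged sketch: wraps standard HDFS admin commands in Python.
# Assumes the `hdfs` CLI is installed and on the PATH.
import subprocess

def hdfs(*args):
    """Run an hdfs shell command and return its stdout as text."""
    result = subprocess.run(["hdfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Cluster capacity, usage, and live/dead DataNode counts.
print(hdfs("dfsadmin", "-report"))

# Consistency check: flags under-replicated, corrupt, and missing blocks.
print(hdfs("fsck", "/"))
```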
Data ingestion
- Flume for ingesting logs and other data into HDFS
- Sqoop for importing from SQL databases into HDFS and exporting back to SQL (an import is sketched after this list)
- Hadoop data warehousing with Hive
- Copying data between clusters (distcp)
- Using S3 as a complement to HDFS
- Data ingestion best practices and architectures
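A hedged sketch of the Sqoop import mentioned above, driven from Python for consistency with the other examples in this outline; the database host db.example.com, the shop database, and the orders table are all hypothetical.

```python
# Hedged sketch: a Sqoop import launched from Python.
# Assumes the `sqoop` CLI is installed; connection details are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/shop",  # hypothetical source DB
    "--username", "etl",
    "--table", "orders",
    "--target-dir", "/data/raw/orders",  # HDFS destination directory
    "--num-mappers", "4",                # parallel import tasks
], check=True)
```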
MapReduce operations and administration
- Parallel computing before MapReduce: comparing HPC and Hadoop administration
- MapReduce cluster loads
- Nodes and Daemons (JobTracker, TaskTracker)
- MapReduce UI walkthrough
- MapReduce configuration
- Job configuration
- Optimizing MapReduce
- Fool-proofing MR: what to tell your programmers (a streaming word count follows this list)
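A minimal Hadoop Streaming word count, the kind of skeleton worth handing to programmers; it assumes Python 3 is installed on every worker node, and the streaming jar path varies by distribution.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming word count (hedged sketch).
# Submit the same file as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/in -output /data/out \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def map_phase():
    # Emit "<word>\t1" per word; the framework sorts by key before reduce.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```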
YARN: new architecture and new capabilities
- YARN design goals and implementation architecture
- New actors: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling under YARN
Apache Spark
Why Spark?
- Problems with Traditional Large-Scale Systems
- Introducing Spark
Spark Basics
- What is Apache Spark?
- Using the Spark Shell (see the sketch after this list)
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
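A minimal session in the PySpark shell (pyspark), where the SparkContext is already available as sc; the log path is hypothetical.

```python
# Hedged sketch: basic RDD use in the PySpark shell, where `sc` already exists.
lines = sc.textFile("hdfs:///data/logs/app.log")  # hypothetical path, read lazily
errors = lines.filter(lambda l: "ERROR" in l)     # transformation: no work yet
print(errors.count())                             # action: triggers the job
print(errors.take(5))                             # first five matching lines
```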
Working with RDDs
- RDD Operations
- Key-Value Pair RDDs
- MapReduce and Pair RDD Operations (sketched after this list)
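A hedged sketch of MapReduce-style aggregation with pair RDDs, again from the PySpark shell; the access-log path and its space-separated format are assumptions.

```python
# Hedged sketch: pair-RDD aggregation in the PySpark shell.
requests = sc.textFile("hdfs:///data/logs/access.log")  # hypothetical path

# map: key each non-empty line by its first field (e.g., client IP).
pairs = (requests.filter(lambda l: l.strip())
                 .map(lambda l: (l.split()[0], 1)))

# reduce: sum counts per key, the pair-RDD analogue of an MR reducer.
hits = pairs.reduceByKey(lambda a, b: a + b)

# Top 10 clients by request count.
print(hits.sortBy(lambda kv: kv[1], ascending=False).take(10))
```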
Running Spark on a Cluster
- Overview
- A Spark Standalone Cluster
- The Spark Standalone Web UI
Parallel Programming with Spark
- RDD Partitions and HDFS Data Locality
- Working With Partitions (see the sketch after this list)
- Executing Parallel Operations
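A short sketch of partition-level work in the PySpark shell; the input path and partition count are hypothetical.

```python
# Hedged sketch: inspecting and operating on RDD partitions.
rdd = sc.textFile("hdfs:///data/events", minPartitions=8)  # hypothetical path
print(rdd.getNumPartitions())  # often one partition per HDFS block

def rows_per_partition(records):
    # Runs once per partition, so any setup cost is paid once per partition.
    yield sum(1 for _ in records)

print(rdd.mapPartitions(rows_per_partition).collect())
```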
Caching and Persistence
- RDD Lineage
- Caching Overview (caching and lineage are sketched after this list)
- Distributed Persistence
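A minimal caching-and-lineage sketch from the PySpark shell; the input path and its comma-separated format are hypothetical.

```python
# Hedged sketch: persistence levels and lineage inspection.
from pyspark import StorageLevel

parsed = sc.textFile("hdfs:///data/events").map(lambda l: l.split(","))
parsed.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if RAM is tight

print(parsed.count())  # first action materializes and caches the RDD
print(parsed.count())  # second action reads from the cache

# The lineage Spark would replay to recompute lost partitions.
print(parsed.toDebugString())
```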
Writing Spark Applications
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Configuring Spark Properties
- Building and Running a Spark Application (see the sketch after this list)
- Logging
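A minimal standalone application, submitted with spark-submit rather than typed into the shell; the master URL, file name, and input path are all hypothetical.

```python
# Hedged sketch: a self-contained Spark application (log_error_count.py).
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = (SparkConf()
            .setAppName("LogErrorCount")
            .setMaster("spark://master.example.com:7077"))  # standalone master
    sc = SparkContext(conf=conf)

    errors = (sc.textFile("hdfs:///data/logs/app.log")
                .filter(lambda l: "ERROR" in l))
    print(f"errors: {errors.count()}")

    sc.stop()  # release cluster resources explicitly
```

Run with, for example, spark-submit log_error_count.py; in practice the master is usually supplied by spark-submit rather than hard-coded in the application.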
Spark, Hadoop, and the Enterprise Data Center
- Overview
- Spark and the Hadoop Ecosystem
- Spark and MapReduce