Hadoop, Spark, MongoDB, Kafka & Storm architecture

Course Code: hsmks

Duration: 49 hours

Prerequisites:

There are no specific requirements needed to attend this course.

Overview:

OUTLINE

1.1 Hadoop (1 day)
1.2 Oozie (0.75 to 1 day)
1.3 Kafka (1.5 days)
1.4 Apache Storm (1.5 days, 2 if necessary)
1.5 MongoDB (1.5 days)

Course Outline:

1.1 Hadoop (1 day)

For this part of the training we assume the attendees are already familiar with MapReduce and the
architecture of Hadoop.

• Compare the different file format storage options that the trainer will have implemented:
o Performance optimization
o Consequences on querying the data

• Run the MapReduce jobs the trainer will have implemented:
o MapReduce job for the entire processing of the data (the data is not pre-processed in Storm/Trident)
o MapReduce job for the processing of Storm/Trident pre-processed data
o MapReduce job to push the data to a MongoDB store

1.2 Oozie (0.75 to 1 day)

For this part of the training we assume the attendees have no previous experience or knowledge
of Oozie.

• Introduction to Oozie

• Features

• Installation (lab)

• Configuration of Oozie workflows (labs) to schedule the launch of Hadoop jobs that:
o Transform raw data into processed data
o Transform Storm/Trident pre-processed data into processed data
o Push the data to the MongoDB store

• The workflows will be triggered based on time and/or the amount of data to process
o Code provided by the trainer and adapted during the session as needed
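
The trigger rule above (time and/or amount of data) can be illustrated with a minimal sketch. This is plain Python showing only the decision logic, not Oozie configuration; in Oozie such rules would normally be expressed in a coordinator definition. The names `TriggerPolicy`, `should_trigger` and the threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TriggerPolicy:
    # Hypothetical thresholds, not taken from the course material.
    max_wait_seconds: float   # time-based condition
    max_pending_bytes: int    # data-volume condition


def should_trigger(policy: TriggerPolicy,
                   seconds_since_last_run: float,
                   pending_bytes: int) -> bool:
    """Return True when either the time or the data-volume condition is met."""
    return (seconds_since_last_run >= policy.max_wait_seconds
            or pending_bytes >= policy.max_pending_bytes)


# Example policy: run at least hourly, or as soon as 512 MiB are waiting.
policy = TriggerPolicy(max_wait_seconds=3600,
                       max_pending_bytes=512 * 1024 * 1024)
```

With this policy, a workflow would start after an hour even if little data has arrived, and earlier if the pending volume reaches the threshold.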

1.3 Kafka (1.5 days)

• Introduction

• Architecture and features

• Cluster installation and configuration (lab)

• Create stream applications (lab(s))

• Subscribe to one or more target data sources (mail) and integrate with Storm (lab(s))
o Code provided by the trainer and adapted during the session as needed

1.4 Apache Storm (1.5 days, 2 if necessary)

For this part of the training we assume the attendees have no previous experience or knowledge
of Storm or Trident.

• Introduction to Storm

• Architecture and features

• Cluster setup (theory and lab)

• Integrating Storm with Hadoop (theory and lab(s))

• Storm topologies (theory and lab(s))
o Spouts
o Bolts

• The contribution of Trident to Storm: transactions (theory and lab(s))

• Illustrate complete or partial processing of mail data in different formats
o Code provided by the trainer and adapted during the session as needed
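
As a rough illustration of the spout and bolt roles listed above, the following sketch chains plain Python generators. It does not use the Storm API and is not the trainer's code; the function names, the message fields and the sample mail data are all hypothetical.

```python
from collections import Counter


def mail_spout(messages):
    """Spout role: emit one tuple per incoming mail message."""
    for msg in messages:
        yield (msg["sender"], msg["subject"])


def domain_bolt(tuples):
    """Bolt role: transform each tuple, here by extracting the sender's domain."""
    for sender, _subject in tuples:
        yield sender.split("@")[-1].lower()


def count_bolt(domains):
    """Bolt role: aggregate the stream, here by counting mails per domain."""
    return Counter(domains)


# Hypothetical sample data standing in for a mail source.
sample = [
    {"sender": "alice@example.com", "subject": "hi"},
    {"sender": "bob@Example.com", "subject": "re: hi"},
    {"sender": "carol@example.org", "subject": "report"},
]

# Wire the "topology": spout -> domain bolt -> counting bolt.
counts = count_bolt(domain_bolt(mail_spout(sample)))
```

In a real topology the same roles would run distributed across a cluster; the sketch only shows how data flows from a source through successive processing steps.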

1.5 MongoDB (1.5 days if needed by client)

For this part of the training we assume the attendees are already familiar with MongoDB, in particular
with querying data; the sessions will focus on cluster administration.

• Cluster setup

• Creation and administration of replica sets (non-sharded clusters)

• Creation and administration of sharded replica sets (sharded clusters)

• Backup and restore of data

• Integrating MongoDB with Oozie and Hadoop

• Monitoring MongoDB