- Students should have intermediate SQL and Python programming skills.
Data Engineering and Architecture complement Machine Learning (ML) by providing the infrastructure to store and pre-process data and to deploy ML models. Although they fall within Data Science, Data Engineering and Architecture are often overlooked as critical ingredients in successfully developing predictive systems and platforms within companies. This course focuses on how companies can employ on-premises and cloud-based stacks to implement data pipelines and deployment architectures for the predictive and prescriptive models they develop. Along the way, participants will practice these skills in hands-on exercises and learn from case studies of successful implementations of these technologies.
By the end of this training, participants will be able to:
- Understand full end-to-end development of ML models
- Apply the latest Deep Learning models for Time Series, Image Recognition, and NLP
- Work with different types of data - unstructured, semi-structured, and structured
- Understand how different types of data can be stored on-premises - Hadoop, InfluxDB, Elasticsearch, Neo4j, and Cassandra
- Build and interact with a cloud-based data lake
- Automate and monitor data pipelines
- Develop proficiency in Spark, Airflow, and AWS/GCP tools
- Gain additional knowledge of Big Data and Hadoop
- Deploy and version ML models with TensorFlow Serving
Day 1
Module 1: Machine Learning and Deep Learning (90 mins)
● Working with TensorFlow and PyTorch
● Applications of ML - Image Recognition, NLP, Time Series, and more
● Toolset - Jupyter, Pandas, NumPy, TensorFlow, PyTorch, Scikit-learn, etc.
Module 2: SparkSQL, DataFrames, and Datasets (90 mins)
● SparkSQL
● Executing SQL commands on a DataFrame
● Using DataFrames instead of RDDs
● Spark MLlib
Module 3: Data Lakes with Hadoop and Spark (90 mins)
● Introduction to Data Lakes
● The Power of Spark
● Data Wrangling with Spark
Module 4: Hands-on Exercises (90 mins)
● SparkSQL
● Spark MLlib
● Deep Learning with TensorFlow 2.0 - CNNs, LSTMs, Transformers, and Autoencoders
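A minimal sketch of a CNN in TensorFlow 2.x with the Keras API, assuming `tensorflow` is installed; the input shape, layer sizes, and random training data are illustrative only, not a real image task.

```python
# Sketch: a tiny convolutional classifier trained for one epoch on random data.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),           # e.g. 28x28 grayscale images
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# One tiny training step on random data, just to show the API shape.
x = np.random.rand(4, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(4,))
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x, verbose=0).shape)  # (4, 10): one probability row per image
```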
Module 5: Automate Data Pipelines (90 mins)
● Data Pipelines
● Create data pipelines with Apache Airflow and Apache NiFi
● Data Quality
● Track Data Lineage
● Production Data Pipelines
Module 6: Unstructured Binary Data with Hadoop (90 mins)
● The 4 V’s of Big Data - volume, velocity, variety, and veracity
● HDFS and MapReduce in Hadoop
● Unstructured, semi-structured, and structured data
Module 7: Hands-on Exercises (90 mins)
● Apache Airflow and NiFi - creating real-time and batch data pipelines
● Hadoop - HDFS and Map-Reduce
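The MapReduce model from the Hadoop exercise can be sketched in pure Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Real Hadoop distributes these same phases across HDFS blocks and cluster nodes.

```python
# Sketch: word count, the canonical MapReduce example, in-process.
from collections import defaultdict

lines = ["big data big ideas", "data pipelines"]

# Map phase: emit a (word, 1) pair for every word in every input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group (here, sum the counts).
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```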
Day 2
Module 8: Structured Big Data with Cassandra (90 mins)
● Introduction to Cassandra
● Cassandra and CQL
● Data Modeling with NoSQL
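A sketch of Cassandra's query-first data modeling, using a hypothetical sensor-readings table; the CQL here is held as plain strings, which in practice would be run through a driver (e.g. `session.execute()` in `cassandra-driver`).

```python
# Sketch: a CQL table designed around the query it must answer.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS readings_by_sensor (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
"""

# The partition key (sensor_id) routes each sensor's rows to one partition;
# the clustering column (reading_time) sorts rows within it, so "latest N
# readings for a sensor" is an efficient single-partition query.
LATEST_READINGS = """
SELECT reading_time, value FROM readings_by_sensor
WHERE sensor_id = ? LIMIT 10;
"""

print(LATEST_READINGS.strip())
```

Unlike relational modeling, the table is shaped by the read pattern first; a second query pattern would typically get its own denormalized table.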
Module 9: Time Series Data with InfluxDB (90 mins)
● The TICK stack
● Data modeling and querying with InfluxDB - InfluxQL and Flux
● Visualizing time series data
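A pure-Python sketch of InfluxDB's line protocol, the text format the TICK stack ingests on writes; the measurement, tag, and field names are illustrative, and a real client library would handle escaping and batching.

```python
# Sketch: build a line-protocol point "measurement,tags fields timestamp".
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one time series point; tags index the series, fields hold values."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol("cpu", {"host": "web01"}, {"usage": 0.64}, 1700000000000000000)
print(line)  # cpu,host=web01 usage=0.64 1700000000000000000
```

Tags are indexed and used for grouping in InfluxQL/Flux queries, while fields carry the actual measured values, so the tag/field split is itself a data-modeling decision.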
Module 10: Graph-Based Data with Neo4j (90 mins)
● Introduction to Neo4j
● Data modeling and querying with graph databases - Cypher for Neo4j
● EDA, recommendations, and predictions with Neo4j
● Similarity metrics
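A pure-Python sketch of Jaccard similarity, one of the set-based similarity metrics used in graph recommendations (Neo4j exposes comparable functions in its Graph Data Science library); here it compares two users by the sets of items they interacted with.

```python
# Sketch: Jaccard similarity = |intersection| / |union| of two sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are treated as dissimilar
    return len(a & b) / len(a | b)

# Two users sharing 2 of 4 distinct films score 2/4 = 0.5.
print(jaccard({"film1", "film2", "film3"}, {"film2", "film3", "film4"}))  # 0.5
```

In a recommendation setting, high pairwise similarity between users is then used to suggest items one user has seen and the other has not.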
Module 11: ML Deployment Infrastructure (90 mins)
● On-premises deployment with Flask, Pickle, and TensorFlow/PyTorch
● Cloud deployment with Heroku, AWS, and GCP (TensorFlow Serving)
● Versioning and logging of ML models in production
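A stdlib-only sketch of the pickle half of on-premises deployment: serialize a trained model artifact with an explicit version tag, then reload it at serving time. The `SimpleModel` class is a hypothetical stand-in for a fitted scikit-learn or Keras model, and the Flask/TensorFlow Serving layer is omitted.

```python
# Sketch: version-tagged model serialization with pickle.
import os
import pickle
import tempfile

class SimpleModel:
    """Placeholder for a trained model; predict() is a trivial linear rule."""
    def __init__(self, coef):
        self.coef = coef
    def predict(self, x):
        return self.coef * x

model, version = SimpleModel(coef=2.0), "v1"
path = os.path.join(tempfile.gettempdir(), f"model-{version}.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)          # persist the fitted model, version in the filename

with open(path, "rb") as f:
    restored = pickle.load(f)      # a serving process reloads it by version

print(restored.predict(3.0))  # 6.0
```

Keeping the version in the artifact name (or a registry) is what lets a Flask endpoint or TensorFlow Serving roll back to a known-good model.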
Module 12: Case Studies (90 mins)
● Time series prediction with stock market data - using LSTMs
● Image recognition on unstructured data - using CNNs
● Anomaly detection - using Autoencoders
● Text classification and captioning - using CNNs and RNNs
Module 13: Hands-on Exercises (90 mins)
● Cassandra CQL
● InfluxDB InfluxQL and Flux
● Neo4j Cypher
● TensorFlow Serving