- A general understanding of math
- A general understanding of programming
- A general understanding of databases
Participants who complete this training will gain a practical, real-world understanding of Big Data and its related technologies, methodologies and tools.
Participants will have the opportunity to put this knowledge into practice through hands-on exercises. Group interaction and instructor feedback make up an important component of the class.
The course starts with an introduction to the fundamental concepts of Big Data, then progresses to the programming languages and methodologies used to perform Data Analysis. Finally, we discuss the tools and infrastructure that enable Big Data storage, Distributed Processing, and Scalability.
Audience
- Developers / programmers
- IT consultants
Format of the course
Part lecture, part discussion, heavy hands-on practice and implementation, occasional quizzing to measure progress.
Introduction to Data Analysis and Big Data
- What makes Big Data "big"?
- Velocity, Volume, Variety, Veracity (VVVV)
- Limits to traditional Data Processing
- Distributed Processing
- Statistical Analysis
- Types of Machine Learning Analysis
- Data Visualization
- Distributed Processing
- MapReduce
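
As a rough illustration of the MapReduce topic listed above, the sketch below counts words in a few in-memory strings using separate map, shuffle, and reduce steps. It runs on a single machine and only mimics what a distributed framework would do across many nodes; the function names and the sample documents are assumptions made for this example, not part of the course material.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, standing in for files spread across a cluster.
documents = ["big data is big", "data is everywhere"]
counts = word_count(documents)
print(counts)
```

In a real cluster, each call to `map_phase` and `reduce_phase` could run on a different node, which is why both functions avoid shared state.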
Languages used for Data Analysis
- R language (crash course)
- Python (crash course)
Approaches to Data Analysis
- Statistical Analysis
- Time Series analysis
- Forecasting with Correlation and Regression models
- Inferential Statistics (estimating)
- Descriptive Statistics in Big Data sets (e.g. calculating mean)
- Machine Learning
- Supervised vs unsupervised learning
- Classification and clustering
- Estimating the cost of specific methods
- Filter
- Natural Language Processing
- Processing text
- Understanding the meaning of the text
- Automatic text generation
- Sentiment/Topic Analysis
- Computer Vision
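
The descriptive statistics item above mentions calculating a mean over a Big Data set. One common way to do this when the data does not fit in one place is to compute a partial sum and count per chunk and combine them afterwards. The following is a minimal sketch under that assumption; the chunk layout and function names are hypothetical.

```python
def partial_stats(chunk):
    """Return the (sum, count) of one chunk; this part could run on any node."""
    return sum(chunk), len(chunk)


def combined_mean(chunks):
    """Combine per-chunk sums and counts into one overall mean."""
    total = 0.0
    count = 0
    for chunk in chunks:
        chunk_sum, chunk_count = partial_stats(chunk)
        total += chunk_sum
        count += chunk_count
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data set")
    return total / count


# Hypothetical chunks, standing in for partitions stored on different machines.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = combined_mean(chunks)
print(mean)
```

Averaging the per-chunk means directly would give a wrong result when chunks differ in size, which is why the sketch carries sums and counts instead.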
Big Data infrastructure
- Data Storage
- Relational databases (SQL) - when to use them
- Non-relational databases (NoSQL)
- Cassandra - just a quick theoretical overview
- MongoDB - practical exercises
- Understanding the nuances: hierarchical, object-oriented, document-oriented, graph-oriented, etc.
- Search Engines
- Elasticsearch
- Distributed Processing (only theoretical background)
- Scalability
- Public cloud
- Azure, etc.
- Private cloud
- OpenStack, Cloud Foundry, etc. (only theoretical overview)
- Auto-scalability
- Public cloud
- Choosing the right solution for the problem