Databases in Building High-Performance Distributed Systems

Course Code: bdbwsr

Duration: 21 hours

Overview:

The "Databases in Building High-Performance Distributed Systems" training is an intensive course that combines the theory of distributed systems with practical database usage in designing scalable, fault-tolerant applications. Over three days, participants will gain knowledge on:
- Key challenges of distributed systems: consistency, availability, scalability
- Design patterns: CQRS, Saga, Two-Phase Commit (2PC), Circuit Breaker
- Modern databases: document-based (MongoDB, CouchDB), key-value (Redis), graph-based (Neo4j), column-based (HBase), object-oriented (GridGain), time-series (TimescaleDB, InfluxDB)
- Techniques for partitioning, sharding, and replication to enhance system performance
- Managing real-time data versus classical batch processing
- Practical methods for implementing distributed transactions and data recovery after failures

The training includes numerous practical workshops where participants will model, optimize, and deploy database systems in a distributed environment. The course is intended for programmers, system architects, and administrators who want to acquire skills in efficiently managing data in distributed systems.

Course Outline:

Day 1: Theory and Introduction to Distributed Systems

Introduction
- Introduction to the training structure and agenda, discussion of the training environment.
Basic Concepts of Distributed Systems
- Definition of distributed systems and their significance in modern applications.
- Key challenges: scalability, availability, consistency, fault tolerance.
Data Consistency Models
- Discussion of Strong Consistency and Eventual Consistency.
- Managing consistency in distributed systems: read and write quorums, read-your-writes guarantees.
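
As an illustration of the quorum idea, the sketch below checks the common rule of thumb that a read of R replicas is guaranteed to see the latest acknowledged write to W replicas out of N when R + W > N. The function names are hypothetical and the brute-force check is only meant to make the overlap argument concrete; it is not taken from the course materials.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Rule of thumb: read and write quorums always intersect when R + W > N."""
    return r + w > n


def every_read_sees_a_write(n: int, r: int, w: int) -> bool:
    """Brute-force the same claim: try every set of W written replicas against
    every set of R read replicas and check that they share at least one node."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )
```

With N = 3, choosing R = 2 and W = 2 gives overlapping quorums, while R = 1 and W = 1 does not, so a read may miss the latest write.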
Distributed Log Systems and Communication
- The pub/sub pattern and stream-table duality.
- Data compaction and real-time data processing.
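
The relationship between a stream and a table, and the effect of compaction, can be sketched in a few lines. The example assumes a changelog of (key, value) events in which a value of None marks a deletion; the names `materialize` and `compact` are hypothetical and do not correspond to any specific product's API.

```python
def materialize(changelog):
    """Fold a stream of (key, value) events into a table (dict).
    A value of None is treated as a tombstone that deletes the key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def compact(changelog):
    """Keep only the latest event for each key, preserving the relative
    order of the events that are retained."""
    last_index = {key: i for i, (key, _) in enumerate(changelog)}
    return [event for i, event in enumerate(changelog) if last_index[event[0]] == i]
```

Replaying the compacted log yields the same table as replaying the full log, which is why compaction can shrink storage without changing the current state.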
Case Study 1: Example of High-Performance Applications
- Analysis of the architecture of communication systems (e.g., WhatsApp, Signal).
- Challenges related to consistency and data recovery.

Day 2: Practical Aspects of Designing Distributed Systems

Designing Fault-Tolerant Applications
- Discussion of patterns: CQRS, Inbox/Outbox, Two-Phase Commit (2PC), Saga, Change Data Capture (CDC), Circuit Breaker, Read Repair.
- Examples of practical applications.
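
One of the listed patterns, Circuit Breaker, can be sketched as follows. This is a minimal, single-threaded illustration under assumed defaults (three failures to open, a 30-second reset timeout); the class and parameter names are hypothetical, and a production implementation would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a remote dependency in `breaker.call(...)` lets the caller fail fast while the dependency is down instead of piling up timeouts.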
Examples of Non-Relational Databases
- Document Databases (e.g., MongoDB, CouchDB)
- Key-Value Databases (e.g., Redis)
- Graph Databases (e.g., Neo4j, OrientDB)
- Columnar Databases (e.g., HBase)
- Object Databases (e.g., GridGain)
- Time-Series Databases (e.g., TimescaleDB, InfluxDB)
- Search Engines (e.g., Apache Solr)
- In-Memory Grids (e.g., Hazelcast, GridGain)
Modern Databases: Partitioning, Sharding, and Replication
- Partitioning and Sharding: Discussion of techniques for dividing data into smaller fragments to improve system performance and scalability.
- Data Replication: Different types of replication (synchronous, asynchronous), benefits, and challenges related to data replication in distributed environments.
- Secondary Indexes: Creating secondary indexes and using them to improve query performance.
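
As one possible illustration of sharding, the sketch below places keys on shards with a consistent-hash ring. Consistent hashing is only one of several partitioning schemes (range-based and plain modulo hashing are others); the class name `HashRing`, the shard names, and the choice of 100 virtual nodes per shard are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times (virtual nodes)
        # so that keys spread more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash,
        wrapping around to the start of the ring if necessary."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

A useful property of this scheme is that adding a shard only moves keys onto the new shard; keys never shuffle between the existing shards.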
Case Study 2: Designing a Graph-Based System
- Graph-based design and modeling in distributed systems using Neo4j or OrientDB.
- Practical exercise: graph modeling.
Managing Real-Time Data vs. Traditional Data Warehouses
- Introduction to real-time data processing and batch processing.
- Example of using TimescaleDB for monitoring time-series data.
NewSQL – Modern Approach to Relational Databases
- Discussion of NewSQL as an approach that combines the advantages of relational databases with the flexibility and scalability of NoSQL solutions.
- Task for NewSQL: Participants will review the most popular NewSQL databases and complete a practical task in CockroachDB. The goal is to implement transactions with ACID guarantees in a distributed environment.

Day 3: Practical Exercises and Database Optimization

Practical Tasks Using Non-Relational Databases:
- Task for MongoDB: Creating complex queries with data aggregation.
  - Participants will build queries with the MongoDB aggregation pipeline to group and filter data in real time.
- Task for Redis: Implementing caching mechanisms using Redis.
  - Participants will build a system for storing query results in Redis to optimize read performance.
- Task for CouchDB: Data synchronization in CouchDB using replication functions.
  - The task includes configuring replication between two CouchDB instances and analyzing data conflicts.
- Task for Neo4j: Optimizing Cypher queries in a graph database.
  - Participants will analyze a large graph and optimize queries that find dependencies between nodes.
- Task for InfluxDB: Processing time-series data and optimizing data retention.
  - Exercises using InfluxQL for data flow analysis and setting retention strategies.
- Task for GridGain: Processing data using GridGain, building and optimizing queries in an object-oriented environment.
  - Participants will optimize the storage and retrieval of large objects.
- Task for Apache Solr: Implementing full-text search using Solr.
  - Creating indexes and optimizing search queries on large datasets.
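
For the Redis caching task in the list above, the read path is often structured as a cache-aside lookup. The sketch below shows that flow with an in-memory stand-in so it runs without a Redis server; `TTLCache` and `cached_query` are hypothetical names, and in the actual exercise a Redis client would take the place of the dictionary.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock  # injectable so tests can control time

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # entry expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def cached_query(cache, key, run_query, ttl=60.0):
    """Cache-aside: return the cached result if present; otherwise run the
    query and store its result. A result of None is never cached here."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

The time-to-live bounds how stale a cached result can become; choosing it is a trade-off between read performance and freshness.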
Task: Distributed Transactions
- Participants will learn about the concept of distributed transactions, including mechanisms that ensure consistency in distributed systems.
- Practical exercise: Implementing distributed transactions using patterns such as Two-Phase Commit (2PC) or Saga.
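
The Saga pattern mentioned above can be sketched as a sequence of local steps, each paired with a compensating action. This is a simplified, in-process illustration with hypothetical names (`run_saga`, `SagaFailed`); a real saga would persist its progress and coordinate the steps across services.

```python
class SagaFailed(Exception):
    """Raised after a failed step once earlier steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs in order. If an action raises,
    run the compensations of the already completed steps in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(completed):
                undo()
            raise SagaFailed(f"saga aborted: {exc}") from exc
        completed.append(compensation)
```

Unlike Two-Phase Commit, nothing is locked while the saga runs; consistency is restored after a failure by explicitly undoing the steps that had already succeeded.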
Data Recovery After Failures and Backups
- Patterns: Last-Writer-Wins, Vector Clocks, CRDT.
- Strategies for data recovery after failures.
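
Two of the patterns named above can be illustrated briefly: comparing vector clocks to detect concurrent updates, and merging a grow-only counter, which is one of the simplest CRDTs. The function names and the dictionary representation (node name mapped to counter) are assumptions made for this sketch.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of node -> counter.
    Returns "equal", "before", "after", or "concurrent"."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge_gcounter(a, b):
    """Merge two grow-only counters by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def gcounter_value(counter):
    """The value of a grow-only counter is the sum of all per-node counts."""
    return sum(counter.values())
```

When `compare` reports "concurrent", Last-Writer-Wins would keep one update and discard the other, whereas a CRDT merge such as `merge_gcounter` combines both sides and gives the same result regardless of merge order.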
Summary and Discussion
- Q&A session and exchange of experiences.

The training does not cover relational databases, Elasticsearch, Apache Kafka, Prometheus, or Cassandra.
