Prerequisites
- An understanding of the C/C++ language and parallel programming concepts
- Basic knowledge of computer architecture and memory hierarchy
- Experience with command-line tools and code editors
Audience
- Developers who wish to learn how to use ROCm and HIP to program AMD GPUs and exploit their parallelism
- Developers who wish to write high-performance and scalable code that can run on different AMD devices
- Programmers who wish to explore the low-level aspects of GPU programming and optimize their code performance
Overview
ROCm is an open-source platform for programming AMD GPUs that also provides interoperability with CUDA and OpenCL. ROCm exposes hardware details to the programmer and gives full control over the parallelization process; this, however, requires a solid understanding of the device architecture, memory model, execution model, and optimization techniques.
HIP is a C++ runtime API and kernel language that allows you to write portable code that can run on both AMD and NVIDIA GPUs. HIP provides a thin abstraction layer over the native GPU APIs, such as ROCm and CUDA, and allows you to leverage the existing GPU libraries and tools.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level developers who wish to use ROCm and HIP to program AMD GPUs and exploit their parallelism.
Day 1: Foundations of GPU Programming and the ROCm Platform
Morning Session
1. Welcome and Introduction
Course Overview
- Objectives and expectations
- The importance of GPU programming in high-throughput computing
Understanding the Challenge
- Limitations of CPU processing in data-intensive applications
- The need for efficient GPU-based data processing pipelines
2. Fundamentals of GPU Computing
GPU vs. CPU Architecture
- Exploring parallelism: GPU architecture deep dive
- Differences between SIMD and SIMT execution models
AMD GPU Architecture
- Overview of AMD's GPU hardware
- Memory hierarchy: global, shared, local, and constant memory
3. Core Concepts in GPU Programming
Thread Hierarchy
- Threads, warps (wavefronts), and blocks
- Organizing grids and blocks for optimal performance
SIMT Execution Model
- Single Instruction, Multiple Threads paradigm explained
Directive-Based Programming Models
- Introduction to OpenMP and OpenACC for GPU programming
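As a point of comparison with explicit kernel programming, a directive-based version of a vector addition can be as short as the sketch below. It assumes a compiler with OpenMP target-offload support for AMD GPUs (for example, amdclang++ with -fopenmp and an appropriate --offload-arch flag); the array names and sizes are illustrative.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the loop to the GPU; the map clauses move data to and from the device.
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```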
Morning Exercise
Exercise 1: Exploring GPU Capabilities
- Write a program to query and display GPU device properties
- Analyze hardware capabilities and limitations
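A minimal sketch of the kind of program this exercise asks for, using the HIP device-query API; the selection of properties printed here is illustrative:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    std::printf("Found %d HIP device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s\n", i, prop.name);
        std::printf("  Compute units:          %d\n", prop.multiProcessorCount);
        std::printf("  Global memory (GiB):    %.1f\n",
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        std::printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        std::printf("  Wavefront size:          %d\n", prop.warpSize);
        std::printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```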
Afternoon Session
4. Setting Up the Development Environment
Installing ROCm and HIP on Linux
- System requirements and installation steps
- Verifying the installation
Configuring Development Tools
- Setting up compilers, debuggers, and profilers
- Integrating ROCm and HIP with Visual Studio Code
5. Writing Your First HIP Program
Hello, GPU World!
- Understanding the structure of a HIP program
- Writing and executing a simple kernel
- Compilation and execution process
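A minimal "hello, GPU" program might look like the sketch below; it can typically be built with hipcc (for example, `hipcc hello.cpp -o hello`):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// A kernel is a function marked __global__ that runs on the device.
__global__ void hello_kernel() {
    printf("Hello from thread %d in block %d\n",
           (int)threadIdx.x, (int)blockIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each using the triple-chevron syntax.
    hello_kernel<<<2, 4>>>();
    // Kernel launches are asynchronous; wait for completion before exiting.
    hipDeviceSynchronize();
    return 0;
}
```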
6. Understanding Thread Hierarchies
Modifying Kernels
- Accessing thread and block indices within kernels
- Experimenting with different grid and block dimensions
Afternoon Exercise
Exercise 2: Vector Addition
- Implement vector addition on the GPU using HIP
- Compare performance with a CPU implementation
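A possible starting point for this exercise is sketched below: each thread computes its global index from its block and thread indices and handles one element. Buffer names and sizes are illustrative.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Each thread computes one element from its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;   // round up so every element is covered
    vec_add<<<grid, block>>>(da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("hc[0] = %f (expected 3.0)\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```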
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 2: Advanced HIP Programming and Debugging Techniques
Morning Session
1. Deep Dive into HIP Programming
Memory Management in HIP
- Allocating and freeing device memory (hipMalloc, hipFree)
- Data transfer between host and device (hipMemcpy)
- Introduction to Unified Memory
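The sketch below shows managed (unified) memory allocated with hipMallocManaged, which removes the explicit hipMemcpy step; actual page-migration behaviour depends on the GPU and driver.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // Managed memory is accessible from both host and device;
    // the runtime migrates pages on demand (support depends on the GPU and driver).
    hipMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    hipDeviceSynchronize();            // make device writes visible to the host

    std::printf("data[0] = %f (expected 2.0)\n", data[0]);
    hipFree(data);
    return 0;
}
```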
Synchronization and Execution Control
- Thread coordination with __syncthreads()
- Understanding memory consistency and memory fences
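A typical use of __syncthreads() is a block-level reduction staged through shared memory, as in this sketch (it assumes 256-thread blocks):

```cpp
#include <hip/hip_runtime.h>

// Each 256-thread block reduces 256 input elements to one partial sum.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // all loads must complete before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                   // every step needs a full-block barrier
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}
```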
2. Debugging HIP Applications
Common Errors and Bugs
- Identifying typical issues in GPU programming
Debugging Tools in ROCm
- Using rocgdb for kernel debugging
- Setting breakpoints and inspecting variables
Best Practices in Debugging
- Strategies for efficient debugging
- Interpreting error messages and fixing bugs
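One widely used debugging aid is to wrap every runtime call in an error-checking macro so failures surface immediately with a readable message. The macro name HIP_CHECK below is a local convention, not part of the HIP API:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a HIP runtime call fails.
#define HIP_CHECK(expr)                                                     \
    do {                                                                    \
        hipError_t err_ = (expr);                                           \
        if (err_ != hipSuccess) {                                           \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",                 \
                         hipGetErrorString(err_), __FILE__, __LINE__);      \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

int main() {
    float* d = nullptr;
    HIP_CHECK(hipMalloc(&d, 1 << 20));
    // Kernel launch errors are reported asynchronously; query them explicitly.
    HIP_CHECK(hipGetLastError());
    HIP_CHECK(hipDeviceSynchronize());
    HIP_CHECK(hipFree(d));
    return 0;
}
```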
Morning Exercise
Exercise 3: Debugging Practice
- Introduce intentional bugs in a HIP program
- Use debugging tools to identify and fix the issues
Afternoon Session
3. Leveraging ROCm Libraries
Overview of ROCm Math Libraries
- Utilizing rocBLAS, rocFFT, and others for complex computations
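As an illustration, a single-precision AXPY (y = alpha * x + y) with rocBLAS might look like the sketch below; the header path and link flags (e.g. -lrocblas) can vary between ROCm releases:

```cpp
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>   // older ROCm releases expose <rocblas.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // y = alpha * x + y, computed entirely on the device
    const float alpha = 3.0f;
    rocblas_saxpy(handle, n, &alpha, dx, 1, dy, 1);

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("hy[0] = %f (expected 5.0)\n", hy[0]);

    rocblas_destroy_handle(handle);
    hipFree(dx); hipFree(dy);
    return 0;
}
```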
Collective Communication with RCCL
- Introduction to RCCL (ROCm Communication Collectives Library)
- Enabling communication between multiple GPUs
4. Multi-GPU Programming
Managing Multiple GPUs
- Techniques for device selection and management
- Workload distribution strategies
Collective Operations with RCCL
- Implementing broadcast, reduce, and all-reduce operations
- Synchronization across GPUs for efficient computation
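RCCL mirrors the NCCL API, so a single-process all-reduce across two GPUs can be sketched as below; it assumes at least two visible devices, and the header path may differ between installs:

```cpp
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>    // some installs expose <rccl.h>
#include <vector>

int main() {
    const int ndev = 2;                        // assumes at least two GPUs
    const int n = 1 << 20;
    int devs[ndev] = {0, 1};

    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, devs); // one communicator per device

    std::vector<float*> buf(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(&buf[i], n * sizeof(float));
        hipStreamCreate(&streams[i]);
    }

    // Sum the buffers across all GPUs; every device receives the result in place.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], n, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
        hipFree(buf[i]);
        hipStreamDestroy(streams[i]);
    }
    return 0;
}
```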
Afternoon Exercise
Exercise 4: Multi-GPU Vector Addition with RCCL
- Extend vector addition to run on multiple GPUs
- Implement reduction operations using RCCL
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 3: Distributed Computing with MPI and RCCL
Morning Session
1. Quick Recap: DMA, RDMA, and P2P Communication
Understanding High-Speed Data Transfers
- Brief overview of DMA, RDMA, and P2P communication
- Relevance in GPU programming and data-intensive applications
2. Introduction to Open MPI
Fundamentals of MPI
- Concepts of processes, communicators, and messages
- The role of MPI in distributed computing
Setting Up Open MPI
- Installation and configuration for GPU-accelerated environments
Writing MPI Programs
- Basic MPI functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize
- Sending and receiving messages with MPI_Send and MPI_Recv
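A minimal MPI program covering these calls (and a possible starting point for Exercise 5 below) might look like the following sketch; it can typically be built with mpicxx and run with `mpirun -np 4`:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    std::printf("Hello from rank %d of %d\n", rank, size);

    // Simple point-to-point exchange: rank 0 sends a value to rank 1.
    if (size > 1) {
        int value = 42;
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("Rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```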
Morning Exercise
Exercise 5: Hello World with MPI
- Write a simple MPI program to print messages from multiple processes
- Compile and run the program across multiple nodes
Afternoon Session
3. Integrating HIP with Open MPI
Combining HIP and MPI
- Strategies for integrating GPU computations with MPI
- Managing data distribution and collection
Commonalities Between MPI and RCCL
- Understanding how RCCL complements MPI in GPU environments
- When to use MPI vs. RCCL for communication
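With a GPU-aware Open MPI build (for example, one built with ROCm support through UCX), device pointers can be passed directly to MPI calls, as in this illustrative sketch; without such a build, data must first be staged through host buffers. It assumes at least two ranks and one visible GPU per rank:

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One GPU per rank is a common mapping; adjust to the node's layout.
    int ngpus = 0;
    hipGetDeviceCount(&ngpus);
    hipSetDevice(rank % ngpus);

    const int n = 1 << 20;
    float* dbuf = nullptr;
    hipMalloc(&dbuf, n * sizeof(float));
    hipMemset(dbuf, 0, n * sizeof(float));

    // With a GPU-aware MPI build, device pointers can be handed directly
    // to MPI calls; the library moves the data over PCIe/Infinity Fabric/RDMA.
    if (rank == 0)
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    hipFree(dbuf);
    MPI_Finalize();
    return 0;
}
```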
4. Implementing Distributed GPU Applications
Designing Distributed Algorithms
- Decomposing problems for parallel execution
- Minimizing communication overhead
Data Management Across Nodes
- Efficient data transfer techniques
- Synchronization and consistency considerations
Afternoon Exercise
Exercise 6: Distributed Matrix Multiplication
- Implement matrix multiplication across multiple nodes using MPI and HIP
- Use RCCL for intra-node communication and MPI for inter-node communication
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 4: Performance Optimization and Advanced Topics
Morning Session
1. Performance Optimization Techniques
Memory Coalescing
- Strategies for efficient global memory access
- Aligning data structures for optimal performance
Maximizing GPU Occupancy
- Understanding the factors affecting occupancy
- Tuning block sizes and grid dimensions
Shared Memory Optimization
- Leveraging shared memory to reduce global memory access
- Avoiding bank conflicts and optimizing memory patterns
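A classic example that combines coalesced global accesses, shared-memory tiling, and bank-conflict avoidance is the tiled matrix transpose sketched below; the +1 padding on the tile is what avoids the conflicts:

```cpp
#include <hip/hip_runtime.h>

constexpr int TILE = 32;

// Transpose through a shared-memory tile so that both the read from and the
// write to global memory are coalesced. The +1 padding shifts each row into a
// different bank, avoiding conflicts on the transposed access pattern.
// Launch with dim3 block(TILE, TILE) and a grid covering the width x height matrix.
__global__ void transpose_tiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Swap block coordinates so the output write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```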
2. Advanced Synchronization Techniques
Atomic Operations
- Using atomic functions for synchronization
- Performance implications of atomic operations
Stream and Event Management
- Overlapping computation and data transfer
- Managing multiple streams for concurrency
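The sketch below splits a buffer across two streams so that the host-device copies in one stream can overlap with the kernel running in the other; pinned host memory (hipHostMalloc) is used because pageable memory defeats asynchronous copies:

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 22, half = n / 2;

    // Pinned host memory is required for truly asynchronous copies.
    float* hp;
    hipHostMalloc(&hp, n * sizeof(float));
    for (int i = 0; i < n; ++i) hp[i] = 1.0f;

    float* d;
    hipMalloc(&d, n * sizeof(float));

    hipStream_t s[2];
    hipStreamCreate(&s[0]);
    hipStreamCreate(&s[1]);

    // Each stream copies and processes its own half, so the copy in one
    // stream can overlap with the kernel running in the other.
    for (int i = 0; i < 2; ++i) {
        float* dst = d  + i * half;
        float* src = hp + i * half;
        hipMemcpyAsync(dst, src, half * sizeof(float), hipMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(dst, 2.0f, half);
        hipMemcpyAsync(src, dst, half * sizeof(float), hipMemcpyDeviceToHost, s[i]);
    }
    hipStreamSynchronize(s[0]);
    hipStreamSynchronize(s[1]);

    hipStreamDestroy(s[0]);
    hipStreamDestroy(s[1]);
    hipHostFree(hp);
    hipFree(d);
    return 0;
}
```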
Morning Exercise
Exercise 7: Optimizing Kernel Performance
- Profile a HIP kernel to identify bottlenecks
- Apply optimization techniques to improve performance
Afternoon Session
3. Profiling GPU Applications
Profiling Tools in ROCm
- Using rocprof and rocm-smi for performance analysis
Interpreting Profiling Data
- Understanding kernel execution metrics
- Identifying and addressing performance bottlenecks
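rocprof and rocm-smi work from the command line; a complementary in-code technique is timing individual kernels with HIP events, as sketched here:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void busy(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* d;
    hipMalloc(&d, n * sizeof(float));
    hipMemset(d, 0, n * sizeof(float));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start);                 // mark the point before the kernel
    busy<<<(n + 255) / 256, 256>>>(d, n);
    hipEventRecord(stop);                  // mark the point after the kernel
    hipEventSynchronize(stop);             // wait until the stop event completes

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("Kernel time: %.3f ms\n", ms);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d);
    return 0;
}
```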
4. Advanced Topics
Asynchronous Execution and Streams
- Managing asynchronous operations for better performance
HIP Graphs
- Introduction to HIP Graphs for complex task scheduling
- Benefits of using graphs for performance and resource management
Future Trends in GPU Computing
- Emerging technologies and their potential impact
- Direction of AMD ROCm and HIP developments
Afternoon Exercise
Exercise 8: Implementing HIP Graphs
- Convert a sequential set of kernel executions into a HIP Graph
- Analyze performance improvements
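A common route into HIP Graphs, and one possible approach for this exercise, is stream capture: record an existing sequence of launches into a graph, instantiate it once, and replay it with very low per-launch overhead. The sketch below assumes a repeated kernel sequence; names and sizes are illustrative:

```cpp
#include <hip/hip_runtime.h>

__global__ void step(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d;
    hipMalloc(&d, n * sizeof(float));

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Record a fixed sequence of kernel launches into a graph.
    hipGraph_t graph;
    hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)
        step<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    hipStreamEndCapture(stream, &graph);

    // Instantiate once, then launch many times with minimal launch overhead.
    hipGraphExec_t exec;
    hipGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter)
        hipGraphLaunch(exec, stream);
    hipStreamSynchronize(stream);

    hipGraphExecDestroy(exec);
    hipGraphDestroy(graph);
    hipStreamDestroy(stream);
    hipFree(d);
    return 0;
}
```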
Evening Wrap-Up and Course Conclusion
Recap of Key Concepts
Final Q&A Session
Feedback and Next Steps