Course Code: amdgpuprobesp
Duration: 28 hours
Prerequisites:
  • An understanding of C/C++ language and parallel programming concepts
  • Basic knowledge of computer architecture and memory hierarchy
  • Experience with command-line tools and code editors

Audience

  • Developers who wish to learn how to use ROCm and HIP to program AMD GPUs and exploit their parallelism
  • Developers who wish to write high-performance and scalable code that can run on different AMD devices
  • Programmers who wish to explore the low-level aspects of GPU programming and optimize their code performance
Overview:

ROCm is an open-source platform for GPU programming that supports AMD GPUs and also provides compatibility with CUDA and OpenCL. ROCm exposes hardware details to the programmer and gives full control over the parallelization process; that control, however, requires a good understanding of the device architecture, memory model, execution model, and optimization techniques.

HIP is a C++ runtime API and kernel language that allows you to write portable code that can run on both AMD and NVIDIA GPUs. HIP provides a thin abstraction layer over the native GPU APIs, such as ROCm and CUDA, and allows you to leverage the existing GPU libraries and tools.

This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level developers who wish to use ROCm and HIP to program AMD GPUs and exploit their parallelism.

Course Outline:

Day 1: Foundations of GPU Programming and the ROCm Platform

 

Morning Session

1. Welcome and Introduction

Course Overview

  • Objectives and expectations
  • The importance of GPU programming in high-throughput computing

Understanding the Challenge

  • Limitations of CPU processing in data-intensive applications
  • The need for efficient GPU-based data processing pipelines

2. Fundamentals of GPU Computing

GPU vs. CPU Architecture

  • Exploring parallelism: GPU architecture deep dive
  • Differences between SIMD and SIMT execution models

AMD GPU Architecture

  • Overview of AMD's GPU hardware
  • Memory hierarchy: global, shared, local, and constant memory

3. Core Concepts in GPU Programming

Thread Hierarchy

  • Threads, warps (wavefronts), and blocks
  • Organizing grids and blocks for optimal performance

SIMT Execution Model

  • Single Instruction, Multiple Threads paradigm explained

Directive-Based Programming Models

  • Introduction to OpenMP and OpenACC for GPU programming

Morning Exercise

 

Exercise 1: Exploring GPU Capabilities

  • Write a program to query and display GPU device properties
  • Analyze hardware capabilities and limitations
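
A possible starting point for this exercise, sketched with the standard HIP runtime API (error checking omitted for brevity; compile with hipcc):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        hipGetDeviceCount(&count);                    // number of visible GPUs
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, i);         // fill properties for device i
            printf("Device %d: %s\n", i, prop.name);
            printf("  Total global memory:   %zu MB\n", prop.totalGlobalMem >> 20);
            printf("  Compute units:         %d\n", prop.multiProcessorCount);
            printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
            printf("  Wavefront size:        %d\n", prop.warpSize);
        }
        return 0;
    }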

 

Afternoon Session

 

4. Setting Up the Development Environment

Installing ROCm and HIP on Linux

  • System requirements and installation steps
  • Verifying the installation

Configuring Development Tools

  • Setting up compilers, debuggers, and profilers
  • Integrating ROCm and HIP with Visual Studio Code

5. Writing Your First HIP Program

Hello, GPU World!

  • Understanding the structure of a HIP program
  • Writing and executing a simple kernel
  • Compilation and execution process
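
A minimal sketch of such a program, assuming the triple-chevron launch syntax supported by hipcc (build with: hipcc hello.cpp -o hello):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Kernel: each thread prints its global index.
    __global__ void hello() {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        printf("Hello from GPU thread %d\n", tid);
    }

    int main() {
        hello<<<2, 4>>>();          // 2 blocks of 4 threads each
        hipDeviceSynchronize();     // wait for the kernel (and its printf output) to finish
        return 0;
    }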

6. Understanding Thread Hierarchies

Modifying Kernels

  • Accessing thread and block indices within kernels
  • Experimenting with different grid and block dimensions
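
As a quick illustration, a sketch of the usual indexing idiom: each thread derives a unique global index from its block and thread coordinates, and the host rounds the grid size up so every element is covered (the values of n and blockSize here are arbitrary):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void show_index(int n) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // unique index across the grid
        if (gid < n)                                       // guard against the padded last block
            printf("block %d, thread %d -> global %d\n", blockIdx.x, threadIdx.x, gid);
    }

    int main() {
        int n = 10, blockSize = 4;
        int gridSize = (n + blockSize - 1) / blockSize;    // round up: 3 blocks of 4 threads
        show_index<<<gridSize, blockSize>>>(n);
        hipDeviceSynchronize();
        return 0;
    }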

 

Afternoon Exercise

Exercise 2: Vector Addition

  • Implement vector addition on the GPU using HIP
  • Compare performance with a CPU implementation
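
One possible sketch of the GPU side of this exercise (error checking omitted for brevity):

    #include <hip/hip_runtime.h>
    #include <vector>
    #include <cstdio>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

        float *da, *db, *dc;
        hipMalloc(&da, n * sizeof(float));                 // device allocations
        hipMalloc(&db, n * sizeof(float));
        hipMalloc(&dc, n * sizeof(float));
        hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

        int block = 256, grid = (n + block - 1) / block;
        vec_add<<<grid, block>>>(da, db, dc, n);
        hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);

        printf("c[0] = %f (expected 3.0)\n", hc[0]);
        hipFree(da); hipFree(db); hipFree(dc);
        return 0;
    }

Timing the same loop on the CPU with std::chrono gives the comparison point asked for in the exercise.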

Evening Wrap-Up

Recap of Key Concepts

Q&A Session

Day 2: Advanced HIP Programming and Debugging Techniques

Morning Session

1. Deep Dive into HIP Programming

Memory Management in HIP

  • Allocating and freeing device memory (hipMalloc, hipFree)
  • Data transfer between host and device (hipMemcpy)
  • Introduction to Unified Memory
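
The explicit hipMalloc/hipMemcpy/hipFree pattern appears in the vector-addition sketch from Day 1; for the Unified Memory topic, a hedged sketch using managed allocations (assuming hipMallocManaged is supported by the target GPU and driver):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1024;
        float* x = nullptr;
        hipMallocManaged(&x, n * sizeof(float));   // one pointer visible to host and device
        for (int i = 0; i < n; ++i) x[i] = 1.0f;   // host writes directly, no hipMemcpy

        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
        hipDeviceSynchronize();                    // make device writes visible to the host

        printf("x[0] = %f (expected 2.0)\n", x[0]);
        hipFree(x);
        return 0;
    }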

Synchronization and Execution Control

  • Thread coordination with __syncthreads()
  • Understanding memory consistency and memory fences
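
To illustrate block-level coordination, a kernel-only sketch that reduces each block's 256 inputs to one partial sum in shared memory (launch with a block size of 256; the host then sums the per-block results):

    #include <hip/hip_runtime.h>

    __global__ void block_sum(const float* in, float* out, int n) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                   // all loads finished before reading tile

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();                               // each step must complete block-wide
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];   // thread 0 writes the block's result
    }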

2. Debugging HIP Applications

Common Errors and Bugs

  • Identifying typical issues in GPU programming

Debugging Tools in ROCm

  • Using rocgdb for kernel debugging
  • Setting breakpoints and inspecting variables

Best Practices in Debugging

  • Strategies for efficient debugging
  • Interpreting error messages and fixing bugs

Morning Exercise

Exercise 3: Debugging Practice

  • Introduce intentional bugs in a HIP program
  • Use debugging tools to identify and fix the issues

 

Afternoon Session

3. Leveraging ROCm Libraries

Overview of ROCm Math Libraries

  • Utilizing rocBLAS, rocFFT, and others for complex computations
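
As a small illustration of calling a ROCm math library, a hedged sketch using rocBLAS's saxpy routine (y = alpha*x + y); the header path varies with the ROCm version, and linking requires -lrocblas:

    #include <hip/hip_runtime.h>
    #include <rocblas/rocblas.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 4;
        std::vector<float> hx = {1, 2, 3, 4}, hy = {10, 20, 30, 40};
        float *dx, *dy;
        hipMalloc(&dx, n * sizeof(float));
        hipMalloc(&dy, n * sizeof(float));
        hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

        rocblas_handle handle;
        rocblas_create_handle(&handle);
        const float alpha = 2.0f;
        rocblas_saxpy(handle, n, &alpha, dx, 1, dy, 1);   // y = 2*x + y, computed on the GPU

        hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
        printf("y[0] = %f (expected 12.0)\n", hy[0]);

        rocblas_destroy_handle(handle);
        hipFree(dx); hipFree(dy);
        return 0;
    }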

Collective Communication with RCCL

  • Introduction to RCCL (ROCm Communication Collectives Library)
  • Enabling communication between multiple GPUs

4. Multi-GPU Programming

Managing Multiple GPUs

  • Techniques for device selection and management
  • Workload distribution strategies

Collective Operations with RCCL

  • Implementing broadcast, reduce, and all-reduce operations
  • Synchronization across GPUs for efficient computation
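
RCCL follows the NCCL API closely. A hedged single-process sketch that all-reduces one float across every visible GPU (header path and linking, e.g. -lrccl, depend on the ROCm installation; device buffer cleanup omitted for brevity):

    #include <hip/hip_runtime.h>
    #include <rccl/rccl.h>            // RCCL exposes the NCCL-style API (ncclAllReduce, ...)
    #include <vector>
    #include <cstdio>

    int main() {
        int nDev = 0;
        hipGetDeviceCount(&nDev);

        std::vector<ncclComm_t> comms(nDev);
        std::vector<int> devs(nDev);
        for (int i = 0; i < nDev; ++i) devs[i] = i;
        ncclCommInitAll(comms.data(), nDev, devs.data());    // one communicator per GPU

        std::vector<float*> send(nDev), recv(nDev);
        std::vector<hipStream_t> streams(nDev);
        for (int i = 0; i < nDev; ++i) {
            hipSetDevice(i);                                 // select the i-th GPU
            hipMalloc(&send[i], sizeof(float));
            hipMalloc(&recv[i], sizeof(float));
            float one = 1.0f;
            hipMemcpy(send[i], &one, sizeof(float), hipMemcpyHostToDevice);
            hipStreamCreate(&streams[i]);
        }

        ncclGroupStart();                                    // issue the collective on all GPUs
        for (int i = 0; i < nDev; ++i)
            ncclAllReduce(send[i], recv[i], 1, ncclFloat, ncclSum, comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < nDev; ++i) {
            hipSetDevice(i);
            hipStreamSynchronize(streams[i]);
            float result;
            hipMemcpy(&result, recv[i], sizeof(float), hipMemcpyDeviceToHost);
            printf("GPU %d sees sum = %f\n", i, result);     // expected: number of GPUs
            ncclCommDestroy(comms[i]);
        }
        return 0;
    }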

Afternoon Exercise

Exercise 4: Multi-GPU Vector Addition with RCCL

  • Extend vector addition to run on multiple GPUs
  • Implement reduction operations using RCCL

Evening Wrap-Up

Recap of Key Concepts

Q&A Session

Day 3: Distributed Computing with MPI and RCCL

Morning Session

1. Quick Recap: DMA, RDMA, and P2P Communication

Understanding High-Speed Data Transfers

  • Brief overview of DMA, RDMA, and P2P communication
  • Relevance in GPU programming and data-intensive applications

2. Introduction to OpenMPI

Fundamentals of MPI

  • Concepts of processes, communicators, and messages
  • The role of MPI in distributed computing

Setting Up OpenMPI

  • Installation and configuration for GPU-accelerated environments

Writing MPI Programs

  • Basic MPI functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize
  • Sending and receiving messages with MPI_Send and MPI_Recv

Morning Exercise

Exercise 5: Hello World with MPI

  • Write a simple MPI program to print messages from multiple processes
  • Compile and run the program across multiple nodes
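
A minimal sketch of this exercise using the standard MPI C API (build with mpicxx, run with something like mpirun -np 4 ./hello_mpi; the send/receive part assumes at least two ranks):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                       // start the MPI runtime

        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);         // total number of processes
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);         // this process's id

        printf("Hello from rank %d of %d\n", rank, size);

        if (rank == 0) {
            int token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // rank 0 -> rank 1
        } else if (rank == 1) {
            int token;
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received token %d\n", token);
        }

        MPI_Finalize();
        return 0;
    }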

Afternoon Session

3. Integrating HIP with OpenMPI

Combining HIP and MPI

  • Strategies for integrating GPU computations with MPI
  • Managing data distribution and collection
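
One common pattern, sketched here under the assumption of one local GPU per MPI rank: use the rank to select a device, compute locally, then combine results with an MPI reduction (build details, e.g. hipcc plus the MPI compiler wrappers, depend on the cluster setup):

    #include <mpi.h>
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void fill(float* x, float v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = v;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int nDev = 0;
        hipGetDeviceCount(&nDev);
        hipSetDevice(rank % nDev);                 // map each rank to a local GPU

        const int n = 256;
        float* d = nullptr;
        hipMalloc(&d, n * sizeof(float));
        fill<<<1, n>>>(d, 1.0f, n);                // each rank computes on its own GPU

        float h[n], local = 0;
        hipMemcpy(h, d, n * sizeof(float), hipMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) local += h[i];

        float global = 0;
        MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %f\n", global);

        hipFree(d);
        MPI_Finalize();
        return 0;
    }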

Commonalities Between MPI and RCCL

  • Understanding how RCCL complements MPI in GPU environments
  • When to use MPI vs. RCCL for communication

4. Implementing Distributed GPU Applications

Designing Distributed Algorithms

  • Decomposing problems for parallel execution
  • Minimizing communication overhead

Data Management Across Nodes

  • Efficient data transfer techniques
  • Synchronization and consistency considerations

Afternoon Exercise

Exercise 6: Distributed Matrix Multiplication

  • Implement matrix multiplication across multiple nodes using MPI and HIP
  • Use RCCL for intra-node communication and MPI for inter-node communication

Evening Wrap-Up

Recap of Key Concepts

Q&A Session

Day 4: Performance Optimization and Advanced Topics

Morning Session

1. Performance Optimization Techniques

Memory Coalescing

  • Strategies for efficient global memory access
  • Aligning data structures for optimal performance
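
A kernel-only sketch contrasting coalesced and strided access: when adjacent threads read adjacent addresses, the hardware can combine a wavefront's loads into a few wide memory transactions.

    #include <hip/hip_runtime.h>

    // Coalesced: thread i touches element i, so a wavefront reads one contiguous span.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i touches element i*stride, scattering the wavefront's accesses
    // across many memory segments and wasting bandwidth.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }

Timing both variants with hipEvent timers (see the stream and event sketch later in Day 4) makes the bandwidth difference directly visible.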

Maximizing GPU Occupancy

  • Understanding the factors affecting occupancy
  • Tuning block sizes and grid dimensions
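
HIP provides an occupancy helper that mirrors the CUDA runtime; assuming hipOccupancyMaxPotentialBlockSize is available in the installed ROCm version, a suggested block size can be queried like this:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int minGrid = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy for this kernel.
        hipOccupancyMaxPotentialBlockSize(&minGrid, &blockSize, saxpy, 0, 0);
        printf("suggested block size: %d (minimum grid size: %d)\n", blockSize, minGrid);
        return 0;
    }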

Shared Memory Optimization

  • Leveraging shared memory to reduce global memory access
  • Avoiding bank conflicts and optimizing memory patterns
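
A classic illustration is a tiled matrix transpose: the tile is staged in shared memory so both the global reads and writes stay coalesced, and padding the tile by one column keeps threads in a wavefront out of the same shared-memory bank.

    #include <hip/hip_runtime.h>

    #define TILE 32

    // Transpose a width x height matrix; launch with dim3(TILE, TILE) threads per block
    // and a grid of ceil(width/TILE) x ceil(height/TILE) blocks.
    __global__ void transpose(const float* in, float* out, int width, int height) {
        __shared__ float tile[TILE][TILE + 1];      // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;        // swap block coordinates for the write
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }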

2. Advanced Synchronization Techniques

Atomic Operations

  • Using atomic functions for synchronization
  • Performance implications of atomic operations
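
A kernel-only sketch of the canonical use case, a histogram: many threads may update the same bin concurrently, and atomicAdd keeps the counts correct at the cost of serializing conflicting updates.

    #include <hip/hip_runtime.h>

    // Each thread classifies one input byte and bumps the matching bin.
    // bins must point to 256 zero-initialized counters.
    __global__ void histogram(const unsigned char* data, unsigned int* bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // safe even when threads hit the same bin
    }

A common follow-on optimization is a privatized per-block histogram in shared memory that is merged into global memory once per block, reducing contention on hot bins.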

Stream and Event Management

  • Overlapping computation and data transfer
  • Managing multiple streams for concurrency
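
A sketch of overlapping a host-to-device copy with an independent kernel on another stream, timed with events; pinned host memory via hipHostMalloc is assumed, since asynchronous copies require page-locked buffers:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void busy(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 22;
        const size_t bytes = n * sizeof(float);

        float *h, *d0, *d1;
        hipHostMalloc(&h, bytes, hipHostMallocDefault);   // pinned memory enables async copies
        hipMalloc(&d0, bytes);
        hipMalloc(&d1, bytes);

        hipStream_t s0, s1;
        hipStreamCreate(&s0);
        hipStreamCreate(&s1);

        hipEvent_t start, stop;
        hipEventCreate(&start);
        hipEventCreate(&stop);

        hipEventRecord(start, 0);
        hipMemcpyAsync(d0, h, bytes, hipMemcpyHostToDevice, s0);   // copy on stream 0 ...
        busy<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);              // ... while stream 1 computes
        hipStreamSynchronize(s0);                                  // wait for both streams
        hipStreamSynchronize(s1);
        hipEventRecord(stop, 0);
        hipEventSynchronize(stop);

        float ms = 0.0f;
        hipEventElapsedTime(&ms, start, stop);
        printf("overlapped copy + kernel took %.3f ms\n", ms);

        hipEventDestroy(start); hipEventDestroy(stop);
        hipStreamDestroy(s0); hipStreamDestroy(s1);
        hipHostFree(h); hipFree(d0); hipFree(d1);
        return 0;
    }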

Morning Exercise

Exercise 7: Optimizing Kernel Performance

  • Profile a HIP kernel to identify bottlenecks
  • Apply optimization techniques to improve performance

Afternoon Session

3. Profiling GPU Applications

Profiling Tools in ROCm

  • Using rocprof and rocm-smi for performance analysis

Interpreting Profiling Data

  • Understanding kernel execution metrics
  • Identifying and addressing performance bottlenecks

4. Advanced Topics

Asynchronous Execution and Streams

  • Managing asynchronous operations for better performance

HIP Graphs

  • Introduction to HIP Graphs for complex task scheduling
  • Benefits of using graphs for performance and resource management
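
Stream capture is usually the easiest entry point; assuming the HIP graph API mirrors CUDA Graphs (hipStreamBeginCapture, hipGraphInstantiate, hipGraphLaunch), a sketch looks like this:

    #include <hip/hip_runtime.h>

    __global__ void step(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        hipMalloc(&d, n * sizeof(float));

        hipStream_t stream;
        hipStreamCreate(&stream);

        // Record a fixed sequence of kernel launches into a graph instead of
        // paying the launch overhead for every kernel on every iteration.
        hipGraph_t graph;
        hipGraphExec_t graphExec;
        hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
        for (int k = 0; k < 10; ++k)
            step<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        hipStreamEndCapture(stream, &graph);
        hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        for (int iter = 0; iter < 100; ++iter)
            hipGraphLaunch(graphExec, stream);       // replay all 10 kernels with one call
        hipStreamSynchronize(stream);

        hipGraphExecDestroy(graphExec);
        hipGraphDestroy(graph);
        hipStreamDestroy(stream);
        hipFree(d);
        return 0;
    }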

Future Trends in GPU Computing

  • Emerging technologies and their potential impact
  • Direction of AMD ROCm and HIP developments

Afternoon Exercise

Exercise 8: Implementing HIP Graphs

  • Convert a sequential set of kernel executions into a HIP Graph
  • Analyze performance improvements

Evening Wrap-Up and Course Conclusion

Recap of Key Concepts

Final Q&A Session

Feedback and Next Steps