Prerequisites
- An understanding of the C/C++ language and parallel programming concepts
- Basic knowledge of computer architecture and memory hierarchy
- Experience with command-line tools and code editors
Audience
- Developers who wish to learn how to use ROCm and HIP to program AMD GPUs and exploit their parallelism
- Developers who wish to write high-performance and scalable code that can run on different AMD devices
- Programmers who wish to explore the low-level aspects of GPU programming and optimize their code performance
Overview
ROCm is an open-source platform for programming AMD GPUs that also provides interoperability with CUDA and OpenCL. ROCm exposes hardware details to the programmer and gives full control over the parallelization process; this, however, requires a solid understanding of the device architecture, memory model, execution model, and optimization techniques.
HIP is a C++ runtime API and kernel language that allows you to write portable code that can run on both AMD and NVIDIA GPUs. HIP provides a thin abstraction layer over the native GPU APIs, such as ROCm and CUDA, and allows you to leverage the existing GPU libraries and tools.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level developers who wish to use ROCm and HIP to program AMD GPUs and exploit their parallelism.
Day 1: Foundations of GPU Programming and the ROCm Platform
Morning Session
1. Welcome and Introduction
Course Overview
- Objectives and expectations
- The importance of GPU programming in high-throughput computing
Understanding the Challenge
- Limitations of CPU processing in data-intensive applications
- The need for efficient GPU-based data processing pipelines
2. Fundamentals of GPU Computing
GPU vs. CPU Architecture
- Exploring parallelism: GPU architecture deep dive
- Differences between SIMD and SIMT execution models
AMD GPU Architecture
- Overview of AMD's GPU hardware
- Memory hierarchy: global, shared, local, and constant memory
3. Core Concepts in GPU Programming
Thread Hierarchy
- Threads, warps (wavefronts), and blocks
- Organizing grids and blocks for optimal performance
SIMT Execution Model
- Single Instruction, Multiple Threads paradigm explained
Directive-Based Programming Models
- Introduction to OpenMP and OpenACC for GPU programming
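As a point of comparison with explicit kernel programming, a directive-based version of a vector addition can be as short as the sketch below. It assumes a compiler with OpenMP target-offload support for AMD GPUs (for example, amdclang++ with -fopenmp and an appropriate --offload-arch flag); the array names and sizes are illustrative.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the loop to the GPU; the map clauses move data to and from the device.
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```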
Morning Exercise
Exercise 1: Exploring GPU Capabilities
- Write a program to query and display GPU device properties
- Analyze hardware capabilities and limitations
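A minimal sketch of the kind of program this exercise asks for, using the HIP device-query API; the selection of properties printed here is illustrative:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    std::printf("Found %d HIP device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s\n", i, prop.name);
        std::printf("  Compute units:          %d\n", prop.multiProcessorCount);
        std::printf("  Global memory (GiB):    %.1f\n",
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        std::printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        std::printf("  Wavefront size:          %d\n", prop.warpSize);
        std::printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```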
Afternoon Session
4. Setting Up the Development Environment
Installing ROCm and HIP on Linux
- System requirements and installation steps
- Verifying the installation
Configuring Development Tools
- Setting up compilers, debuggers, and profilers
- Integrating ROCm and HIP with Visual Studio Code
5. Writing Your First HIP Program
Hello, GPU World!
- Understanding the structure of a HIP program
- Writing and executing a simple kernel
- Compilation and execution process
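A minimal "hello, GPU" program might look like the sketch below; it can typically be built with hipcc (for example, `hipcc hello.cpp -o hello`):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// A kernel is a function marked __global__ that runs on the device.
__global__ void hello_kernel() {
    printf("Hello from thread %d in block %d\n",
           (int)threadIdx.x, (int)blockIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each using the triple-chevron syntax.
    hello_kernel<<<2, 4>>>();
    // Kernel launches are asynchronous; wait for completion before exiting.
    hipDeviceSynchronize();
    return 0;
}
```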
6. Understanding Thread Hierarchies
Modifying Kernels
- Accessing thread and block indices within kernels
- Experimenting with different grid and block dimensions
Afternoon Exercise
Exercise 2: Vector Addition
- Implement vector addition on the GPU using HIP
- Compare performance with a CPU implementation
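A possible starting point for this exercise is sketched below: each thread computes its global index from its block and thread indices and handles one element. Buffer names and sizes are illustrative.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Each thread computes one element from its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;   // round up so every element is covered
    vec_add<<<grid, block>>>(da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("hc[0] = %f (expected 3.0)\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```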
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 2: Advanced HIP Programming and Debugging Techniques
Morning Session
1. Deep Dive into HIP Programming
Memory Management in HIP
- Allocating and freeing device memory (hipMalloc, hipFree)
- Data transfer between host and device (hipMemcpy)
- Introduction to Unified Memory
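The sketch below shows managed (unified) memory allocated with hipMallocManaged, which removes the explicit hipMemcpy step; actual page-migration behaviour depends on the GPU and driver.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // Managed memory is accessible from both host and device;
    // the runtime migrates pages on demand (support depends on the GPU and driver).
    hipMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    hipDeviceSynchronize();            // make device writes visible to the host

    std::printf("data[0] = %f (expected 2.0)\n", data[0]);
    hipFree(data);
    return 0;
}
```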
Synchronization and Execution Control
- Thread coordination with __syncthreads()
- Understanding memory consistency and memory fences
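A typical use of __syncthreads() is a block-level reduction staged through shared memory, as in this sketch (it assumes 256-thread blocks):

```cpp
#include <hip/hip_runtime.h>

// Each 256-thread block reduces 256 input elements to one partial sum.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // all loads must complete before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                   // every step needs a full-block barrier
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}
```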
2. Debugging HIP Applications
Common Errors and Bugs
- Identifying typical issues in GPU programming
Debugging Tools in ROCm
- Using rocgdb for kernel debugging
- Setting breakpoints and inspecting variables
Best Practices in Debugging
- Strategies for efficient debugging
- Interpreting error messages and fixing bugs
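One widely used debugging aid is to wrap every runtime call in an error-checking macro so failures surface immediately with a readable message. The macro name HIP_CHECK below is a local convention, not part of the HIP API:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a HIP runtime call fails.
#define HIP_CHECK(expr)                                                     \
    do {                                                                    \
        hipError_t err_ = (expr);                                           \
        if (err_ != hipSuccess) {                                           \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",                 \
                         hipGetErrorString(err_), __FILE__, __LINE__);      \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

int main() {
    float* d = nullptr;
    HIP_CHECK(hipMalloc(&d, 1 << 20));
    // Kernel launch errors are reported asynchronously; query them explicitly.
    HIP_CHECK(hipGetLastError());
    HIP_CHECK(hipDeviceSynchronize());
    HIP_CHECK(hipFree(d));
    return 0;
}
```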
Morning Exercise
Exercise 3: Debugging Practice
- Introduce intentional bugs in a HIP program
- Use debugging tools to identify and fix the issues
Afternoon Session
3. Leveraging ROCm Libraries
Overview of ROCm Math Libraries
- Utilizing rocBLAS, rocFFT, and others for complex computations
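As an illustration, a single-precision AXPY (y = alpha * x + y) with rocBLAS might look like the sketch below; the header path and link flags (e.g. -lrocblas) can vary between ROCm releases:

```cpp
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>   // older ROCm releases expose <rocblas.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // y = alpha * x + y, computed entirely on the device
    const float alpha = 3.0f;
    rocblas_saxpy(handle, n, &alpha, dx, 1, dy, 1);

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("hy[0] = %f (expected 5.0)\n", hy[0]);

    rocblas_destroy_handle(handle);
    hipFree(dx); hipFree(dy);
    return 0;
}
```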
Collective Communication with RCCL
- Introduction to RCCL (ROCm Communication Collectives Library)
- Enabling communication between multiple GPUs
4. Multi-GPU Programming
Managing Multiple GPUs
- Techniques for device selection and management
- Workload distribution strategies
Collective Operations with RCCL
- Implementing broadcast, reduce, and all-reduce operations
- Synchronization across GPUs for efficient computation
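RCCL mirrors the NCCL API, so a single-process all-reduce across two GPUs can be sketched as below; it assumes at least two visible devices, and the header path may differ between installs:

```cpp
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>    // some installs expose <rccl.h>
#include <vector>

int main() {
    const int ndev = 2;                        // assumes at least two GPUs
    const int n = 1 << 20;
    int devs[ndev] = {0, 1};

    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, devs); // one communicator per device

    std::vector<float*> buf(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(&buf[i], n * sizeof(float));
        hipStreamCreate(&streams[i]);
    }

    // Sum the buffers across all GPUs; every device receives the result in place.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], n, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
        hipFree(buf[i]);
        hipStreamDestroy(streams[i]);
    }
    return 0;
}
```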
Afternoon Exercise
Exercise 4: Multi-GPU Vector Addition with RCCL
- Extend vector addition to run on multiple GPUs
- Implement reduction operations using RCCL
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 3: Distributed Computing with MPI and RCCL
Morning Session
1. Quick Recap: DMA, RDMA, and P2P Communication
Understanding High-Speed Data Transfers
- Brief overview of DMA, RDMA, and P2P communication
- Relevance in GPU programming and data-intensive applications
2. Introduction to Open MPI
Fundamentals of MPI
- Concepts of processes, communicators, and messages
- The role of MPI in distributed computing
Setting Up Open MPI
- Installation and configuration for GPU-accelerated environments
Writing MPI Programs
- Basic MPI functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize
- Sending and receiving messages with MPI_Send and MPI_Recv
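A minimal MPI program covering these calls (and a possible starting point for Exercise 5 below) might look like the following sketch; it can typically be built with mpicxx and run with `mpirun -np 4`:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    std::printf("Hello from rank %d of %d\n", rank, size);

    // Simple point-to-point exchange: rank 0 sends a value to rank 1.
    if (size > 1) {
        int value = 42;
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("Rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```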
Morning Exercise
Exercise 5: Hello World with MPI
- Write a simple MPI program to print messages from multiple processes
- Compile and run the program across multiple nodes
Afternoon Session
3. Integrating HIP with Open MPI
Combining HIP and MPI
- Strategies for integrating GPU computations with MPI
- Managing data distribution and collection
Commonalities Between MPI and RCCL
- Understanding how RCCL complements MPI in GPU environments
- When to use MPI vs. RCCL for communication
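With a GPU-aware Open MPI build (for example, one built with ROCm support through UCX), device pointers can be passed directly to MPI calls, as in this illustrative sketch; without such a build, data must first be staged through host buffers. It assumes at least two ranks and one visible GPU per rank:

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One GPU per rank is a common mapping; adjust to the node's layout.
    int ngpus = 0;
    hipGetDeviceCount(&ngpus);
    hipSetDevice(rank % ngpus);

    const int n = 1 << 20;
    float* dbuf = nullptr;
    hipMalloc(&dbuf, n * sizeof(float));
    hipMemset(dbuf, 0, n * sizeof(float));

    // With a GPU-aware MPI build, device pointers can be handed directly
    // to MPI calls; the library moves the data over PCIe/Infinity Fabric/RDMA.
    if (rank == 0)
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    hipFree(dbuf);
    MPI_Finalize();
    return 0;
}
```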
4. Implementing Distributed GPU Applications
Designing Distributed Algorithms
- Decomposing problems for parallel execution
- Minimizing communication overhead
Data Management Across Nodes
- Efficient data transfer techniques
- Synchronization and consistency considerations
Afternoon Exercise
Exercise 6: Distributed Matrix Multiplication
- Implement matrix multiplication across multiple nodes using MPI and HIP
- Use RCCL for intra-node communication and MPI for inter-node communication
Evening Wrap-Up
Recap of Key Concepts
Q&A Session
Day 4: Performance Optimization and Advanced Topics
Morning Session
1. Performance Optimization Techniques
Memory Coalescing
- Strategies for efficient global memory access
- Aligning data structures for optimal performance
Maximizing GPU Occupancy
- Understanding the factors affecting occupancy
- Tuning block sizes and grid dimensions
Shared Memory Optimization
- Leveraging shared memory to reduce global memory access
- Avoiding bank conflicts and optimizing memory patterns
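A classic example that combines coalesced global accesses, shared-memory tiling, and bank-conflict avoidance is the tiled matrix transpose sketched below; the +1 padding on the tile is what avoids the conflicts:

```cpp
#include <hip/hip_runtime.h>

constexpr int TILE = 32;

// Transpose through a shared-memory tile so that both the read from and the
// write to global memory are coalesced. The +1 padding shifts each row into a
// different bank, avoiding conflicts on the transposed access pattern.
// Launch with dim3 block(TILE, TILE) and a grid covering the width x height matrix.
__global__ void transpose_tiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Swap block coordinates so the output write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```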
2. Advanced Synchronization Techniques
Atomic Operations
- Using atomic functions for synchronization
- Performance implications of atomic operations
Stream and Event Management
- Overlapping computation and data transfer
- Managing multiple streams for concurrency
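The sketch below splits a buffer across two streams so that the host-device copies in one stream can overlap with the kernel running in the other; pinned host memory (hipHostMalloc) is used because pageable memory defeats asynchronous copies:

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 22, half = n / 2;

    // Pinned host memory is required for truly asynchronous copies.
    float* hp;
    hipHostMalloc(&hp, n * sizeof(float));
    for (int i = 0; i < n; ++i) hp[i] = 1.0f;

    float* d;
    hipMalloc(&d, n * sizeof(float));

    hipStream_t s[2];
    hipStreamCreate(&s[0]);
    hipStreamCreate(&s[1]);

    // Each stream copies and processes its own half, so the copy in one
    // stream can overlap with the kernel running in the other.
    for (int i = 0; i < 2; ++i) {
        float* dst = d  + i * half;
        float* src = hp + i * half;
        hipMemcpyAsync(dst, src, half * sizeof(float), hipMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(dst, 2.0f, half);
        hipMemcpyAsync(src, dst, half * sizeof(float), hipMemcpyDeviceToHost, s[i]);
    }
    hipStreamSynchronize(s[0]);
    hipStreamSynchronize(s[1]);

    hipStreamDestroy(s[0]);
    hipStreamDestroy(s[1]);
    hipHostFree(hp);
    hipFree(d);
    return 0;
}
```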
Morning Exercise
Exercise 7: Optimizing Kernel Performance
- Profile a HIP kernel to identify bottlenecks
- Apply optimization techniques to improve performance
Afternoon Session
3. Profiling GPU Applications
Profiling Tools in ROCm
- Using rocprof and rocm-smi for performance analysis
Interpreting Profiling Data
- Understanding kernel execution metrics
- Identifying and addressing performance bottlenecks
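rocprof and rocm-smi work from the command line; a complementary in-code technique is timing individual kernels with HIP events, as sketched here:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void busy(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* d;
    hipMalloc(&d, n * sizeof(float));
    hipMemset(d, 0, n * sizeof(float));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start);                 // mark the point before the kernel
    busy<<<(n + 255) / 256, 256>>>(d, n);
    hipEventRecord(stop);                  // mark the point after the kernel
    hipEventSynchronize(stop);             // wait until the stop event completes

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("Kernel time: %.3f ms\n", ms);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d);
    return 0;
}
```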
4. Advanced Topics
Asynchronous Execution and Streams
- Managing asynchronous operations for better performance
HIP Graphs
- Introduction to HIP Graphs for complex task scheduling
- Benefits of using graphs for performance and resource management
Future Trends in GPU Computing
- Emerging technologies and their potential impact
- Direction of AMD ROCm and HIP developments
Afternoon Exercise
Exercise 8: Implementing HIP Graphs
- Convert a sequential set of kernel executions into a HIP Graph
- Analyze performance improvements
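A common route into HIP Graphs, and one possible approach for this exercise, is stream capture: record an existing sequence of launches into a graph, instantiate it once, and replay it with very low per-launch overhead. The sketch below assumes a repeated kernel sequence; names and sizes are illustrative:

```cpp
#include <hip/hip_runtime.h>

__global__ void step(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d;
    hipMalloc(&d, n * sizeof(float));

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Record a fixed sequence of kernel launches into a graph.
    hipGraph_t graph;
    hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)
        step<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    hipStreamEndCapture(stream, &graph);

    // Instantiate once, then launch many times with minimal launch overhead.
    hipGraphExec_t exec;
    hipGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter)
        hipGraphLaunch(exec, stream);
    hipStreamSynchronize(stream);

    hipGraphExecDestroy(exec);
    hipGraphDestroy(graph);
    hipStreamDestroy(stream);
    hipFree(d);
    return 0;
}
```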
Evening Wrap-Up and Course Conclusion
Recap of Key Concepts
Final Q&A Session
Feedback and Next Steps