ROCm 7 follows the Linux FHS layout. If hipcc reports:

```
cannot find HIP runtime; provide its path via '--rocm-path'
```

set `ROCM_PATH` to your ROCm root (it defaults to `/opt/rocm`) or let the Makefile auto-detect it from hipcc.
Examples:
- Export a custom path: `export ROCM_PATH=/opt/rocm`
- Verify the headers exist: `ls $ROCM_PATH/include/hip/hip_runtime.h`
Reference: ROCm File Structure Reorg docs.
This directory contains practical examples that accompany Module 1 of the GPU Programming 101 course. These examples demonstrate the core concepts of CUDA and HIP programming.
| File | Description | Key Concepts |
|---|---|---|
| `01_vector_addition_cuda.cu` | Basic CUDA vector addition with error handling | Kernels, memory management, error checking |
| `03_matrix_addition_cuda.cu` | 2D matrix addition with thread indexing | 2D threading, grid configuration |
| `04_device_info_cuda.cu` | Query and display GPU properties | Device queries, capability checking |
| `05_performance_comparison.cu` | CPU vs. GPU performance benchmark | Performance analysis, timing |
| `06_debug_example.cu` | Debugging techniques and occupancy analysis | Error handling, profiling, optimization |
| File | Description | Key Concepts |
|---|---|---|
| `02_vector_addition_hip.cpp` | Cross-platform vector addition using HIP | HIP API, portability |
| `03_matrix_addition_hip.cpp` | 2D matrix addition with HIP | Cross-platform 2D threading |
| `04_device_info_hip.cpp` | HIP device properties and platform detection | HIP device queries, platform abstraction |
| `05_performance_comparison_hip.cpp` | HIP performance analysis with bandwidth testing | HIP performance, memory bandwidth |
| `06_debug_example_hip.cpp` | HIP debugging and optimization techniques | HIP debugging, occupancy analysis |
| `07_cross_platform_comparison.cpp` | AMD vs. NVIDIA optimization comparison | Platform-specific optimizations, portability |
**CUDA examples:**

- NVIDIA GPU with compute capability 5.0+
- NVIDIA driver 580+ recommended
- CUDA Toolkit 13.0+ (the Docker image uses CUDA 13.0.1)
- GCC/Clang compiler

**HIP examples:**

- AMD GPU with ROCm support OR NVIDIA GPU
- ROCm 7.0+ (for AMD) or CUDA 13.0+ (for the NVIDIA backend)
- HIP compiler (hipcc)
```shell
# Build all CUDA examples
make cuda

# Build all HIP examples
make hip

# Run tests
make test

# Clean build files
make clean

# Show help
make help
```

Binaries are written to `build/` by the Makefile.
CUDA examples:

```shell
nvcc -o build/01_vector_addition_cuda 01_vector_addition_cuda.cu
nvcc -o build/03_matrix_addition_cuda 03_matrix_addition_cuda.cu
nvcc -o build/04_device_info_cuda 04_device_info_cuda.cu
nvcc -o build/05_performance_comparison_cuda 05_performance_comparison.cu
nvcc -o build/06_debug_example_cuda 06_debug_example.cu
```

HIP examples:

```shell
hipcc -o build/02_vector_addition_hip 02_vector_addition_hip.cpp
hipcc -o build/03_matrix_addition_hip 03_matrix_addition_hip.cpp
hipcc -o build/04_device_info_hip 04_device_info_hip.cpp
hipcc -o build/05_performance_comparison_hip 05_performance_comparison_hip.cpp
hipcc -o build/06_debug_example_hip 06_debug_example_hip.cpp
hipcc -o build/07_cross_platform_comparison 07_cross_platform_comparison.cpp
```

File: `01_vector_addition_cuda.cu`
Demonstrates:
- Basic kernel structure
- Memory allocation and transfer
- Thread indexing
- Error handling with macros
Usage:

```shell
make
./build/01_vector_addition_cuda
```

Expected output:

```
Launching kernel with 4 blocks of 256 threads each
Verification (first 5 elements):
0.000 + 1.000 = 1.000
0.708 + 0.293 = 1.000
...
Vector addition completed successfully!
```
File: 02_vector_addition_hip.cpp
Demonstrates:
- Cross-platform GPU programming
- HIP API usage
- Device property queries
- Portability between AMD and NVIDIA
Usage:

```shell
make hip
./build/02_vector_addition_hip
```

File: `03_matrix_addition_cuda.cu`
Demonstrates:
- 2D thread indexing
- 2D grid and block configuration
- Boundary checking for matrices
- Performance with larger datasets
Usage:

```shell
make
./build/03_matrix_addition_cuda
```

File: `03_matrix_addition_hip.cpp`
Demonstrates:
- Cross-platform 2D matrix operations
- HIP-specific device queries
- Memory usage reporting
- Platform-agnostic thread indexing
Usage:

```shell
make hip
./build/03_matrix_addition_hip
```

File: `04_device_info_cuda.cu`
Demonstrates:
- Querying GPU capabilities
- Memory information
- Compute capability checking
- Multi-GPU systems
Usage:

```shell
make
./build/04_device_info_cuda
```

File: `04_device_info_hip.cpp`
Demonstrates:
- Cross-platform device queries
- AMD vs NVIDIA feature detection
- HIP runtime and driver versions
- Platform-specific properties
Usage:

```shell
make hip
./build/04_device_info_hip
```

File: `05_performance_comparison.cu`
Demonstrates:
- CPU vs GPU benchmarking
- Memory bandwidth analysis
- Performance scaling with problem size
- Timing with CUDA events
Usage:

```shell
make
./build/05_performance_comparison_cuda
```

File: `05_performance_comparison_hip.cpp`
Demonstrates:
- Cross-platform performance analysis
- HIP event-based timing
- Memory bandwidth efficiency
- Block size optimization analysis
- Platform-specific performance characteristics
Usage:

```shell
make hip
./build/05_performance_comparison_hip
```

File: `06_debug_example.cu`
Demonstrates:
- Debug printf in kernels
- Occupancy analysis
- Shared memory usage
- Error checking techniques
Usage:

```shell
make debug
./build/06_debug_example_cuda
```

File: `06_debug_example_hip.cpp`
Demonstrates:
- HIP debugging techniques
- Cross-platform occupancy analysis
- HIP event timing
- Platform-specific feature detection
- Warp-level operations
Usage:

```shell
make debug hip
./build/06_debug_example_hip
```

File: `07_cross_platform_comparison.cpp`
Demonstrates:
- Writing portable HIP code
- AMD vs NVIDIA optimizations
- Platform-specific feature detection
- Performance comparison across platforms
- Memory bandwidth analysis
Usage:

```shell
make cross_platform
./cross_platform
```

"nvcc: command not found"

```shell
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

"No CUDA-capable device"
- Check `nvidia-smi` output
- Verify the driver installation
- Ensure the GPU is not in exclusive compute mode
HIP compilation issues

```shell
# For AMD GPUs
export HIP_PLATFORM=amd

# For NVIDIA GPUs
export HIP_PLATFORM=nvidia
```

"CUDA error: out of memory"
- Reduce problem size
- Check available memory with device_info example
- Use memory profiling tools
"CUDA error: invalid configuration"
- Check block size limits with device_info
- Ensure block size ≤ 1024 threads
- Verify grid dimensions are within limits
- **Block Size Selection**
  - Use multiples of 32 (the warp size)
  - Common choices: 128, 256, 512
  - Use the occupancy calculator for tuning
- **Memory Access**
  - Prefer coalesced memory access patterns
  - Minimize CPU-GPU data transfers
  - Use the appropriate memory space (global, shared, constant)
- **Thread Divergence**
  - Avoid conditional branches within warps
  - Restructure algorithms to minimize divergence
- Start with `01_vector_addition_cuda.cu` to understand the basics
- Explore `04_device_info_cuda.cu` to learn about your GPU
- Try `03_matrix_addition_cuda.cu` for 2D indexing
- Run `05_performance_comparison.cu` to see GPU advantages
- Use `06_debug_example.cu` to learn debugging techniques
- Experiment with `02_vector_addition_hip.cpp` for portability
- Start with `02_vector_addition_hip.cpp` for HIP basics
- Explore `04_device_info_hip.cpp` to understand your platform
- Try `03_matrix_addition_hip.cpp` for 2D operations
- Run `05_performance_comparison_hip.cpp` for benchmarking
- Use `06_debug_example_hip.cpp` for debugging techniques
- Test `07_cross_platform_comparison.cpp` for optimization
- Run corresponding CUDA and HIP examples side by side
- Compare performance characteristics
- Test portability with `07_cross_platform_comparison.cpp`
- Experiment with platform-specific optimizations
Try these modifications to deepen your understanding:
- Modify vector addition to use different block sizes (64, 512, 1024)
- Add timing to measure kernel execution time
- Implement element-wise mathematical operations (sin, cos, exp)
- Create a 2D matrix multiplication kernel
- Add input validation and better error handling
- Port CUDA examples to HIP using hipify tools
Happy GPU programming! 🚀