This directory contains practical examples for GPU memory optimization techniques using both CUDA and HIP. These examples accompany Module 2 and focus on shared memory tiling, memory coalescing, texture/read-only memory, unified memory, and bandwidth optimization.
- Master shared memory optimization for high-performance computing
- Understand memory coalescing and its impact on bandwidth
- Learn texture memory usage for spatial locality patterns
- Explore unified memory programming models
- Optimize memory bandwidth utilization
- Analyze memory access patterns and bottlenecks
Files: 01_shared_memory_transpose_cuda.cu, 01_shared_memory_transpose_hip.cpp
Demonstrates shared memory optimization through matrix transpose:
- Naive vs shared memory implementations
- Bank conflict avoidance techniques
- Cross-platform HIP optimizations (AMD vs NVIDIA)
- Performance analysis and bandwidth measurements
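A minimal sketch of the core technique, a tiled transpose staged through shared memory with the tile padded by one column to avoid bank conflicts (tile size and names are illustrative, not taken from the example files):

```cuda
#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transposeShared(float *out, const float *in, int width)
{
    // The +1 padding shifts each row to a different bank, so the
    // column-wise reads during the write phase do not serialize.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory into the shared-memory tile.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the global write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

Launched with a `dim3(TILE_DIM, BLOCK_ROWS)` block over a square matrix whose width is a multiple of TILE_DIM; the example files handle the general case and the naive baseline for comparison.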
Files: 02_memory_coalescing_cuda.cu, 02_memory_coalescing_hip.cpp
Comprehensive analysis of memory access patterns:
- Structure of Arrays (SoA) vs Array of Structures (AoS)
- Strided memory access patterns
- Vectorized memory operations (float4)
- Platform-specific optimizations
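The access-pattern contrast in a nutshell, as a sketch with illustrative struct and kernel names (not the ones used in the example files): with SoA, consecutive threads touch consecutive addresses and each warp issues fully coalesced transactions, while the float4 variant moves 16 bytes per thread in a single vectorized load/store.

```cuda
struct ParticleAoS  { float x, y, z, w; };   // Array of Structures
struct ParticlesSoA { float *x, *y, *z, *w; }; // Structure of Arrays

__global__ void scaleAoS(ParticleAoS *p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;   // neighboring threads read addresses 16 bytes apart
}

__global__ void scaleSoA(ParticlesSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;   // neighboring threads read consecutive floats
}

__global__ void scaleVec4(float4 *v, float s, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 t = v[i];      // single 16-byte vectorized load
        t.x *= s; t.y *= s; t.z *= s; t.w *= s;
        v[i] = t;             // single 16-byte vectorized store
    }
}
```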
File: 03_texture_memory_cuda.cu
Advanced texture memory usage for spatial locality:
- Modern texture object API
- Hardware interpolation and filtering
- Boundary condition handling
- Texture-based convolution and image processing
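A minimal sketch of the texture object API, assuming a 2D float image already resident in a `cudaArray` and omitting error checking; the example file covers the full pipeline, including filtering modes and boundary handling:

```cuda
#include <cuda_runtime.h>

cudaTextureObject_t makeTexture(cudaArray_t array)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;  // clamp at image borders
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;  // hardware bilinear interpolation
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;                     // coordinates in texel units

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

__global__ void sampleKernel(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // Reads go through the texture cache; +0.5f samples texel centers.
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}
```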
File: 04_unified_memory_cuda.cu
Unified Memory programming model and optimization:
- Automatic data migration between CPU and GPU
- Memory prefetching and hints
- Performance comparison with explicit management
- Memory pool management
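A minimal sketch of managed allocation with prefetching and a placement hint, assuming a single-GPU system (device 0) and omitting error checking. Note the classic int-device overloads of `cudaMemPrefetchAsync`/`cudaMemAdvise` shown here may need adjusting to the `cudaMemLocation`-based variants on newer toolkits:

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // accessible from CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 0.0f;    // CPU touches the pages first

    int device = 0;
    // Hint the preferred location, then migrate the pages before the launch
    // so the kernel does not pay for on-demand page faults.
    cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    addOne<<<(n + 255) / 256, 256>>>(data, n);

    // Prefetch back to the host before the CPU reads the results.
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```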
File: 05_memory_bandwidth_optimization_cuda.cu
Comprehensive memory bandwidth optimization techniques:
- Vectorized memory operations
- Streaming memory patterns
- Pinned memory benefits
- Memory hierarchy utilization
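A minimal sketch of the pinned-memory technique, with illustrative buffer names and error checking omitted: page-locked host memory lets `cudaMemcpyAsync` perform a true DMA transfer that can overlap with other work on the stream, instead of staging through an internal bounce buffer:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *hostBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);   // pinned (page-locked) host allocation
    float *devBuf = nullptr;
    cudaMalloc(&devBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous host-to-device copy; with pageable memory this call
    // would be staged and effectively synchronous with respect to the host.
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```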
- CUDA Toolkit 13.0+ (for CUDA examples)
- ROCm 7.0+ (for HIP examples)
- Compatible GPU (NVIDIA or AMD)
- C++17 compatible compiler
# Build all examples (auto-detects your GPU)
make
# Run performance tests
make test
# Build specific example
make 01_shared_memory_transpose_cuda
NVIDIA Nsight Compute:
# Memory bandwidth analysis
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed ./build/01_shared_memory_transpose_cuda
# Memory coalescing efficiency
ncu --metrics l1tex__throughput.avg.pct_of_peak_sustained_elapsed ./build/02_memory_coalescing_cuda
AMD ROCProfiler:
# HIP memory analysis
rocprof --hip-trace ./build/01_shared_memory_transpose_hip
# Detailed memory metrics
rocprof --stats ./build/02_memory_coalescing_hip
- Shared Memory Optimizations: 2-5x speedup over naive approaches
- Memory Coalescing: 2-10x performance difference between coalesced and strided access
- Texture Memory: 1.5-3x speedup for spatial locality patterns
These examples are designed to be educational and performance-oriented. Use the provided Docker environment for consistent toolchains (CUDA 13.0.1, ROCm 7.0). Binaries are emitted to the build/ directory by the Makefile.