Progressive CUDA SGEMM tutorial and reference implementation. The repository contains five hand-written kernel variants, cuBLAS-backed verification, a benchmark harness, and OpenSpec-governed repository rules for keeping the project compact and trustworthy.
- Show the optimization ladder clearly: naive -> tiled -> bank-conflict-free -> double-buffered -> Tensor Core WMMA
- Stay readable: each optimization lives in its own kernel file and keeps a consistent launch interface
- Stay verifiable: kernels are checked against cuBLAS, with separate tolerances for FP32 and Tensor Core paths
- Stay maintainable: the repository uses OpenSpec to keep docs, workflow, and validation rules aligned
| Stage | File | Main idea |
|---|---|---|
| Naive | src/kernels/naive_sgemm.cuh | Baseline triple-loop mapping |
| Tiled | src/kernels/tiled_sgemm.cuh | Shared-memory blocking |
| Bank-Free | src/kernels/bank_conflict_free_sgemm.cuh | [TILE_SIZE][TILE_SIZE+1] padding |
| Double Buffer | src/kernels/double_buffer_sgemm.cuh | Tile staging overlap and latency hiding |
| Tensor Core | src/kernels/tensor_core_sgemm.cuh | WMMA path with safe FP32 fallback |
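The `[TILE_SIZE][TILE_SIZE+1]` padding in the Bank-Free stage can be checked with plain index arithmetic on the host. The sketch below is not taken from the repository: it assumes `TILE_SIZE` is 32, that shared memory has 32 banks of 4-byte words, and that the 32 threads of a warp each read one element down a single tile column. The function name `distinct_banks_for_column_read` is hypothetical.

```cpp
#include <cassert>
#include <set>

// Assumption: 32 shared-memory banks, each one 4-byte word wide.
constexpr int kNumBanks = 32;
// Threads per warp.
constexpr int kWarpSize = 32;

// Counts the distinct banks touched when warp thread t reads tile[t][col]
// from a row-major float tile whose rows are row_stride elements apart.
// A result of 32 means conflict-free; a result of 1 means a 32-way conflict.
int distinct_banks_for_column_read(int row_stride, int col) {
    assert(row_stride > 0 && col >= 0 && col < row_stride);
    std::set<int> banks;
    for (int t = 0; t < kWarpSize; ++t) {
        const int word_index = t * row_stride + col;  // offset in 4-byte words
        banks.insert(word_index % kNumBanks);
    }
    return static_cast<int>(banks.size());
}
```

With a row stride of 32 (an unpadded `[32][32]` tile) every thread in the warp lands in the same bank, so the call returns 1. With a row stride of 33 (the padded `[32][33]` tile) thread `t` lands in bank `(t + col) % 32`, so the call returns 32 and the column read is conflict-free.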
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization
# Recommended: CMake
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build

# Quick local alternative
make GPU_ARCH=sm_86
make benchmark
make test

- Local GPU machine: runtime tests, correctness checks, and benchmarking
- GitHub Actions: format/style, CUDA compile validation, OpenSpec/repository checks, and Pages deployment
Standard FP32 kernels use rtol=1e-3, atol=1e-4. The Tensor Core path uses rtol=5e-2, atol=1e-2.
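To make the two tolerance pairs concrete, here is a minimal host-side sketch of an element-wise comparison. It assumes the common combined rule `|result - reference| <= atol + rtol * |reference|`; the repository's verification helper may combine the two tolerances differently. The names `Tolerance`, `kFp32Tolerance`, `kTensorCoreTolerance`, and `all_close` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical holder for the relative/absolute tolerance pair.
struct Tolerance {
    float rtol;
    float atol;
};

// Values quoted in the README text above.
constexpr Tolerance kFp32Tolerance{1e-3f, 1e-4f};
constexpr Tolerance kTensorCoreTolerance{5e-2f, 1e-2f};

// Assumed rule: |result - reference| <= atol + rtol * |reference| per element.
bool all_close(const std::vector<float>& result,
               const std::vector<float>& reference,
               Tolerance tol) {
    assert(result.size() == reference.size());
    for (std::size_t i = 0; i < result.size(); ++i) {
        const float diff = std::fabs(result[i] - reference[i]);
        const float bound = tol.atol + tol.rtol * std::fabs(reference[i]);
        if (!(diff <= bound)) {  // written this way so NaN also fails
            return false;
        }
    }
    return true;
}
```

Under this rule, a result of 100.5 against a reference of 100.0 fails the FP32 check (the bound is about 0.1) but passes the Tensor Core check (the bound is about 5.0), which reflects the looser budget allowed for reduced-precision inputs.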
- Getting Started
- Learning Path
- Architecture Overview
- Benchmark Notes
- Specifications Index
- GitHub Pages site
src/
├── kernels/ # Five SGEMM kernel variants
├── utils/ # CUDA RAII, verification, benchmark helpers
└── main.cu # Benchmark entry point
tests/
└── test_sgemm.cu # Google Test suite
docs/ # Public learning-oriented documentation
openspec/ # Stable specs, changes, and workflow guidance
Non-trivial repository changes are expected to follow:
/opsx:explore -> /opsx:propose "description" -> /opsx:apply -> /review -> /opsx:archive
The stable authoritative specs live under openspec/specs/. Active implementation plans live under openspec/changes/<change>/.
MIT. See LICENSE.