Summary
Add atomic memory operation words for safe concurrent updates from multiple threads.
Words to implement
Integer atomics
| Word | Stack effect | MLIR op | Description |
| --- | --- | --- | --- |
| ATOMIC+ | ( n addr -- ) | memref.atomic_rmw addi | Atomic add (i64) |
| ATOMIC-MAX | ( n addr -- ) | memref.atomic_rmw maxs | Atomic signed max (i64) |
| ATOMIC-MIN | ( n addr -- ) | memref.atomic_rmw mins | Atomic signed min (i64) |
| ATOMIC-AND | ( n addr -- ) | memref.atomic_rmw andi | Atomic bitwise AND |
| ATOMIC-OR | ( n addr -- ) | memref.atomic_rmw ori | Atomic bitwise OR |
| ATOMIC-XOR | ( n addr -- ) | memref.atomic_rmw xori | Atomic bitwise XOR |
| ATOMIC-XCHG | ( n addr -- old ) | memref.atomic_rmw assign | Atomic exchange, returns old value |
| ATOMIC-CAS | ( expected new addr -- old ) | memref.generic_atomic_rmw | Compare-and-swap, returns old value |
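To make the stack effects in the integer table concrete, here is a rough host-side model in C++ using `std::atomic`. It is only a sketch of the intended semantics, not the proposed lowering: the function names are hypothetical, and relaxed memory ordering is an assumption, since the issue does not specify an ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side models of four of the proposed words.
// Function names are illustrative and not part of the proposal.
// Relaxed ordering is an assumption; the issue does not state one.

// ATOMIC+ ( n addr -- ): add n to the cell; the old value is discarded.
inline void atomic_add(std::int64_t n, std::atomic<std::int64_t>& cell) {
    cell.fetch_add(n, std::memory_order_relaxed);
}

// ATOMIC-MAX ( n addr -- ): signed max, written here as a CAS loop.
inline void atomic_max(std::int64_t n, std::atomic<std::int64_t>& cell) {
    std::int64_t cur = cell.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads cur, so cur < n is re-checked.
    while (cur < n &&
           !cell.compare_exchange_weak(cur, n, std::memory_order_relaxed)) {
    }
}

// ATOMIC-XCHG ( n addr -- old ): store n, return the previous value.
inline std::int64_t atomic_xchg(std::int64_t n, std::atomic<std::int64_t>& cell) {
    return cell.exchange(n, std::memory_order_relaxed);
}

// ATOMIC-CAS ( expected new addr -- old ): store desired only if the cell
// holds expected; return the value observed in the cell either way.
inline std::int64_t atomic_cas(std::int64_t expected, std::int64_t desired,
                               std::atomic<std::int64_t>& cell) {
    // On success expected is unchanged (it equals the old value);
    // on failure it is overwritten with the current value.
    cell.compare_exchange_strong(expected, desired, std::memory_order_relaxed);
    return expected;
}

// Small walk-through of the four words on a single cell.
inline void demo() {
    std::atomic<std::int64_t> cell{10};
    atomic_add(5, cell);                    // cell is now 15
    atomic_max(12, cell);                   // 12 < 15, so cell stays 15
    assert(atomic_xchg(7, cell) == 15);     // cell is now 7
    assert(atomic_cas(99, 1, cell) == 7);   // mismatch: cell stays 7
    assert(atomic_cas(7, 1, cell) == 7);    // match: cell becomes 1
    assert(cell.load() == 1);
}
```

Note that a caller of `atomic_cas` detects success by comparing the returned value with `expected`, which matches the `( expected new addr -- old )` stack effect.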
Float atomics
| Word | Stack effect | MLIR op | Description |
| --- | --- | --- | --- |
| ATOMIC-F+ | ( f addr -- ) | memref.atomic_rmw addf | Atomic float add |
| ATOMIC-FMAX | ( f addr -- ) | memref.atomic_rmw maximumf | Atomic float max |
| ATOMIC-FMIN | ( f addr -- ) | memref.atomic_rmw minimumf | Atomic float min |
Motivation
- Multi-block reductions: When a reduction spans more than one thread block, the output must be accumulated atomically (e.g., ATOMIC-F+ for partial sums, ATOMIC-FMAX for global max).
- Histogram / scatter patterns: Common GPU patterns where multiple threads update the same output location.
- Lock-free data structures: ATOMIC-CAS enables lock-free algorithms.
- Flash attention: Multi-block flash attention variants need atomic output accumulation.
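As a rough illustration of the histogram pattern, the following C++ sketch uses host threads as a stand-in for GPU threads that all update the same bins. Everything here (the `histogram` function, `kBins`, the strided work split) is hypothetical and only models what an ATOMIC+ on a shared bin would guarantee.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical host-side model: each std::thread stands in for a GPU thread.
constexpr std::size_t kBins = 4;

// Count how many non-negative values fall into each bin (value % kBins).
inline std::array<std::int64_t, kBins> histogram(const std::vector<int>& data,
                                                 unsigned workers) {
    if (workers == 0) workers = 1;

    std::array<std::atomic<std::int64_t>, kBins> bins;
    for (auto& b : bins) b.store(0, std::memory_order_relaxed);

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Strided split: worker w handles elements w, w + workers, ...
            for (std::size_t i = w; i < data.size(); i += workers) {
                const std::size_t bin = static_cast<std::size_t>(data[i]) % kBins;
                // Models "1 bin-addr ATOMIC+": no increment is lost even
                // when several workers hit the same bin at once.
                bins[bin].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) t.join();

    std::array<std::int64_t, kBins> out{};
    for (std::size_t i = 0; i < kBins; ++i) out[i] = bins[i].load();
    return out;
}

// Usage: the counts do not depend on how many workers run.
inline void demo() {
    const std::vector<int> data{0, 1, 1, 3, 3, 3, 2, 0};
    const auto counts = histogram(data, 3);
    assert(counts[0] == 2 && counts[1] == 2 && counts[2] == 1 && counts[3] == 3);
}
```

With a plain (non-atomic) increment in place of `fetch_add`, concurrent updates to the same bin could be lost, which is the failure mode these words are meant to rule out.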
Implementation notes
- Integer atomics: straightforward mapping to memref.atomic_rmw with the appropriate arith::AtomicRMWKind.
- Float atomics: values are i64 bit patterns on the stack, so bitcast to f64 before the atomic op. The address computation follows the same pattern as ! / F!.
- ATOMIC-CAS is more complex: it needs memref.generic_atomic_rmw with a comparison body, or a direct lowering to an LLVM cmpxchg.
- NVVM has native support for all of these via PTX atom.* instructions.
- Consider starting with just ATOMIC+ and ATOMIC-F+ as the minimum viable set.
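The bitcast step in the float notes can be modelled on the host as follows. This is a minimal C++ sketch under stated assumptions, not the proposed lowering: the helper names are hypothetical, the memory cell is assumed to hold the f64 as an i64 bit pattern, and the add is expressed as a compare-and-swap loop (the same shape as a read-modify-write region with a custom body) rather than a single native float atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reinterpret an i64 bit pattern as f64 and back,
// mirroring the bitcast described in the notes above.
inline double bits_to_f64(std::int64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

inline std::int64_t f64_to_bits(double d) {
    std::int64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Model of ATOMIC-F+ ( f addr -- ), where f arrives as an i64 bit pattern.
// Assumption: the cell stores the f64 as its i64 bit pattern.
inline void atomic_fadd_bits(std::int64_t f_bits, std::atomic<std::int64_t>& cell) {
    const double addend = bits_to_f64(f_bits);
    std::int64_t old_bits = cell.load(std::memory_order_relaxed);
    std::int64_t new_bits;
    do {
        // Recompute from the freshest value; on a failed exchange,
        // compare_exchange_weak has already reloaded old_bits.
        new_bits = f64_to_bits(bits_to_f64(old_bits) + addend);
    } while (!cell.compare_exchange_weak(old_bits, new_bits,
                                         std::memory_order_relaxed));
}

// Usage: 1.5 + 2.25 is exactly representable, so the check is exact.
inline void demo() {
    std::atomic<std::int64_t> cell{f64_to_bits(1.5)};
    atomic_fadd_bits(f64_to_bits(2.25), cell);
    assert(bits_to_f64(cell.load()) == 3.75);
}
```

The loop retries only when another thread changed the cell between the load and the exchange, so every addend is applied exactly once.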
Files to modify
include/warpforth/Dialect/Forth/ForthOps.td — Define new ops
lib/Translation/ForthToMLIR/ForthToMLIR.cpp — Parse words
lib/Conversion/ForthToMemRef/ForthToMemRef.cpp — Add conversion patterns
test/Translation/Forth/ — Parser tests
test/Conversion/ForthToMemRef/ — Conversion tests
Priority
Medium — needed for multi-block reductions and scatter patterns. Not required for single-block kernels.
Related