Add PTX vector memory intrinsics #4

Open
ilehtoranta wants to merge 1 commit into LostBeard:master from ilehtoranta:codex/ptx-vector-memory-intrinsics

Summary

Adds PTX-only vector memory intrinsics for explicit f32 vector load/store code generation.

This introduces:

  • PTXMemory.LoadF32x2 / StoreF32x2
  • PTXMemory.LoadF32x4 / StoreF32x4
  • Float2 and Float4 helper structs
  • intrinsic registration in the PTX algorithms context
  • aligned/vectorized ArrayView convenience helpers

The main use case is CUDA kernels that need predictable vector memory instructions instead of relying on backend inference from ordinary scalar or struct access patterns.

Details

The new PTX intrinsics generate explicit PTX vector memory operations:

  • ld.v2.f32
  • st.v2.f32
  • ld.v4.f32
  • st.v4.f32

For f32x4, ptxas can lower these to 128-bit global memory instructions such as LD.E.128 and ST.E.128 when alignment and addressing are suitable.
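The alignment condition can be checked on the host before choosing a vector path. The following is a minimal sketch, not code from this PR: `is_vector_aligned`, `split_for_vectors`, and `VectorSplit` are hypothetical names, and the sketch assumes the usual PTX rule that a vector access must be aligned to its total size (8 bytes for v2.f32, 16 bytes for v4.f32) and that the base address is already float-aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Bytes moved by one vector access of `lanes` f32 values (2 -> 8, 4 -> 16).
constexpr std::size_t vector_bytes(std::size_t lanes) {
    return lanes * sizeof(float);
}

// True when `addr` is aligned for a `lanes`-wide f32 vector access.
inline bool is_vector_aligned(std::uintptr_t addr, std::size_t lanes) {
    return addr % vector_bytes(lanes) == 0;
}

// Hypothetical result type: how many floats go through each path.
struct VectorSplit {
    std::size_t head;     // scalar floats before the first aligned address
    std::size_t vectors;  // full vector accesses
    std::size_t tail;     // scalar floats left after the last vector
};

// Splits `count` floats starting at `addr` into a scalar head, an aligned
// vector body, and a scalar tail. Assumes `addr` is float-aligned.
inline VectorSplit split_for_vectors(std::uintptr_t addr, std::size_t count,
                                     std::size_t lanes) {
    assert(addr % sizeof(float) == 0);
    const std::size_t bytes = vector_bytes(lanes);
    const std::size_t misalign = addr % bytes;
    // Floats to process one at a time until the next vector boundary.
    std::size_t head = misalign == 0 ? 0 : (bytes - misalign) / sizeof(float);
    if (head > count) head = count;
    const std::size_t remaining = count - head;
    return {head, remaining / lanes, remaining % lanes};
}
```

For example, ten floats starting one float past a 16-byte boundary would be handled as three scalar accesses, one four-wide vector access, and three more scalar accesses. Whether the vector accesses end up as 128-bit instructions is still up to ptxas.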

This is useful for performance-sensitive kernels that read or write adjacent float values, since each vector instruction moves two or four floats per access.
