Update QoLA/AITER by Micky774 · Pull Request #599 · ROCm/TransformerEngine

Micky774 · 2026-05-28T16:18:02Z

Description

Updates QoLA as well as moves up the pinned AITER commit

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-05-29T03:50:06Z

+    COMMAND sh -c
+      "tmp=$(mktemp /tmp/gitconfig.XXXXXX) || exit 1; \
+GIT_CONFIG_GLOBAL=$tmp git config --global --add safe.directory '*' >/dev/null 2>&1; \
+GIT_CONFIG_GLOBAL=$tmp PYTHONPATH=\"${__QOLA_DIR}:$PYTHONPATH\" '${Python_EXECUTABLE}' -m qola.cli checkout \


Wouldn't it make sense to integrate safe.directory overriding to qola? The pattern with dubious ownership is probably not TE specific

I worry that such behavior is a bit too authoritative for qola if that makes sense? My reasoning is that the permission scope here seems to be outside of qola and hence qola should not be the one in charge of it. I'm open to reconsidering that, but it's just my initial position.

It is valid concern bearing in mind QoLA is intended to be reused by different components. May be make this behavior controllable then. It is OK to keep the things in TE, it will just require doing things this overriding twice - here and when build is called - BTW, there are comments there but not actual code change

Sounds good, but can we do that in a separate PR to QoLA first?

ipanfilo · 2026-05-29T04:17:26Z

+# commit QoLA will actually check out and build, not whatever happens to be
+# the submodule's current HEAD at configure time.
+set(__QOLA_MANIFEST "${CMAKE_CURRENT_LIST_DIR}/qola_manifest.toml")
+set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${__QOLA_MANIFEST}")


AITER_SHA is not cached variable. Why is CMAKE_CONFIGURE_DEPENDS needed?

I mainly have it so that direct cmake or ninja building will catch on manifest changes since I use it for incremental builds (and am thinking about including an incremental build option for TE's AITER FA backend). It's not strictly necessary, just a nice to have in such a workflow

Will it also trigger re-executing of QoLA cli in the parent (ck_used_attn/ ) CMAkeLists.txt?

ipanfilo · 2026-05-29T04:29:29Z

+  if(NOT AITER_CHECKOUT_RESULT EQUAL 0)
+    message(FATAL_ERROR
+            "Failed to sync AITER source tree at ${__AITER_SOURCE_DIR} to "
+            "manifest-pinned commit ${AITER_SHA}.\n"


Should it also validate that actual commit matches one detected by prebuilt.cmake? If QoLA checkout AITER unconditionally, may be keep prebuilt.cmake as-is and where it is now? I.e. QoLA fetches AITER, prebuit.cmake checks for git commit as before. It will only loose AITER_SHA value in this error message

How about we add that explicit check for check-out commit = manifest commit but keep the current changes so we can explicitly track and use the manifest commit as a single source of truth for the build?

wangye805 · 2026-06-01T16:03:34Z

+      return nullptr;
+    }
+    void* ptr = nullptr;
+    if(hipMallocAsync(&ptr, bytes, stream) != hipSuccess){


Emm, if we let hip runtime to allocate and manage our buffers, this will create a series of issues. For example, if Pytorch or JAX users pre-allocate 97% of HBM, then our hipMallocAsync will return out of memory.

If what new aiter needs for fmha_args.workspace_alloc is a lambda, we can fake it to give jax/pytorch generated workspace buffer ptr?

wangye805 · 2026-06-01T16:04:18Z

+  // callback thread, which holds runtime locks — calling any HIP API from it
+  // (including hipHostFree) deadlocks against concurrent main-thread HIP
+  // calls. Defer the free to ck_tile::pinned_host_releaser's worker thread.
+  fmha_args.pinned_host_alloc = [](size_t bytes) -> std::shared_ptr<void> {


Emm, why do they need host memory allocated? Is it for inference?

Again, can we fake the lambda to use pytorch/jax generated workspace?

VeeraRajasekhar

Verified that ROCm/rocm-libraries#6764 exists in the current pinned aiter commit.

AllenFarcas · 2026-06-16T21:52:57Z

+  size_t head_dim_v,
+  int64_t window_size_left,
+  int64_t window_size_right,
+  bool is_training, bool cuda_graph) {


From my understanding, is_ck_backend_supported is the only consumer of the new graph-capture gate, but it's called without deterministic (TransformerEngine\transformer_engine\common\fused_attn_rocm\fused_attn.cpp line 320) even though nvte_get_fused_attn_backend has it in scope (TransformerEngine\transformer_engine\common\fused_attn_rocm\fused_attn.cpp line 282).

Since deterministic bwd forces the non-graph-safe CK v2 path, is_ck_bwd_graph_capture_safe cannot make the right call without it. See my other related comments above for what could be added to fix.

You should include here the bool deterministic as well

AllenFarcas · 2026-06-16T21:53:02Z

+         is_v3_arch &&
+         dropout == 0.f &&
+         bias_type == NVTE_Bias_Type::NVTE_NO_BIAS &&
+         max_seqlen_q >= 16;


Add && !deterministic

AllenFarcas · 2026-06-16T21:53:06Z

+  // unaffected). Determinism also forces v2 but is invisible here, so it is handled
+  // on the framework side.
+  if(is_training && cuda_graph &&
+     !is_ck_bwd_graph_capture_safe(bias_type, dropout, max_seqlen_q)){


You should include here the deterministic as well

AllenFarcas · 2026-06-16T21:53:12Z

+static bool is_ck_bwd_graph_capture_safe(
+  NVTE_Bias_Type bias_type,
+  float dropout,
+  size_t max_seqlen_q) {


Add bool deterministic

AllenFarcas · 2026-06-16T21:53:15Z

        window_size_left,
-        window_size_right)){
+        window_size_right,
+        is_training, cuda_graph)){


Pass here the deterministic

AllenFarcas · 2026-06-16T21:59:43Z

+// launcher (fmha_bwd / prepare_workspace_async), which schedules self-deleting
+// hipLaunchHostFunc nodes that double-free on graph replay. Only the v3 asm bwd
+// path is HIP-graph-replay-safe. Mirrors AITER's fmha_v3_bwd gate (mha_bwd.cu)
+// for the conditions visible at backend-selection time; determinism is applied


My understanding is that this comment (and two more below) justify the split design by claiming the C++ selector can't see determinism. But deterministic is literally a parameter of the nvte_get_fused_attn_backend and both frameworks pass it.

The flag isn't invisible, the code just chooses not to forward it one level down, which should be fixed by my previous comment. Leaving these comments as-is will mislead the next person into thinking the Python guard is a hardware/architectural necessity rather than a fixable shortcut.

AllenFarcas · 2026-06-16T22:00:00Z

+  // path; a config that would fall back to the CK v2 launcher is not graph-safe.
+  // Reject such graph-captured training configs so selection falls through to a
+  // graph-safe backend (the v2 host-pack hazard is backward-only, so inference is
+  // unaffected). Determinism also forces v2 but is invisible here, so it is handled


Similar comment here

AllenFarcas · 2026-06-16T22:00:16Z

+    # graph capture this corner is invisible to the C++ backend selector (determinism
+    # is not part of its signature), so fall back to the unfused backend here.


Similar comment here

Micky774 added 11 commits April 28, 2026 17:00

Updated QoLA (to port CK receipt patch) and TE manifest

d89f90f

Updated manifest

312212f

Corrected AITER mha args validation against pinned commit

89f6983

Updated cmake w/ dubious ownership protection

c417e40

Merge branch 'dev' into zain/qola/aiter-update

90e3a5a

Corrected logging

c7ecaf7

Updated qola to build aiter w/ new third_party spec

da6e9a6

Added guards against AITER known buggy implementations

e7ed124

Updated build

6241f99

Drop guard for corrected bug

82ebdb6

Merge branch 'dev' into zain/qola/aiter-update

68a2eaf

Micky774 requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 28, 2026 16:18

Micky774 added the ci-level 3 CI test level 3 label May 28, 2026

Update AITER commit, adopt new API

f9ab59c

ipanfilo reviewed May 29, 2026

View reviewed changes

Micky774 requested review from AllenFarcas and VeeraRajasekhar May 29, 2026 18:26

wangye805 requested changes Jun 1, 2026

View reviewed changes

Micky774 added 2 commits June 5, 2026 19:21

Added guard against graph-unsafe CK V2 kernels

fec80b8

Merge branch 'dev' into zain/qola/aiter-update

6eb3b23

VeeraRajasekhar reviewed Jun 9, 2026

View reviewed changes

Micky774 added 2 commits June 16, 2026 16:20

Updated with AOT memory handling (dynamic on host)

8fd79f9

Merge branch 'dev' into zain/qola/aiter-update

ab68ca9

AllenFarcas reviewed Jun 16, 2026

View reviewed changes

		# graph capture this corner is invisible to the C++ backend selector (determinism
		# is not part of its signature), so fall back to the unfused backend here.

Conversation

Micky774 commented May 28, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VeeraRajasekhar left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants