Update QoLA/AITER #599
Conversation
| COMMAND sh -c | ||
| "tmp=$(mktemp /tmp/gitconfig.XXXXXX) || exit 1; \ | ||
| GIT_CONFIG_GLOBAL=$tmp git config --global --add safe.directory '*' >/dev/null 2>&1; \ | ||
| GIT_CONFIG_GLOBAL=$tmp PYTHONPATH=\"${__QOLA_DIR}:$PYTHONPATH\" '${Python_EXECUTABLE}' -m qola.cli checkout \ |
There was a problem hiding this comment.
Wouldn't it make sense to integrate safe.directory overriding to qola? The pattern with dubious ownership is probably not TE specific
There was a problem hiding this comment.
I worry that such behavior is a bit too authoritative for qola if that makes sense? My reasoning is that the permission scope here seems to be outside of qola and hence qola should not be the one in charge of it. I'm open to reconsidering that, but it's just my initial position.
There was a problem hiding this comment.
It is valid concern bearing in mind QoLA is intended to be reused by different components. May be make this behavior controllable then. It is OK to keep the things in TE, it will just require doing things this overriding twice - here and when build is called - BTW, there are comments there but not actual code change
There was a problem hiding this comment.
Sounds good, but can we do that in a separate PR to QoLA first?
| # commit QoLA will actually check out and build, not whatever happens to be | ||
| # the submodule's current HEAD at configure time. | ||
| set(__QOLA_MANIFEST "${CMAKE_CURRENT_LIST_DIR}/qola_manifest.toml") | ||
| set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${__QOLA_MANIFEST}") |
There was a problem hiding this comment.
AITER_SHA is not cached variable. Why is CMAKE_CONFIGURE_DEPENDS needed?
There was a problem hiding this comment.
I mainly have it so that direct cmake or ninja building will catch on manifest changes since I use it for incremental builds (and am thinking about including an incremental build option for TE's AITER FA backend). It's not strictly necessary, just a nice to have in such a workflow
There was a problem hiding this comment.
Will it also trigger re-executing of QoLA cli in the parent (ck_used_attn/ ) CMAkeLists.txt?
| if(NOT AITER_CHECKOUT_RESULT EQUAL 0) | ||
| message(FATAL_ERROR | ||
| "Failed to sync AITER source tree at ${__AITER_SOURCE_DIR} to " | ||
| "manifest-pinned commit ${AITER_SHA}.\n" |
There was a problem hiding this comment.
Should it also validate that actual commit matches one detected by prebuilt.cmake? If QoLA checkout AITER unconditionally, may be keep prebuilt.cmake as-is and where it is now? I.e. QoLA fetches AITER, prebuit.cmake checks for git commit as before. It will only loose AITER_SHA value in this error message
There was a problem hiding this comment.
How about we add that explicit check for check-out commit = manifest commit but keep the current changes so we can explicitly track and use the manifest commit as a single source of truth for the build?
| return nullptr; | ||
| } | ||
| void* ptr = nullptr; | ||
| if(hipMallocAsync(&ptr, bytes, stream) != hipSuccess){ |
There was a problem hiding this comment.
Emm, if we let hip runtime to allocate and manage our buffers, this will create a series of issues. For example, if Pytorch or JAX users pre-allocate 97% of HBM, then our hipMallocAsync will return out of memory.
If what new aiter needs for fmha_args.workspace_alloc is a lambda, we can fake it to give jax/pytorch generated workspace buffer ptr?
| // callback thread, which holds runtime locks — calling any HIP API from it | ||
| // (including hipHostFree) deadlocks against concurrent main-thread HIP | ||
| // calls. Defer the free to ck_tile::pinned_host_releaser's worker thread. | ||
| fmha_args.pinned_host_alloc = [](size_t bytes) -> std::shared_ptr<void> { |
There was a problem hiding this comment.
Emm, why do they need host memory allocated? Is it for inference?
Again, can we fake the lambda to use pytorch/jax generated workspace?
VeeraRajasekhar
left a comment
There was a problem hiding this comment.
Verified that ROCm/rocm-libraries#6764 exists in the current pinned aiter commit.
| size_t head_dim_v, | ||
| int64_t window_size_left, | ||
| int64_t window_size_right, | ||
| bool is_training, bool cuda_graph) { |
There was a problem hiding this comment.
From my understanding, is_ck_backend_supported is the only consumer of the new graph-capture gate, but it's called without deterministic (TransformerEngine\transformer_engine\common\fused_attn_rocm\fused_attn.cpp line 320) even though nvte_get_fused_attn_backend has it in scope (TransformerEngine\transformer_engine\common\fused_attn_rocm\fused_attn.cpp line 282).
Since deterministic bwd forces the non-graph-safe CK v2 path, is_ck_bwd_graph_capture_safe cannot make the right call without it. See my other related comments above for what could be added to fix.
You should include here the bool deterministic as well
| is_v3_arch && | ||
| dropout == 0.f && | ||
| bias_type == NVTE_Bias_Type::NVTE_NO_BIAS && | ||
| max_seqlen_q >= 16; |
| // unaffected). Determinism also forces v2 but is invisible here, so it is handled | ||
| // on the framework side. | ||
| if(is_training && cuda_graph && | ||
| !is_ck_bwd_graph_capture_safe(bias_type, dropout, max_seqlen_q)){ |
There was a problem hiding this comment.
You should include here the deterministic as well
| static bool is_ck_bwd_graph_capture_safe( | ||
| NVTE_Bias_Type bias_type, | ||
| float dropout, | ||
| size_t max_seqlen_q) { |
There was a problem hiding this comment.
Add bool deterministic
| window_size_left, | ||
| window_size_right)){ | ||
| window_size_right, | ||
| is_training, cuda_graph)){ |
There was a problem hiding this comment.
Pass here the deterministic
| // launcher (fmha_bwd / prepare_workspace_async), which schedules self-deleting | ||
| // hipLaunchHostFunc nodes that double-free on graph replay. Only the v3 asm bwd | ||
| // path is HIP-graph-replay-safe. Mirrors AITER's fmha_v3_bwd gate (mha_bwd.cu) | ||
| // for the conditions visible at backend-selection time; determinism is applied |
There was a problem hiding this comment.
My understanding is that this comment (and two more below) justify the split design by claiming the C++ selector can't see determinism. But deterministic is literally a parameter of the nvte_get_fused_attn_backend and both frameworks pass it.
The flag isn't invisible, the code just chooses not to forward it one level down, which should be fixed by my previous comment. Leaving these comments as-is will mislead the next person into thinking the Python guard is a hardware/architectural necessity rather than a fixable shortcut.
| // path; a config that would fall back to the CK v2 launcher is not graph-safe. | ||
| // Reject such graph-captured training configs so selection falls through to a | ||
| // graph-safe backend (the v2 host-pack hazard is backward-only, so inference is | ||
| // unaffected). Determinism also forces v2 but is invisible here, so it is handled |
| # graph capture this corner is invisible to the C++ backend selector (determinism | ||
| # is not part of its signature), so fall back to the unfused backend here. |
Description
Updates QoLA as well as moves up the pinned AITER commit
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: