fix: bump DLIO pin for storage #391 (sudo prompt) + #448 (per-node memory budget) #462
Merged
…mory budget)

Pin moves from 60fd3b8 to 814f3ff (FileSystemGuy-combined-391-448 in mlcommons/DLIO_local_changes), which adds two fixes on top of the existing pin:

- DLIO PR #23 (storage #391 follow-up): per-epoch page-cache flush uses `sudo -n` instead of interactive sudo, and disables itself after the first failure with a one-line remediation warning. Stops the silent 30s-per-epoch stall (and the hours-long hang the storage #391 reporter saw) on hosts without NOPASSWD sudo.
- DLIO PR #24 (storage #448): worker-memory budget guard now scopes to per-node (read_threads x ranks_per_node), uses psutil.virtual_memory().available with a 90% safety margin instead of .total, and reports hostname + local_ranks in the error/warning text. Removes the false-positive that rejected valid multi-node configurations and collapsed max_threads to 0 at ~100 nodes.

Once both DLIO PRs merge into main, revert the pin to branch = "main" and delete the combined branch.

Refs #391, #448
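For readers who want to see the shape of the page-cache flush fix, here is a minimal sketch of the "non-interactive sudo, disable after first failure" pattern. It is not the code from DLIO PR #23: the class name `PageCacheFlusher`, the injectable `runner` parameter, the 10-second timeout, the exact shell command, and the warning text are all assumptions made for illustration.

```python
import logging
import subprocess

logger = logging.getLogger("page_cache_flush_sketch")

# `sudo -n` never prompts: if a password would be required, sudo exits non-zero.
# The shell command itself is an assumption about how the flush is issued.
FLUSH_CMD = ["sudo", "-n", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"]


class PageCacheFlusher:
    """Hypothetical helper, not DLIO's actual class."""

    def __init__(self, runner=subprocess.run, timeout_s=10):
        self._runner = runner        # injectable so the logic can be exercised without sudo
        self._timeout_s = timeout_s  # arbitrary value chosen for this sketch
        self.enabled = True

    def flush(self):
        """Try one flush; return True on success, False if skipped or failed."""
        if not self.enabled:
            return False
        try:
            result = self._runner(
                FLUSH_CMD,
                stdin=subprocess.DEVNULL,   # nothing can block waiting on a terminal
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=self._timeout_s,
            )
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False
        if not ok:
            # First failure turns the flush off, so later epochs do not stall again.
            self.enabled = False
            logger.warning(
                "Page-cache flush failed; disabling it for the rest of the run. "
                "Grant NOPASSWD sudo for the flush command to re-enable it."
            )
        return ok
```

Calling `flush()` once per epoch then costs at most one failed attempt on hosts without NOPASSWD sudo, and the single warning tells the operator what to change.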
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
russfellows
approved these changes
Jun 17, 2026
russfellows
left a comment
Contributor
Approving per Curtis.
Resolves #448.
Also resolves the DLIO half of #391 (the silent 30s/epoch stall caused by interactive `sudo` in `drop_caches`).

Summary

Bumps the DLIO pin from `60fd3b8` to `814f3ff`, which carries two new upstream fixes:

- DLIO PR #23: the per-epoch page-cache flush uses `sudo -n` (non-interactive) and disables itself after the first failure with a single explanatory warning. Fixes the interactive-sudo prompt that the reporter of #391 ("Training hangs when using file system") hit: the rich progress bar overwrote the prompt, the user couldn't see or respond to it, and the run hung silently for 16 hours.
- DLIO PR #24: the worker-memory budget guard scopes to per-node (`read_threads × ranks_per_node()`), uses `psutil.virtual_memory().available` with a 90% safety margin instead of `.total`, and reports hostname + `local_ranks` in the error text. Fixes the false positive that rejected valid multi-node configs and collapsed `max_threads` to 0 at ~100 nodes (#448, "Worker-memory budget check compares per-node RAM against global (comm_size-wide) worker count → false positives at scale").

The intermediate pin point is a combined branch (`FileSystemGuy-combined-391-448`) in DLIO_local_changes so we get both fixes from a single commit while the upstream PRs land. Once they merge, this repo can be re-pinned to `branch = "main"` and the combined branch can be deleted.

What is NOT in this PR

- … `stdin=DEVNULL` on `CommandExecutor.execute`) are still useful as defense in depth but are not strictly required once the DLIO fix lands. Left for a separate decision.
- … `mpirun` against the changed code).

Test plan

- `uv lock` resolves the new SHA cleanly.
- `uv sync` installs the bumped DLIO without resolver pain.
- `uv run pytest tests/unit -q` → 1362 passed, 4 skipped. No regression versus the prior pin.
- … #448 4-node × 12-ranks × `read_threads=16` config on 256 GiB nodes and confirm the memory-budget guard no longer falsely rejects.
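
To make the per-node versus global distinction concrete, the sketch below applies a 90% budget check to the 4-node × 12-ranks × `read_threads=16` shape from the test plan. It is an illustration only, not the guard from DLIO PR #24: the function name `check_worker_memory`, the 1 GiB per-worker footprint, and the assumption that all 256 GiB are reported as available are hypothetical values chosen so the arithmetic is easy to follow.

```python
import socket

GIB = 2**30
SAFETY_MARGIN = 0.90  # fraction of available memory the workers may use


def check_worker_memory(read_threads, ranks_per_node, per_worker_bytes,
                        available_bytes=None):
    """Return the per-node worker footprint in bytes, or raise MemoryError.

    Hypothetical helper for illustration; per_worker_bytes is an assumed input.
    """
    if available_bytes is None:
        import psutil  # third-party; only needed when probing the live host
        available_bytes = psutil.virtual_memory().available
    workers_on_node = read_threads * ranks_per_node
    needed = workers_on_node * per_worker_bytes
    budget = int(available_bytes * SAFETY_MARGIN)
    if needed > budget:
        raise MemoryError(
            f"{socket.gethostname()}: {workers_on_node} workers "
            f"({read_threads} read_threads x {ranks_per_node} local_ranks) need "
            f"{needed / GIB:.1f} GiB, budget is {budget / GIB:.1f} GiB"
        )
    return needed


# Per-node scope: 16 threads x 12 local ranks = 192 workers, 192 GiB <= 230.4 GiB.
per_node = check_worker_memory(16, 12, GIB, available_bytes=256 * GIB)
print(per_node // GIB)  # 192

# Global scope (the old behaviour described in #448): 16 x 48 ranks = 768 workers.
try:
    check_worker_memory(16, 4 * 12, GIB, available_bytes=256 * GIB)
except MemoryError as exc:
    print("global worker count rejected:", exc)
```

Under these assumed numbers the per-node count fits inside the budget while the cluster-wide count does not, which is the shape of false positive that the bumped pin is meant to remove.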