fix: bump DLIO pin for storage #391 (sudo prompt) + #448 (per-node memory budget) #462

Merged
russfellows merged 2 commits into main from FileSystemGuy-issue448-pin-dlio
Jun 17, 2026

@FileSystemGuy

Contributor

Resolves #448.
Also resolves the DLIO half of #391 (the silent 30s/epoch stall caused by interactive sudo in drop_caches).

Summary

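Bumps the DLIO pin from 60fd3b8 to 814f3ff, which carries two new upstream fixes: DLIO PR #23 (the `sudo -n` page-cache flush, for storage #391) and DLIO PR #24 (the per-node worker-memory budget guard, for storage #448).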

The intermediate pin point is a combined branch (FileSystemGuy-combined-391-448) in DLIO_local_changes so we get both fixes from a single commit while the upstream PRs land. Once they merge, this repo can be re-pinned to branch = "main" and the combined branch can be deleted.

What is NOT in this PR

  • The mlpstorage-side workarounds for #391 ("Training hangs when using file system"), e.g. stdin=DEVNULL on CommandExecutor.execute, are still useful as defense in depth but are not strictly required once the DLIO fix lands. Left for a separate decision.
  • Tests for the new DLIO behavior would normally live in the DLIO PRs themselves, but neither PR added a unit test; both behaviors are easier to verify with a real mpirun against the changed code.
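
For reference, the stdin=DEVNULL workaround mentioned in the first bullet can be sketched as below. This is a minimal illustration, not mlpstorage's actual CommandExecutor: `run_non_interactive` is a hypothetical helper name, and the only point it makes is that a child process whose stdin is `/dev/null` sees end-of-file immediately and so cannot block waiting for terminal input.

```python
import subprocess
import sys


def run_non_interactive(cmd, timeout=None):
    """Run `cmd` with stdin detached (hypothetical helper, for illustration only).

    With stdin=subprocess.DEVNULL, any attempt by the child to read from
    stdin returns end-of-file immediately instead of blocking.
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # never inherit the caller's stdin
        capture_output=True,       # collect stdout/stderr for logging
        text=True,
        timeout=timeout,
        check=False,               # let the caller inspect returncode
    )


if __name__ == "__main__":
    # The child tries to read all of stdin; it gets EOF right away.
    demo = run_non_interactive(
        [sys.executable, "-c", "import sys; print(repr(sys.stdin.read()))"]
    )
    print(demo.returncode, demo.stdout.strip())
```

Detaching stdin does not by itself make a privileged command succeed; it only changes what a child sees when it tries to read from stdin, which is why the bullet describes it as defense in depth rather than the fix.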

Test plan

…mory budget)

Pin moves from 60fd3b8 to 814f3ff (FileSystemGuy-combined-391-448 in
mlcommons/DLIO_local_changes), which adds two fixes on top of the
existing pin:

- DLIO PR #23 (storage #391 follow-up): per-epoch page-cache flush
  uses `sudo -n` instead of interactive sudo, and disables itself
  after the first failure with a one-line remediation warning. Stops
  the silent 30s-per-epoch stall (and the hours-long hang the
  storage #391 reporter saw) on hosts without NOPASSWD sudo.

- DLIO PR #24 (storage #448): worker-memory budget guard now scopes
  to per-node (read_threads x ranks_per_node), uses
  psutil.virtual_memory().available with a 90% safety margin instead
  of .total, and reports hostname + local_ranks in the error/warning
  text. Removes the false-positive that rejected valid multi-node
  configurations and collapsed max_threads to 0 at ~100 nodes.
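
A rough sketch of the per-node check described in the bullet above, under stated assumptions. It is not DLIO's implementation: the function name, the `bytes_per_worker` parameter, and the returned tuple are hypothetical. The available-memory figure is passed in as a plain number so the sketch runs without extra packages; per the commit message, the real guard reads it from `psutil.virtual_memory().available`.

```python
import socket


def check_worker_memory(read_threads, ranks_per_node, bytes_per_worker,
                        available_bytes, safety_margin=0.90, hostname=None):
    """Per-node worker-memory guard (illustrative sketch only).

    available_bytes: memory currently available on THIS node, e.g. the value
        of psutil.virtual_memory().available (passed in, not read here).
    bytes_per_worker: assumed per-worker footprint (hypothetical parameter).
    """
    hostname = hostname or socket.gethostname()
    # Scope the worker count to this node, not to the whole communicator.
    workers_on_node = read_threads * ranks_per_node
    budget = int(available_bytes * safety_margin)
    required = workers_on_node * bytes_per_worker
    # Largest read_threads value that still fits the node budget.
    max_threads = budget // (bytes_per_worker * ranks_per_node)
    ok = required <= budget
    message = (
        f"{hostname}: local_ranks={ranks_per_node}, read_threads={read_threads}, "
        f"required={required} B, budget={budget} B, max_threads={max_threads}"
    )
    return ok, max_threads, message


if __name__ == "__main__":
    GiB = 1024 ** 3
    print(check_worker_memory(4, 8, 1 * GiB, 64 * GiB, hostname="node01"))
```

The false positive the bullet refers to comes from multiplying `read_threads` by the global rank count instead of `ranks_per_node`: the required figure then grows with the number of nodes while the budget stays that of a single node.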

Once both DLIO PRs merge into main, revert the pin to branch = "main"
and delete the combined branch.

Refs #391, #448
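
The `sudo -n` behavior from the first bullet of the commit message above can be sketched roughly as follows. This is an assumption-laden illustration, not the DLIO code: the `CacheDropper` class, its injectable `runner`, the warning text, and the exact shell command are all hypothetical. The one grounded detail is `sudo -n`, which makes sudo exit with an error instead of prompting when a password would be required.

```python
import subprocess


class CacheDropper:
    """Flush the page cache without ever prompting (hypothetical helper)."""

    def __init__(self, runner=subprocess.run):
        self._runner = runner   # injectable so the logic can be tested without sudo
        self.enabled = True
        self.warnings = []

    def drop(self):
        """Try one flush; after the first failure, become a no-op."""
        if not self.enabled:
            return False
        try:
            result = self._runner(
                # -n: non-interactive; sudo fails fast instead of asking for a password
                ["sudo", "-n", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"],
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=30,
                check=False,
            )
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False
        if not ok:
            # Disable further attempts and record a single remediation hint.
            self.enabled = False
            self.warnings.append(
                "page-cache flush disabled: passwordless (NOPASSWD) sudo is not available"
            )
        return ok
```

Calling `drop()` once per epoch then costs at most one failed attempt on hosts without NOPASSWD sudo, rather than a stall on every epoch.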
@FileSystemGuy FileSystemGuy requested a review from a team June 16, 2026 21:57
@github-actions

github-actions Bot commented Jun 16, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows russfellows left a comment

Contributor

Approving per Curtis.

@russfellows russfellows merged commit 3176c92 into main Jun 17, 2026
3 checks passed
@russfellows russfellows deleted the FileSystemGuy-issue448-pin-dlio branch June 17, 2026 16:28
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 17, 2026

Development

Successfully merging this pull request may close these issues.

Worker-memory budget check compares per-node RAM against global (comm_size-wide) worker count → false positives at scale

2 participants