Skip to content

fix(config): scope worker-memory budget to per-node#24

Merged
russfellows merged 1 commit into
mainfrom
FileSystemGuy-issue448-per-node-memory-budget
Jun 17, 2026
Merged

fix(config): scope worker-memory budget to per-node#24
russfellows merged 1 commit into
mainfrom
FileSystemGuy-issue448-per-node-memory-budget

Conversation

@FileSystemGuy

Copy link
Copy Markdown

Addresses mlcommons/storage#448.

Problem

The memory-budget guard in ConfigArguments.validate (dlio_benchmark/utils/config.py:391-418) compares a global worker count against one node's RAM:

total_workers = self.read_threads * self.comm_size                # global
BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024)        # local
estimated_mb = per_worker_mb * total_workers
if estimated_mb > BUDGET_MB:
    max_threads = BUDGET_MB // per_worker_mb // max(1, self.comm_size)

Operands are in incompatible units, so valid multi-node configurations are rejected. The suggested max_threads then collapses to 0 at ~100 nodes — the user is left with no valid setting.

Reporter's table:

Nodes comm_size "estimated" Actual per-node need Old verdict Reality
1 12 96 GB 96 GB passes fits
4 48 384 GB 96 GB rejects ("max_threads=10") fits
100 1,200 9.6 TB 96 GB rejects ("max_threads=0") fits

The 50%-of-available warning further down has the same bug.

Fix

  • Worker count is per-node: read_threads × DLIOMPI.ranks_per_node(). ranks_per_node() already exists in utils/utility.py:309 and is already used by the auto-sizing path at config.py:774 — this PR aligns the memory check with that established pattern. In CHILD_INITIALIZED state it conservatively falls back to comm_size.
  • Budget basis is psutil.virtual_memory().available with a 90% safety margin so already-used RAM is respected. The previous .total basis blocked large machines for fresh RAM that was never going to be there.
  • Error and warning text now name the host, report local_ranks instead of comm_size, and the max_threads suggestion is derived from per-node arithmetic.

Worked example from the issue: 100 nodes × 12 ranks × read_threads=16, 256 GiB/node — per-node demand is 96 GB (16 × 12 × 0.5), well under budget. Old code computes 9.6 TB and rejects. New code computes 96 GB and passes.

Test plan

  • Manual: 1-node config that exceeds local RAM still raises with the new message.
  • Manual: 4-node × 12-ranks × read_threads=16 on 256 GiB nodes runs end-to-end with no false rejection.
  • Manual: confirm warning text reads "On host X: ... local_ranks=N = M workers ... on this node".

The memory-budget guard in ConfigArguments.validate compared a global
worker count (read_threads × comm_size) against one node's RAM
(psutil.virtual_memory().total). Operands in incompatible units, so
valid multi-node configurations were rejected with a misleading
"reduce read_threads to N" suggestion where N collapsed to 0 at
~100 nodes — leaving the user with no valid setting.

Changes:

- Worker count is now per-node: read_threads × DLIOMPI.ranks_per_node().
  ranks_per_node() already exists (used by the auto-sizing path,
  config.py:774) and falls back to comm_size in CHILD_INITIALIZED state,
  matching the conservative behavior elsewhere in this file.
- Budget basis is now psutil.virtual_memory().available with a 90%
  safety margin (so already-used RAM is respected). The previous
  .total basis blocked large machines for fresh RAM that was never
  going to be there.
- Error and warning messages now name the host, report local_ranks
  instead of comm_size, and the max_threads suggestion is derived
  from per-node arithmetic.
- The 50% available-RAM warning gets the same per-node treatment.

Worked example from the issue: 100 nodes × 12 ranks × read_threads=16,
256 GiB/node — per-node demand is 96 GB (16 × 12 × 0.5), well under
budget. Old code computed 9.6 TB against 256 GB and rejected. New
code computes 96 GB against 256 GB and passes.

Refs mlcommons/storage#448
@FileSystemGuy FileSystemGuy requested a review from a team June 16, 2026 21:52
@FileSystemGuy

Copy link
Copy Markdown
Author

@russfellows @idevasena
Don't think too deeply about this, unless you want to, just approve it. :-)

@russfellows russfellows left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curtis OK'd.

@russfellows russfellows merged commit 1d11f98 into main Jun 17, 2026
7 checks passed
@russfellows russfellows deleted the FileSystemGuy-issue448-per-node-memory-budget branch June 17, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants