fix(main): use sudo -n for page-cache flush; warn once on failure #23

Merged
russfellows merged 1 commit into main from FileSystemGuy-issue391-sudo-noninteractive
Jun 17, 2026
@FileSystemGuy

Addresses the DLIO side of mlcommons/storage#391.

Problem

`DLIOBenchmark._train()` calls `subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True, timeout=30)` at the start of every epoch (dlio_benchmark/main.py:451-458). On a host without NOPASSWD sudo, that call:

  • Inherits the controlling terminal from mpirun, so sudo prompts for a password the user can't see (in the storage repro, the rich progress bar overwrote the prompt line).
  • Times out after 30s and falls through to the bare `except Exception: pass`, which swallows the timeout silently.
  • Re-runs every epoch, so a 5-epoch unet3d run spends ~2.5 minutes (5 × 30s) stalled, with each stall landing before that epoch's I/O begins.
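
The silent stall can be reproduced without sudo at all. The sketch below is a stand-in, not the DLIO code: it assumes a sleeping child process in place of the blocked `sudo` prompt and a 0.2s timeout in place of the 30s one. The names `flush_like_before` and `stalled_seconds` are hypothetical.

```python
import subprocess
import sys
import time

# Stand-in for a sudo process blocked at a password prompt: a child that sleeps.
BLOCKING_CMD = [sys.executable, "-c", "import time; time.sleep(5)"]


def flush_like_before(timeout_s: float) -> None:
    """Hypothetical helper mirroring the old pattern: run, then swallow any error."""
    try:
        subprocess.run(BLOCKING_CMD, check=True, timeout=timeout_s)
    except Exception:
        # subprocess.TimeoutExpired lands here and is discarded without a log line.
        pass


def stalled_seconds(epochs: int, timeout_s: float) -> float:
    """Total wall-clock time spent in the swallowed timeouts across all epochs."""
    start = time.monotonic()
    for _ in range(epochs):
        flush_like_before(timeout_s)
    return time.monotonic() - start
```

Each call costs at least one full timeout and leaves no trace in the log, so with the real values five epochs at 30s each add up to the 2.5 minutes quoted above.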

The mlcommons/storage#391 reporter let a single run sit at the prompt for ~16 hours before killing it. (The downstream PyTorch DataLoader fork-after-MPI_Init hang they hit is being addressed separately on the storage side; that's not in scope here.)

Fix

  • Use sudo -n so sudo exits non-zero immediately when password input is required, instead of prompting.
  • Redirect stdin/stdout/stderr explicitly so nothing leaks to the user's terminal and stderr is captured for the warning.
  • A drop_caches_disabled flag set on the first failure prevents subsequent attempts and emits a single explanatory warning telling the user how to enable passwordless sudo for this exact command. No log spam, no per-epoch stalls.
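
The three points above can be sketched roughly as follows. This is an illustration under assumptions, not the merged diff: only the `sudo -n` invocation and the `drop_caches_disabled` flag name come from this PR, while the `PageCacheFlusher` class, its injectable `cmd` argument, and the warning text are hypothetical.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# -n makes sudo exit non-zero instead of prompting when a password is needed.
DROP_CACHES_CMD = ["sudo", "-n", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"]


class PageCacheFlusher:
    """Hypothetical wrapper; only the flag name drop_caches_disabled is from the PR."""

    def __init__(self, cmd=DROP_CACHES_CMD, timeout_s=30):
        self.cmd = list(cmd)
        self.timeout_s = timeout_s
        self.drop_caches_disabled = False

    def flush(self) -> bool:
        """Try to drop the page cache; return True if the command succeeded."""
        if self.drop_caches_disabled:
            return False  # an earlier attempt failed, so do not retry or warn again
        try:
            subprocess.run(
                self.cmd,
                check=True,
                timeout=self.timeout_s,
                stdin=subprocess.DEVNULL,   # nothing to read a password from
                stdout=subprocess.DEVNULL,  # keep output off the user's terminal
                stderr=subprocess.PIPE,     # captured for the warning below
            )
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError) as exc:
            self.drop_caches_disabled = True
            detail = getattr(exc, "stderr", None) or b""
            # Assumed wording; the real warning text is not shown in this PR page.
            logger.warning(
                "Page-cache flush failed (%s); skipping it for the rest of the run. "
                "Configure passwordless sudo for this command to enable it.",
                detail.decode(errors="replace").strip() or exc,
            )
            return False
```

Calling `flush()` once per epoch on one shared instance gives at most one warning per run. Catching specific exceptions instead of a bare `except Exception: pass` means other, unexpected errors still surface.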

Hosts with NOPASSWD sudo (the supported path) see identical behavior.

Test plan

  • Manual: on a host without NOPASSWD sudo, confirm the run no longer pauses at "Starting epoch 1" and the one-line warning appears in the log.
  • Manual: on a host whose sudoers grants NOPASSWD for the drop_caches command (e.g. `USER ALL=(root) NOPASSWD: /bin/sh`), confirm the flush still happens and the warning never fires.
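
Before running either manual check, it may help to confirm which kind of host you are on. The helper below is a hypothetical pre-flight probe, not part of this PR. It assumes `sudo -n true` is an adequate proxy, which holds only if the sudoers rule for `true` matches the rule for the flush command; pass the real command as `cmd` for an exact answer (which would also perform the flush).

```python
import subprocess


def runs_without_prompt(cmd=("sudo", "-n", "true"), timeout_s=10) -> bool:
    """Return True if cmd exits 0 with no terminal attached to its stdin."""
    try:
        result = subprocess.run(
            list(cmd),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # sudo missing, not executable, or hung
    return result.returncode == 0
```

A `False` result corresponds to the first scenario (expect the one-line warning), and `True` to the second (expect the flush and no warning).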

The per-epoch drop_caches call invoked sudo interactively, inheriting the
user's TTY from mpirun's stdin. When NOPASSWD wasn't configured, sudo
blocked at the password prompt, and when mlpstorage_py's rich progress
bar overwrote the prompt line, users could not see or respond to it.
The bare `except Exception: pass` then swallowed the 30s timeout
silently every epoch. The reporter of mlcommons/storage#391 sat on a
hung training run for ~16 hours.

Two changes:

1. `sudo -n` (non-interactive) plus stdin redirected to /dev/null —
   sudo fails immediately instead of prompting for a password it can't
   read.

2. A `drop_caches_disabled` flag set on the first failure suppresses
   subsequent attempts and emits a single explanatory warning telling
   the user how to enable passwordless sudo for the flush. No more
   silent 30s/epoch stalls, no more log spam.

Behavior on a properly-configured host (NOPASSWD sudo) is unchanged.

Refs mlcommons/storage#391
@FileSystemGuy FileSystemGuy requested a review from a team June 16, 2026 21:51
@FileSystemGuy

@russfellows @idevasena
Don't think too deeply about this, unless you want to, just approve it. :-)

@russfellows russfellows left a comment

Curtis asked for approval.

@russfellows russfellows merged commit 4626f48 into main Jun 17, 2026
7 checks passed
@russfellows russfellows deleted the FileSystemGuy-issue391-sudo-noninteractive branch June 17, 2026 15:33