fix(main): use sudo -n for page-cache flush; warn once on failure #23
Merged
The per-epoch drop_caches call invoked sudo interactively, inheriting the user's TTY from mpirun's stdin. When sudo's NOPASSWD wasn't configured, sudo blocked at the password prompt, and when mlpstorage_py's rich progress bar overwrote the prompt line, users could not see or respond to it. The bare `except Exception: pass` then swallowed the 30s timeout silently every epoch. The reporter of mlcommons/storage#391 sat on a hung training run for ~16 hours.

Two changes:

1. `sudo -n` (non-interactive) plus stdin redirected to /dev/null: sudo fails immediately instead of prompting for a password it can't read.
2. A `drop_caches_disabled` flag set on the first failure suppresses subsequent attempts and emits a single explanatory warning telling the user how to enable passwordless sudo for the flush. No more silent 30s/epoch stalls, no more log spam.

Behavior on a properly configured host (NOPASSWD sudo) is unchanged.

Refs mlcommons/storage#391
@russfellows @idevasena
Addresses the DLIO side of mlcommons/storage#391.
Problem
`DLIOBenchmark._train()` calls `subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True, timeout=30)` at the start of every epoch (`dlio_benchmark/main.py:451-458`). On a host without NOPASSWD sudo, sudo:

- inherits the user's TTY from mpirun's stdin and blocks at the password prompt;
- has its prompt line overwritten by mlpstorage_py's rich progress bar, so users cannot see or respond to it.

The call is wrapped in a bare `except Exception: pass`, swallowing the 30s timeout silently.

The mlcommons/storage#391 reporter let a single run sit at the prompt for ~16 hours before killing it. (The downstream PyTorch DataLoader fork-after-MPI_Init hang they hit is being addressed separately on the storage side; that's not in scope here.)
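To make the failure mode concrete, here is a minimal sketch of the swallow-everything pattern described above. It is not the code from `dlio_benchmark/main.py`: the helper name `flush_like_before` is hypothetical, and a sleeping child process stands in for a sudo call stuck at a password prompt, with a short timeout so the example finishes quickly.

```python
import subprocess
import sys
import time


def flush_like_before(cmd, timeout):
    """Hypothetical stand-in for the pre-fix pattern: run, ignore any failure."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except Exception:
        # TimeoutExpired and CalledProcessError both disappear here,
        # so the caller never learns that the flush did not happen.
        pass


# Assumption for illustration: a child that only sleeps plays the role of
# sudo waiting on a password prompt nobody can see.
stuck_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]

start = time.monotonic()
flush_like_before(stuck_cmd, timeout=0.5)  # the real call used timeout=30
elapsed = time.monotonic() - start

# No exception and no log line: the only trace is the lost wall-clock time.
print(f"returned silently after {elapsed:.1f}s")
```

With the real 30s timeout, the same pattern costs roughly 30 seconds per epoch and leaves nothing in the log to explain why.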
Fix
- `sudo -n` so sudo exits non-zero immediately when password input is required, instead of prompting.
- A `drop_caches_disabled` flag set on the first failure prevents subsequent attempts and emits a single explanatory warning telling the user how to enable passwordless sudo for this exact command. No log spam, no per-epoch stalls.

Hosts with NOPASSWD sudo (the supported path) see identical behavior.
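The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged diff: the `PageCacheFlusher` class, the injectable `run` parameter, the logger name, and the warning wording are all hypothetical. Only the `sudo -n` flag, the /dev/null stdin, the 30s timeout, and the `drop_caches_disabled` flag come from the description above.

```python
import logging
import subprocess

logger = logging.getLogger("drop_caches_sketch")  # hypothetical logger name

# -n makes sudo fail instead of prompting when a password would be needed.
DROP_CACHES_CMD = ["sudo", "-n", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"]


class PageCacheFlusher:
    """Hypothetical helper: flush the page cache, give up after the first failure."""

    def __init__(self, run=subprocess.run):
        self._run = run  # injectable so the logic can be tested without sudo
        self.drop_caches_disabled = False

    def flush(self):
        """Return True if the flush ran, False if it failed or is disabled."""
        if self.drop_caches_disabled:
            return False  # already warned once; stay quiet and skip the call
        try:
            self._run(
                DROP_CACHES_CMD,
                check=True,
                timeout=30,
                stdin=subprocess.DEVNULL,  # never read a password from the TTY
            )
            return True
        except (subprocess.SubprocessError, OSError) as exc:
            self.drop_caches_disabled = True
            logger.warning(
                "Page-cache flush failed (%s); skipping it for the rest of the "
                "run. Configure passwordless sudo for this command to enable it.",
                exc,
            )
            return False
```

Calling `flush()` once per epoch then costs at most one failed sudo invocation per run: the first failure sets the flag and logs the warning, and every later call returns immediately.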
Test plan
- With `Defaults:USER NOPASSWD: /bin/sh` set for the drop_caches command, confirm the flush still happens and the warning never fires.