Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp
--workersforpreprocess_data.pytomax(1, self._args.num_workers)to avoidmultiprocessing.Pool(0)failures. - Add inline clarification comment documenting why preprocessing clamps workers while other components may allow
num_workers=0.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Megatron's preprocess_data.py appends '_text_document' to the
--output-prefix when producing the .bin/.idx files. Derive the
output-prefix from --data_prefix (stripping a trailing
'_text_document' suffix when present) so that the generated files
match the existence checks for any custom data_prefix value, instead
of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
num_workers=0 clamps to '--workers 1' with default data_prefix
('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
with custom 'custom_text_document' -> '<data_home>/custom', and a
data_prefix without the suffix used as-is.
Avoid '--workers 1' matching '--workers 10' etc.
a475b85 to
ee2839e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #800 +/- ##
==========================================
+ Coverage 85.89% 86.02% +0.12%
==========================================
Files 103 103
Lines 7934 7950 +16
==========================================
+ Hits 6815 6839 +24
+ Misses 1119 1111 -8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…et generate test
- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value. - Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case. - Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix test pollution: clean up pretrain_gpt.py created by test_megatron_gpt_dataset_generate_command so it does not break the alphabetically-later test_megatron_gpt_preprocess (root cause of CI cpu-unit-test failures on python-3.7/3.10/3.12). - Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead of .args[0] (mock.call.args is Python 3.8+). - Enforce data_prefix contract in _generate_dataset(): validate that data_prefix ends with '_text_document' and has a non-empty stem before invoking preprocess_data.py. Fail fast with a clear error otherwise (preprocess_data.py always appends '_text_document' to --output-prefix, so other prefixes would silently produce files whose names mismatch the existence check). - Replace test Cases 3 and 4 (which masked the contract bug via a side_effect that lied about preprocess_data.py output) with negative cases that assert _preprocess() fails fast and never invokes run_command for invalid data_prefix values.
- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.
…tent return codes Resolves the remaining unresolved Copilot inline comments on PR #800: - 3326252223 (megatron_gpt3.py:679): Hoist 'num_workers < 0' validation to the top of _generate_dataset() so it fires unconditionally, including when dataset files already exist (previously a negative value could slip through and be emitted as '--num-workers -1' in the Megatron training command). - 3326252262 (megatron_gpt3.py:672): Change invalid data_prefix return code from DATASET_GENERATION_FAILURE to INVALID_ARGUMENT for consistency with other CLI-arg validation in this benchmark (code_base, micro_batch_size, precision). - 3326252281 (test_megatron_gpt.py:248): Update the test to assert INVALID_ARGUMENT (matching the new contract) and update _run_invalid_workers_case to assert no downloads happen, since num_workers is now validated before _generate_dataset() reaches the download block. Also fixes the CI lint failures on builds 59269 (cpu-unit-test) and 59270 (cuda-unit-test): - yapf: removing the multi-line 'logger.error(...)' block from the '--workers' branch (now a single-line check at the top of the function) resolves the yapf reformat diff that was breaking python-3.7 and python-3.10 lint jobs. - C901: silence 'too complex (11)' on test_megatron_gpt_dataset_generate_command with '# noqa: C901', consistent with the existing pattern in this repo (e.g. _generate_dataset itself, _megatron_command, test_generate_config). Verification: python -m flake8 on both files exits 0; the 6 directly relevant megatron tests pass. The pre-existing test_megatron_gpt_command failure in the local env is an environmental PermissionError on /tmp/ds_config_gpt.json, unrelated to this PR.
The previous '# noqa: C901' did not suppress the failure on CI's older flake8 (Python 3.7), which attributes the C901 violation to the first decorator line rather than the def line, so the noqa on the def line was ignored (build 59278, cpu-unit-test Job python-3.7). Instead of chasing version-specific noqa placement, reduce the actual cyclomatic complexity of test_megatron_gpt_dataset_generate_command from 11 to 10 (== max-complexity) and drop the noqa: - Merge the two near-identical helpers _run_invalid_case and _run_invalid_workers_case into a single _run_invalid_case(extra_params, expected_downloads), distinguishing the negative-num_workers case (0 downloads, rejected before any download) from the invalid-data_prefix cases (2 downloads: vocab + merges occur before the suffix check). - Fold the pretrain_gpt.py cleanup into _cleanup_created_files, removing the lambda conditional-expression. Verification: flake8 (default config, max-complexity=10) exits 0 on both files; the three affected megatron tests still pass.
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.