Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation by polarG · Pull Request #800 · microsoft/superbenchmark

polarG · 2026-04-01T06:07:20Z

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.

Copilot

Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Megatron's preprocess_data.py appends '_text_document' to the --output-prefix when producing the .bin/.idx files. Derive the output-prefix from --data_prefix (stripping a trailing '_text_document' suffix when present) so that the generated files match the existence checks for any custom data_prefix value, instead of being hardcoded to '<data_home>/dataset'. - Add unit test test_megatron_gpt_dataset_generate_command covering: num_workers=0 clamps to '--workers 1' with default data_prefix ('dataset_text_document' -> '<data_home>/dataset'), num_workers=4 with custom 'custom_text_document' -> '<data_home>/custom', and a data_prefix without the suffix used as-is.

Avoid '--workers 1' matching '--workers 10' etc.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2026-04-27T23:20:03Z

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.02%. Comparing base (b6c3a8f) to head (002bfcb).

Files with missing lines	Patch %	Lines
...perbench/benchmarks/model_benchmarks/model_base.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   85.89%   86.02%   +0.12%     
==========================================
  Files         103      103              
  Lines        7934     7950      +16     
==========================================
+ Hits         6815     6839      +24     
+ Misses       1119     1111       -8

Flag	Coverage Δ
cpu-python3.10-unit-test	`70.88% <94.44%> (+0.16%)`	⬆️
cpu-python3.12-unit-test	`70.88% <94.44%> (+0.16%)`	⬆️
cpu-python3.7-unit-test	`70.31% <94.44%> (+0.16%)`	⬆️
cuda-unit-test	`83.95% <94.44%> (+0.13%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…et generate test - Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting. - Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.

- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value. - Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case. - Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix test pollution: clean up pretrain_gpt.py created by test_megatron_gpt_dataset_generate_command so it does not break the alphabetically-later test_megatron_gpt_preprocess (root cause of CI cpu-unit-test failures on python-3.7/3.10/3.12). - Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead of .args[0] (mock.call.args is Python 3.8+). - Enforce data_prefix contract in _generate_dataset(): validate that data_prefix ends with '_text_document' and has a non-empty stem before invoking preprocess_data.py. Fail fast with a clear error otherwise (preprocess_data.py always appends '_text_document' to --output-prefix, so other prefixes would silently produce files whose names mismatch the existence check). - Replace test Cases 3 and 4 (which masked the contract bug via a side_effect that lied about preprocess_data.py output) with negative cases that assert _preprocess() fails fast and never invokes run_command for invalid data_prefix values.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

…tent return codes Resolves the remaining unresolved Copilot inline comments on PR #800: - 3326252223 (megatron_gpt3.py:679): Hoist 'num_workers < 0' validation to the top of _generate_dataset() so it fires unconditionally, including when dataset files already exist (previously a negative value could slip through and be emitted as '--num-workers -1' in the Megatron training command). - 3326252262 (megatron_gpt3.py:672): Change invalid data_prefix return code from DATASET_GENERATION_FAILURE to INVALID_ARGUMENT for consistency with other CLI-arg validation in this benchmark (code_base, micro_batch_size, precision). - 3326252281 (test_megatron_gpt.py:248): Update the test to assert INVALID_ARGUMENT (matching the new contract) and update _run_invalid_workers_case to assert no downloads happen, since num_workers is now validated before _generate_dataset() reaches the download block. Also fixes the CI lint failures on builds 59269 (cpu-unit-test) and 59270 (cuda-unit-test): - yapf: removing the multi-line 'logger.error(...)' block from the '--workers' branch (now a single-line check at the top of the function) resolves the yapf reformat diff that was breaking python-3.7 and python-3.10 lint jobs. - C901: silence 'too complex (11)' on test_megatron_gpt_dataset_generate_command with '# noqa: C901', consistent with the existing pattern in this repo (e.g. _generate_dataset itself, _megatron_command, test_generate_config). Verification: python -m flake8 on both files exits 0; the 6 directly relevant megatron tests pass. The pre-existing test_megatron_gpt_command failure in the local env is an environmental PermissionError on /tmp/ds_config_gpt.json, unrelated to this PR.

The previous '# noqa: C901' did not suppress the failure on CI's older flake8 (Python 3.7), which attributes the C901 violation to the first decorator line rather than the def line, so the noqa on the def line was ignored (build 59278, cpu-unit-test Job python-3.7). Instead of chasing version-specific noqa placement, reduce the actual cyclomatic complexity of test_megatron_gpt_dataset_generate_command from 11 to 10 (== max-complexity) and drop the noqa: - Merge the two near-identical helpers _run_invalid_case and _run_invalid_workers_case into a single _run_invalid_case(extra_params, expected_downloads), distinguishing the negative-num_workers case (0 downloads, rejected before any download) from the invalid-data_prefix cases (2 downloads: vocab + merges occur before the suffix check). - Fold the pretrain_gpt.py cleanup into _cleanup_created_files, removing the lambda conditional-expression. Verification: flake8 (default config, max-complexity=10) exits 0 on both files; the three affected megatron tests still pass.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Fix the num_workers usage.

203a883

polarG requested a review from a team as a code owner April 1, 2026 06:07

Copilot AI review requested due to automatic review settings April 1, 2026 06:07

Copilot started reviewing on behalf of polarG April 1, 2026 06:08 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

polarG self-assigned this Apr 17, 2026

polarG added enhancement New feature or request ROCm labels Apr 27, 2026

Copilot AI review requested due to automatic review settings April 27, 2026 22:02

Tighten test substring matches with trailing space

ee2839e

Avoid '--workers 1' matching '--workers 10' etc.

polarG force-pushed the dev/hongtaozhang/fix-num_workers-usage branch from a475b85 to ee2839e Compare April 27, 2026 22:03

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Hongtao Zhang added 2 commits May 2, 2026 00:19

Copilot AI review requested due to automatic review settings May 3, 2026 05:05

Copilot started reviewing on behalf of polarG May 3, 2026 05:05 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

polarG and others added 2 commits May 15, 2026 13:18

Merge branch 'main' into dev/hongtaozhang/fix-num_workers-usage

77854b1

Copilot AI review requested due to automatic review settings May 15, 2026 20:50

Copilot started reviewing on behalf of polarG May 15, 2026 20:50 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

style: apply yapf 0.31.0 formatting (CI lint fix)

d580e6e

- Collapse multi-line .format() call in _generate_dataset error message. - Add blank line before 'return _side_effect' in nested function in test.

guoshzhao approved these changes May 26, 2026

View reviewed changes

WenqingLan1 approved these changes May 26, 2026

View reviewed changes

Merge branch 'main' into dev/hongtaozhang/fix-num_workers-usage

f786c64

Copilot AI review requested due to automatic review settings May 27, 2026 04:56

polarG enabled auto-merge (squash) May 27, 2026 04:56

Copilot started reviewing on behalf of polarG May 27, 2026 04:56 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Hongtao Zhang and others added 2 commits May 27, 2026 21:28

Harden dataset generation argument validation

a88294d

Merge branch 'main' into dev/hongtaozhang/fix-num_workers-usage

58d15fb

Copilot AI review requested due to automatic review settings May 29, 2026 18:30

Copilot started reviewing on behalf of polarG May 29, 2026 18:30 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated

Hongtao Zhang added 2 commits May 29, 2026 21:42

Copilot AI review requested due to automatic review settings May 29, 2026 23:57

Copilot started reviewing on behalf of polarG May 29, 2026 23:57 View session

Copilot AI reviewed May 30, 2026

View reviewed changes

polarG merged commit e271df3 into main May 30, 2026
25 of 27 checks passed

polarG deleted the dev/hongtaozhang/fix-num_workers-usage branch May 30, 2026 01:06

Conversation

polarG commented Apr 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Apr 27, 2026 •

edited

Loading