Skip to content

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800

Merged
polarG merged 13 commits into
mainfrom
dev/hongtaozhang/fix-num_workers-usage
May 30, 2026
Merged

Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
polarG merged 13 commits into
mainfrom
dev/hongtaozhang/fix-num_workers-usage

Conversation

@polarG
Copy link
Copy Markdown
Contributor

@polarG polarG commented Apr 1, 2026

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.

@polarG polarG requested a review from a team as a code owner April 1, 2026 06:07
Copilot AI review requested due to automatic review settings April 1, 2026 06:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

  • Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
  • Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
@polarG polarG self-assigned this Apr 17, 2026
@polarG polarG added enhancement New feature or request ROCm labels Apr 27, 2026
- Megatron's preprocess_data.py appends '_text_document' to the
  --output-prefix when producing the .bin/.idx files. Derive the
  output-prefix from --data_prefix (stripping a trailing
  '_text_document' suffix when present) so that the generated files
  match the existence checks for any custom data_prefix value, instead
  of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
  num_workers=0 clamps to '--workers 1' with default data_prefix
  ('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
  with custom 'custom_text_document' -> '<data_home>/custom', and a
  data_prefix without the suffix used as-is.
Copilot AI review requested due to automatic review settings April 27, 2026 22:02
Avoid '--workers 1' matching '--workers 10' etc.
@polarG polarG force-pushed the dev/hongtaozhang/fix-num_workers-usage branch from a475b85 to ee2839e Compare April 27, 2026 22:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.02%. Comparing base (b6c3a8f) to head (002bfcb).

Files with missing lines Patch % Lines
...perbench/benchmarks/model_benchmarks/model_base.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   85.89%   86.02%   +0.12%     
==========================================
  Files         103      103              
  Lines        7934     7950      +16     
==========================================
+ Hits         6815     6839      +24     
+ Misses       1119     1111       -8     
Flag Coverage Δ
cpu-python3.10-unit-test 70.88% <94.44%> (+0.16%) ⬆️
cpu-python3.12-unit-test 70.88% <94.44%> (+0.16%) ⬆️
cpu-python3.7-unit-test 70.31% <94.44%> (+0.16%) ⬆️
cuda-unit-test 83.95% <94.44%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Hongtao Zhang added 2 commits May 2, 2026 00:19
…et generate test

- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
- Warn when num_workers is silently clamped from 0 to 1 for the
  preprocess subprocess so the user sees the override in the log.
  The DataLoader still receives the original num_workers value.
- Guard the '_text_document' suffix-strip against the edge case where
  data_prefix == '_text_document' (which would otherwise produce a
  malformed --output-prefix '<data_home>/' with an empty basename).
  Fall back to the original data_prefix value in that case.
- Extend test_megatron_gpt_dataset_generate_command with a 4th case
  asserting the empty-basename fallback.
Copilot AI review requested due to automatic review settings May 3, 2026 05:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
polarG and others added 2 commits May 15, 2026 13:18
- Fix test pollution: clean up pretrain_gpt.py created by
  test_megatron_gpt_dataset_generate_command so it does not break the
  alphabetically-later test_megatron_gpt_preprocess (root cause of CI
  cpu-unit-test failures on python-3.7/3.10/3.12).
- Fix Python 3.7 incompatibility: use call_args_list[0][0][0] instead
  of .args[0] (mock.call.args is Python 3.8+).
- Enforce data_prefix contract in _generate_dataset(): validate that
  data_prefix ends with '_text_document' and has a non-empty stem
  before invoking preprocess_data.py. Fail fast with a clear error
  otherwise (preprocess_data.py always appends '_text_document' to
  --output-prefix, so other prefixes would silently produce files
  whose names mismatch the existence check).
- Replace test Cases 3 and 4 (which masked the contract bug via a
  side_effect that lied about preprocess_data.py output) with
  negative cases that assert _preprocess() fails fast and never
  invokes run_command for invalid data_prefix values.
Copilot AI review requested due to automatic review settings May 15, 2026 20:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

- Collapse multi-line .format() call in _generate_dataset error message.
- Add blank line before 'return _side_effect' in nested function in test.
Copilot AI review requested due to automatic review settings May 27, 2026 04:56
@polarG polarG enabled auto-merge (squash) May 27, 2026 04:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
Copilot AI review requested due to automatic review settings May 29, 2026 18:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
Hongtao Zhang added 2 commits May 29, 2026 21:42
…tent return codes

Resolves the remaining unresolved Copilot inline comments on PR #800:

- 3326252223 (megatron_gpt3.py:679): Hoist 'num_workers < 0' validation to the
  top of _generate_dataset() so it fires unconditionally, including when
  dataset files already exist (previously a negative value could slip through
  and be emitted as '--num-workers -1' in the Megatron training command).
- 3326252262 (megatron_gpt3.py:672): Change invalid data_prefix return code
  from DATASET_GENERATION_FAILURE to INVALID_ARGUMENT for consistency with
  other CLI-arg validation in this benchmark (code_base, micro_batch_size,
  precision).
- 3326252281 (test_megatron_gpt.py:248): Update the test to assert
  INVALID_ARGUMENT (matching the new contract) and update
  _run_invalid_workers_case to assert no downloads happen, since num_workers
  is now validated before _generate_dataset() reaches the download block.

Also fixes the CI lint failures on builds 59269 (cpu-unit-test) and 59270
(cuda-unit-test):

- yapf: removing the multi-line 'logger.error(...)' block from the
  '--workers' branch (now a single-line check at the top of the function)
  resolves the yapf reformat diff that was breaking python-3.7 and
  python-3.10 lint jobs.
- C901: silence 'too complex (11)' on test_megatron_gpt_dataset_generate_command
  with '# noqa: C901', consistent with the existing pattern in this repo
  (e.g. _generate_dataset itself, _megatron_command, test_generate_config).

Verification: python -m flake8 on both files exits 0; the 6 directly relevant
megatron tests pass. The pre-existing test_megatron_gpt_command failure in
the local env is an environmental PermissionError on /tmp/ds_config_gpt.json,
unrelated to this PR.
The previous '# noqa: C901' did not suppress the failure on CI's older flake8
(Python 3.7), which attributes the C901 violation to the first decorator line
rather than the def line, so the noqa on the def line was ignored
(build 59278, cpu-unit-test Job python-3.7).

Instead of chasing version-specific noqa placement, reduce the actual
cyclomatic complexity of test_megatron_gpt_dataset_generate_command from 11
to 10 (== max-complexity) and drop the noqa:

- Merge the two near-identical helpers _run_invalid_case and
  _run_invalid_workers_case into a single _run_invalid_case(extra_params,
  expected_downloads), distinguishing the negative-num_workers case (0
  downloads, rejected before any download) from the invalid-data_prefix
  cases (2 downloads: vocab + merges occur before the suffix check).
- Fold the pretrain_gpt.py cleanup into _cleanup_created_files, removing the
  lambda conditional-expression.

Verification: flake8 (default config, max-complexity=10) exits 0 on both
files; the three affected megatron tests still pass.
Copilot AI review requested due to automatic review settings May 29, 2026 23:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@polarG polarG merged commit e271df3 into main May 30, 2026
25 of 27 checks passed
@polarG polarG deleted the dev/hongtaozhang/fix-num_workers-usage branch May 30, 2026 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants