Skip to content

fix(interface): reject duplicate names within output_processors (#675)#697

Merged
nabinchha merged 1 commit into
NVIDIA-NeMo:mainfrom
SAY-5:fix/675-distinct-output-processor-names
May 21, 2026
Merged

fix(interface): reject duplicate names within output_processors (#675)#697
nabinchha merged 1 commit into
NVIDIA-NeMo:mainfrom
SAY-5:fix/675-distinct-output-processor-names

Conversation

@SAY-5
Copy link
Copy Markdown
Contributor

@SAY-5 SAY-5 commented May 21, 2026

📋 Summary

_validate_distinct_output_processors only compared output-processor names against the stage's existing processors, so two output processors with the same name silently overwrote one another's artifact directory. This patch raises DataDesignerWorkflowError when names repeat within output_processors.

🔗 Related Issue

Fixes #675

🔄 Changes

  • Detect intra-list duplicates in _validate_distinct_output_processors before the cross-list check.
  • Add a regression test that asserts the new error message in test_composite_workflow.py.

🧪 Testing

  • Targeted pytest suite for composite_workflow (44 tests) passes.
  • Unit test added that fails without the fix.

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)

@SAY-5 SAY-5 requested a review from a team as a code owner May 21, 2026 07:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@SAY-5
Copy link
Copy Markdown
Contributor Author

SAY-5 commented May 21, 2026

I have read the DCO document and I hereby sign the DCO.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR fixes a silent bug where two output_processors with the same name would silently overwrite each other's artifact directory. The fix adds an intra-list duplicate check inside _validate_distinct_output_processors that raises DataDesignerWorkflowError before the existing cross-list check runs.

  • Adds a two-pass duplicate detection (via seen/duplicate_within sets) at the top of _validate_distinct_output_processors in composite_workflow.py.
  • Adds a targeted regression test in test_composite_workflow.py confirming the new error message fires when two DropColumnsProcessorConfig entries share a name.

Confidence Score: 5/5

Safe to merge — the change is a targeted guard added before an already-working check, with no behavioral impact on valid inputs.

The fix is minimal: two sets and a single loop pass detect intra-list duplicates before the existing cross-list validation runs. Valid inputs are unaffected, the error path is clearly tested, and no existing tests were modified.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer/src/data_designer/interface/composite_workflow.py Adds intra-list duplicate name detection to _validate_distinct_output_processors; logic is correct and the existing cross-list check is preserved.
packages/data-designer/tests/interface/test_composite_workflow.py Adds a well-scoped regression test that verifies the new error message fires when two output processors share a name.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[add_stage called with output_processors] --> B[_validate_distinct_output_processors]
    B --> C{Any duplicate names\nwithin output_processors?}
    C -- Yes --> D[Raise DataDesignerWorkflowError\n'distinct within output_processors']
    C -- No --> E{Any names clash with\nexisting stage processors?}
    E -- Yes --> F[Raise DataDesignerWorkflowError\n'distinct from stage processor names']
    E -- No --> G[Validation passes]
Loading

Reviews (1): Last reviewed commit: "fix(interface): reject duplicate names w..." | Re-trigger Greptile

@nabinchha nabinchha added the triaged Issue reviewed and approved by a maintainer label May 21, 2026
@nabinchha
Copy link
Copy Markdown
Contributor

Thanks for the quick turnaround on this, @SAY-5 — nice tight, well-scoped fix.

Summary

Extends _validate_distinct_output_processors to flag intra-list duplicates inside output_processors before doing the existing cross-list check against stage processors, plus a regression test. Matches the intent in #675 exactly: two output processors with the same name would have silently clobbered each other's artifact dir, and now you get a clear DataDesignerWorkflowError instead.

Findings

Suggestions — Worth considering

packages/data-designer/src/data_designer/interface/composite_workflow.py:561-569 — Optional simplification with Counter

  • What: The intra-list dedup loop is fine as-is, but it can be expressed slightly more idiomatically with collections.Counter, which also extends naturally if you ever want to surface the duplicate count.
  • Why: Purely stylistic — current loop is clear and KISS-compliant. Mention it only because it shaves a couple of lines and removes the seen/duplicate_within bookkeeping.
  • Suggestion:
from collections import Counter

counts = Counter(processor.name for processor in output_processors)
duplicates = sorted(name for name, count in counts.items() if count > 1)
if duplicates:
    raise DataDesignerWorkflowError(
        f"Output processor names must be distinct within output_processors: {', '.join(duplicates)}."
    )

Totally happy to leave the explicit loop if you prefer it — readability is on par.

packages/data-designer/tests/interface/test_composite_workflow.py:391 — Slightly tighten the regex

  • What: match="distinct within output_processors" already disambiguates from the cross-list error, which is great. If you wanted to be extra defensive, you could match the duplicated name too, so an accidental rewording that drops the offending names from the message would still fail the test.
  • Why: Minor — guards against silent regressions in the error payload.
  • Suggestion:
with pytest.raises(DataDesignerWorkflowError, match=r"distinct within output_processors: drop_scratch"):

What Looks Good

  • Right place for the check: the intra-list rule lives inside _validate_distinct_output_processors, so the function's name still accurately describes everything it enforces, and the call site in add_stage is unchanged.
  • Validation runs before any state mutation (self._stages.append(...)), so a rejected add_stage leaves the workflow untouched — consistent with the surrounding validators.
  • Error message is sorted (", ".join(sorted(...))) so it's deterministic across runs — small detail, but it'll make test output / log diffs stable.
  • The new test pairs nicely with test_composite_workflow_rejects_duplicate_output_processor_names and uses two distinct column_names to confirm we're rejecting by name, not by object identity.
  • Canonical error type (DataDesignerWorkflowError) is preserved, in line with the interface error contract in AGENTS.md / STYLEGUIDE.md.

Verdict

Ship it (with nits) — both suggestions above are optional. ruff check and ruff format --check pass on the changed files. Nothing blocking.


This review was generated by an AI assistant.

@nabinchha nabinchha removed the triaged Issue reviewed and approved by a maintainer label May 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Issue #675 has been triaged. The linked issue check is being re-evaluated.

Copy link
Copy Markdown
Contributor

@johnnygreco johnnygreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 Looks good – Thank you!!

@nabinchha nabinchha merged commit 000fc09 into NVIDIA-NeMo:main May 21, 2026
56 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reject duplicate output processor names in composite workflows

3 participants