
Fix PySparkProcessor V3 ProcessingInput construction #5759

Open
Evan-W-ang wants to merge 1 commit into aws:master from Evan-W-ang:fix/pysparkprocessor-v3-processinginput

Conversation

@Evan-W-ang

Use V3-compatible ProcessingInput construction in PySparkProcessor.

PySparkProcessor still built internal ProcessingInput objects with the
legacy source/destination fields in _stage_configuration() and
_stage_submit_deps(). In V3, ProcessingInput now expects s3_input, so
those internal code paths can fail during pipeline definition or upsert
with validation errors.

This change updates both code paths to build ProcessingInput with
ProcessingS3Input while preserving the same staged S3 URIs and local
mount paths. It also adds regression tests covering configuration
staging and local dependency staging.

@Evan-W-ang
Author

Summary

This PR updates PySparkProcessor to construct ProcessingInput using the
V3-compatible s3_input=ProcessingS3Input(...) shape instead of the legacy
source / destination fields.

Problem

In V3, sagemaker.core.processing.ProcessingInput no longer accepts:

  • source
  • destination

and instead expects V3 fields such as input_name and s3_input.

However, PySparkProcessor still used the legacy constructor internally in:

  • _stage_configuration()
  • _stage_submit_deps()

This can cause validation failures during pipeline definition / upsert.
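The shape change can be sketched with minimal stand-in classes. Note these are illustrative only: the field names (`s3_uri`, `local_path`) are assumptions based on the description in this PR, not the actual sagemaker-core signatures.

```python
from dataclasses import dataclass

# Stand-ins mimicking the V3 shapes described above; the real classes
# live in sagemaker.core.processing and their fields may differ.
@dataclass
class ProcessingS3Input:
    s3_uri: str      # staged location in S3
    local_path: str  # mount path inside the processing container

@dataclass
class ProcessingInput:
    input_name: str
    s3_input: ProcessingS3Input

# Legacy (pre-V3) call -- now rejected with "Extra inputs are not permitted":
#   ProcessingInput(source="s3://bucket/conf",
#                   destination="/opt/ml/processing/input/conf")

# V3-compatible construction:
conf_input = ProcessingInput(
    input_name="conf",
    s3_input=ProcessingS3Input(
        s3_uri="s3://bucket/prefix/configuration.json",
        local_path="/opt/ml/processing/input/conf",
    ),
)
```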

Fix

This change:

  1. replaces internal legacy ProcessingInput(...) construction with the
    V3-style s3_input=ProcessingS3Input(...) shape
  2. preserves the existing S3 staging behavior
  3. preserves the existing local mount path behavior
  4. avoids relying on legacy .destination access where an explicit local path is sufficient
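The steps above can be sketched as a single hypothetical helper (the names `build_staged_input`, `s3_uri`, and `local_path` are illustrative assumptions; the real code lives in `_stage_configuration()` and `_stage_submit_deps()`):

```python
from dataclasses import dataclass

# Minimal stand-ins for the V3 models; field names are assumptions.
@dataclass
class ProcessingS3Input:
    s3_uri: str
    local_path: str

@dataclass
class ProcessingInput:
    input_name: str
    s3_input: ProcessingS3Input

def build_staged_input(name: str, staged_s3_uri: str,
                       mount_path: str) -> ProcessingInput:
    """Hypothetical helper mirroring the fix: the staged S3 URI and the
    local mount path are unchanged; only the constructor shape differs,
    and the mount path is passed in explicitly instead of being read
    back via the legacy .destination attribute."""
    return ProcessingInput(
        input_name=name,
        s3_input=ProcessingS3Input(s3_uri=staged_s3_uri,
                                   local_path=mount_path),
    )

deps_input = build_staged_input(
    "submit_deps",
    "s3://bucket/prefix/submit_deps.zip",
    "/opt/ml/processing/input/py_files",
)
```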

Tests

Added regression tests covering:

  • _stage_configuration() building a V3-compatible ProcessingInput
  • _stage_submit_deps() building a V3-compatible ProcessingInput for local dependencies

Example failure before this change

```
ValidationError: 2 validation errors for ProcessingInput
source
  Extra inputs are not permitted
destination
  Extra inputs are not permitted
```

Motivation

Users migrating to V3 naturally update their own processing inputs/outputs to the new schema, but Spark processing can still fail because of internal legacy construction in PySparkProcessor. This patch makes that internal behavior consistent with the V3 processing models.


Test command
```bash
cd ~/sagemaker-python-sdk/sagemaker-core
. .venv/bin/activate
python -m pytest tests/unit/spark/test_processing.py tests/unit/test_processing.py -q
```

Files to include

  • sagemaker-core/src/sagemaker/core/spark/processing.py
  • sagemaker-core/tests/unit/spark/test_processing.py

