
fix(remote): remove -U flag from auto-injected sagemaker requirements install #5922

Open
nileshpatil6 wants to merge 1 commit into aws:master from nileshpatil6:fix/remote-decorator-drop-pip-upgrade-flag


@nileshpatil6

Fixes #5872

What happened

The @remote (and @step) decorator injects sagemaker>=3.2.0,<4.0.0 into a temporary requirements file, then installs it with:

pip install -r <requirements_file> -U

The -U flag forces pip to upgrade sagemaker to the latest version within that range, even when a compatible version is already installed. The logs show this clearly:

Requirement already satisfied: sagemaker<4.0.0,>=3.2.0 ... (3.5.0)
Collecting sagemaker<4.0.0,>=3.2.0 ...
Found existing installation: sagemaker 3.5.0
Uninstalling sagemaker-3.5.0:
  Successfully uninstalled sagemaker-3.5.0
Successfully installed sagemaker-3.11.0

After the forced upgrade, the container runs 3.11.0 and tries to deserialize a payload that the 3.5.0 client serialized in a different format, which raises DeserializationError.

Fix

Remove the -U flag from _install_requirements_txt and _install_req_txt_in_conda_env in both sagemaker-core and sagemaker-train. The version constraint (>=3.2.0,<4.0.0) already guarantees a compatible version will be present; there is no reason to force an upgrade.
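
As a rough illustration of the change (not the actual implementation: `build_pip_install_cmd` and its `upgrade` parameter are hypothetical names used only for this sketch), the difference in the command that gets issued is just the trailing flag:

```python
def build_pip_install_cmd(requirements_file: str, upgrade: bool = False) -> str:
    """Build a pip install command string for a requirements file.

    Hypothetical helper for illustration only; the methods touched by this
    PR are _install_requirements_txt and _install_req_txt_in_conda_env.
    """
    cmd = f"pip install -r {requirements_file}"
    if upgrade:
        # Old behaviour: -U asks pip to move listed requirements to the
        # newest version that still satisfies their constraints.
        cmd += " -U"
    return cmd


before = build_pip_install_cmd("requirements.txt", upgrade=True)
after = build_pip_install_cmd("requirements.txt")
print(before)
print(after)
```

The file name `requirements.txt` is a placeholder; in the real flow the path points at the temporary requirements file mentioned above.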

Files changed

  • sagemaker-core/src/sagemaker/core/remote_function/runtime_environment/runtime_environment_manager.py
  • sagemaker-train/src/sagemaker/train/remote_function/runtime_environment/runtime_environment_manager.py

Testing

Existing unit tests pass (47 passed in test_runtime_environment_manager.py for sagemaker-core). No test changes needed since the tests mock _run_shell_cmd and verify it is called, which still holds.
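
If a stricter regression guard is wanted, a test could also inspect the command string passed to the mocked `_run_shell_cmd`. The following is a minimal sketch under stated assumptions: `FakeRuntimeEnvironmentManager` is a hypothetical stand-in, not the real class, and the command string it builds is assumed rather than copied from the source files.

```python
from unittest import mock


class FakeRuntimeEnvironmentManager:
    """Hypothetical stand-in for the real manager class; illustration only."""

    def _run_shell_cmd(self, cmd: str) -> None:
        raise RuntimeError("should be mocked in tests")

    def _install_requirements_txt(self, requirements_file: str) -> None:
        # Assumed post-fix behaviour as described in this PR: no -U flag.
        self._run_shell_cmd(f"pip install -r {requirements_file}")


def check_no_upgrade_flag() -> str:
    """Run the install with a mocked shell runner and return the command."""
    manager = FakeRuntimeEnvironmentManager()
    with mock.patch.object(manager, "_run_shell_cmd") as run_cmd:
        manager._install_requirements_txt("requirements.txt")
    run_cmd.assert_called_once()
    issued = run_cmd.call_args.args[0]
    # Fail if any token of the command is an upgrade flag.
    assert "-U" not in issued.split()
    assert "--upgrade" not in issued.split()
    return issued


print(check_no_upgrade_flag())
```

Checking tokens rather than substrings avoids false positives from file names that happen to contain `-U`.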

The @remote decorator injects sagemaker>=3.2.0,<4.0.0 into a
requirements file and installs it with pip install -r ... -U. The -U
flag forces pip to upgrade sagemaker to the latest version within the
range even if a compatible version is already installed. This creates a
version mismatch between the client (which serialized the function at
version 3.5.0) and the container (which deserialized at 3.11.0),
causing DeserializationError.

Remove the -U flag from _install_requirements_txt and
_install_req_txt_in_conda_env in both sagemaker-core and sagemaker-train.
The version constraint already ensures compatibility; forced upgrades are
not needed and actively harmful when the serialization format changes
between minor versions.

Fixes aws#5872

Signed-off-by: nileshpatil6 <technil6436@gmail.com>

@Mattral Mattral left a comment


Thanks for addressing the forced upgrade issue. That makes sense given the DeserializationError. I am trying to figure this out and learn something from your PR.
One question: without the -U flag, how do we ensure environments don’t silently stick to an older but still compatible 3.x version that might miss important bug fixes? Curious if that trade‑off was considered during testing.

I wonder about the long‑term maintenance angle: do we risk jobs running with older 3.x versions that technically satisfy the constraint but lack fixes? The contributing guidelines emphasize guarding against future breaking changes; would an integration test around version pinning help catch regressions?

@nileshpatil6
Author

Good question. The trade-off is intentional, and it comes down to what the auto-injected constraint is for.

The @remote flow needs the container to deserialize a function that the client serialized. cloudpickle deserialization is sensitive to the sagemaker version: if the client serialized with 3.5.0 and the container deserializes with 3.11.0, you can get a DeserializationError even though both are valid 3.x releases. So for this path, matching the client environment matters more than being on the newest patch.

With -U, pip upgrades to the newest version in >=3.2.0,<4.0.0 on every job, which is exactly what causes the drift away from the client version. Without -U, pip still installs sagemaker if it is missing, and still respects the >=3.2.0,<4.0.0 floor, so an environment cannot silently stay on an incompatible or pre-3.2 version. It just stops force-upgrading past whatever compatible version is already present.
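
To make the two behaviours concrete, here is a deliberately simplified model of which version ends up in the container. It is a sketch, not pip's resolver: `resolve`, `parse`, and `in_range` are hypothetical helpers, the list of available versions is made up apart from the 3.5.0 and 3.11.0 values from the logs, and transitive dependencies are ignored.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int, int]


def parse(v: str) -> Version:
    """Parse a plain 'X.Y.Z' version string into a comparable tuple."""
    major, minor, patch = (int(p) for p in v.split("."))
    return (major, minor, patch)


def in_range(v: Version) -> bool:
    """The auto-injected constraint: >=3.2.0,<4.0.0."""
    return (3, 2, 0) <= v < (4, 0, 0)


def resolve(installed: Optional[str], available: List[str], upgrade: bool) -> str:
    """Return the version present after the install (simplified model)."""
    # Without -U, an installed version that satisfies the constraint is kept.
    if not upgrade and installed is not None and in_range(parse(installed)):
        return installed
    # Otherwise (missing, out of range, or -U given) take the newest match.
    candidates = [v for v in available if in_range(parse(v))]
    if not candidates:
        raise ValueError("no available version satisfies the constraint")
    return max(candidates, key=parse)


# Illustrative index contents; only 3.5.0 and 3.11.0 come from the logs.
available = ["3.1.0", "3.5.0", "3.11.0", "4.0.0"]

print(resolve("3.5.0", available, upgrade=True))   # forced upgrade
print(resolve("3.5.0", available, upgrade=False))  # compatible version kept
print(resolve("3.1.0", available, upgrade=False))  # below the floor, replaced
print(resolve(None, available, upgrade=False))     # missing, installed fresh
```

In this model the floor still applies without `-U`: a pre-3.2 or missing install is replaced, while a version already inside the range is left alone.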

If a user wants a specific newer version in the container, they can still pin it explicitly via dependencies (a requirements.txt or a pinned sagemaker==x.y.z), which takes precedence. So this change removes the surprise upgrade without removing the floor or the ability to opt into a newer version.


Development

Successfully merging this pull request may close these issues.

@remote decorator forces sagemaker upgrade via pip install -U, causing DeserializationError
