Skip to content

fix(network-volume): re-resolve stale cached volume id by name#348

Merged
deanq merged 2 commits into
mainfrom
deanquinanola/sls-337-networkvolume-stale-cached-volume-id-hard-fails-instead-of
Jun 29, 2026
Merged

fix(network-volume): re-resolve stale cached volume id by name#348
deanq merged 2 commits into
mainfrom
deanquinanola/sls-337-networkvolume-stale-cached-volume-id-hard-fails-instead-of

Conversation

@deanq

@deanq deanq commented Jun 29, 2026

Copy link
Copy Markdown
Member

Summary

NetworkVolume(name=...) caches its resolved volume id locally (.flash/resources.pkl, .runpod/resources.pkl). If the volume was later deleted on Runpod out of band, the stale id read as "deployed" and flowed straight into GraphQL provisioning, hard-failing with Network volume "<id>" not found (and a 503 in live mode) instead of re-resolving by name like a fresh checkout does.

Fixes SLS-337.

Root cause

NetworkVolume.is_deployed() only checked id is not None. Both early-return sites — ResourceManager.get_or_deploy_resource and NetworkVolume._do_deploy — trusted that result, so a stale cached id was never validated against the live API.

Changes

  • is_deployed() now validates the cached id against the live API (list_network_volumes), mirroring ServerlessResource.is_deployed(). A stale id reads as not-deployed, so both early-return sites fall back to resolve-by-name + create-if-missing, matching first-run behavior. Transient API errors return False, deferring to the idempotent resolve-by-name path.
  • _do_deploy() clears the stale id before recreating (so it isn't sent in the create payload) and raises a clear error for an unrecoverable id-only volume (no name to re-resolve by).

Test plan

  • 6 new unit tests in tests/unit/resources/test_network_volume.py (TDD, written failing-first): stale-id reads as not-deployed, valid-id reads as deployed, no-id short-circuits without an API call, _do_deploy recreates by name when the cached id is stale (asserts the create payload drops the stale id), reuses on a valid id, and raises for a stale id-only volume.
  • make quality-check: ruff format + lint clean, full suite 2568 passed under xdist, coverage 86% (threshold 65%).

A cached NetworkVolume id (.flash/resources.pkl, .runpod/resources.pkl)
read as deployed even after the volume was deleted on Runpod, so the
stale id flowed into GraphQL provisioning and hard-failed with
"Network volume \"<id>\" not found".

is_deployed() now validates the cached id against the live API, mirroring
ServerlessResource.is_deployed(). A stale id reads as not-deployed, so
both early-return sites (ResourceManager and _do_deploy) fall back to
resolve-by-name and create-if-missing, matching fresh-checkout behavior.
_do_deploy() clears the stale id before recreating so it is not sent in
the create payload, and raises a clear error for an unrecoverable
id-only volume.

Fixes SLS-337
@capy-ai

capy-ai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Capy auto-review is paused for this organization because the usage-cycle auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@promptless

promptless Bot commented Jun 29, 2026

Copy link
Copy Markdown

Promptless prepared a documentation update related to this change.

Triggered by flash PR #348 (fix: re-resolve stale cached volume id by name, SLS-337)

Since this fix introduces a user-visible behavior difference between referencing a network volume by name (auto-recovers when deleted out of band) vs by id, the suggestion adds a subsection to the storage docs explaining how Flash resolves volumes by name or ID, the cached ID revalidation, and the new auto-recovery behavior.

Review: Document how Flash resolves network volumes by name or ID

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a failure mode where NetworkVolume could treat a stale locally-cached volume ID as “deployed” even after the volume was deleted out-of-band on Runpod, causing downstream provisioning to hard-fail instead of re-resolving by name (SLS-337).

Changes:

  • Updated NetworkVolume.is_deployed() to validate the cached id against the live Runpod API (list_network_volumes) and treat missing/stale IDs as not deployed.
  • Updated _do_deploy() to (a) clear a stale id before creating a new volume by name, and (b) raise a clear error for id-only volumes that can’t be re-resolved.
  • Added unit tests covering stale/valid/no-id is_deployed() behavior and _do_deploy() recreation/reuse/error paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/runpod_flash/core/resources/network_volume.py Validates cached volume IDs against the live API, clears stale IDs before create, and errors for unrecoverable id-only volumes.
tests/unit/resources/test_network_volume.py Adds targeted async unit tests for the new stale-ID validation and deploy behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@deanq deanq merged commit efb3cc1 into main Jun 29, 2026
8 checks passed
@deanq deanq deleted the deanquinanola/sls-337-networkvolume-stale-cached-volume-id-hard-fails-instead-of branch June 29, 2026 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants