fix(network-volume): re-resolve stale cached volume id by name#348
Conversation
A cached NetworkVolume id (.flash/resources.pkl, .runpod/resources.pkl) read as deployed even after the volume was deleted on Runpod, so the stale id flowed into GraphQL provisioning and hard-failed with "Network volume \"<id>\" not found". is_deployed() now validates the cached id against the live API, mirroring ServerlessResource.is_deployed(). A stale id reads as not-deployed, so both early-return sites (ResourceManager and _do_deploy) fall back to resolve-by-name and create-if-missing, matching fresh-checkout behavior. _do_deploy() clears the stale id before recreating so it is not sent in the create payload, and raises a clear error for an unrecoverable id-only volume. Fixes SLS-337
|
Capy auto-review is paused for this organization because the usage-cycle auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
Promptless prepared a documentation update related to this change. Triggered by flash PR #348 (fix: re-resolve stale cached volume id by name, SLS-337) Since this fix introduces a user-visible behavior difference between referencing a network volume by Review: Document how Flash resolves network volumes by name or ID |
There was a problem hiding this comment.
Pull request overview
This PR fixes a failure mode where NetworkVolume could treat a stale locally-cached volume ID as “deployed” even after the volume was deleted out-of-band on Runpod, causing downstream provisioning to hard-fail instead of re-resolving by name (SLS-337).
Changes:
- Updated
NetworkVolume.is_deployed()to validate the cachedidagainst the live Runpod API (list_network_volumes) and treat missing/stale IDs as not deployed. - Updated
_do_deploy()to (a) clear a staleidbefore creating a new volume by name, and (b) raise a clear error for id-only volumes that can’t be re-resolved. - Added unit tests covering stale/valid/no-id
is_deployed()behavior and_do_deploy()recreation/reuse/error paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/runpod_flash/core/resources/network_volume.py |
Validates cached volume IDs against the live API, clears stale IDs before create, and errors for unrecoverable id-only volumes. |
tests/unit/resources/test_network_volume.py |
Adds targeted async unit tests for the new stale-ID validation and deploy behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ched-volume-id-hard-fails-instead-of
Summary
NetworkVolume(name=...)caches its resolved volume id locally (.flash/resources.pkl,.runpod/resources.pkl). If the volume was later deleted on Runpod out of band, the stale id read as "deployed" and flowed straight into GraphQL provisioning, hard-failing withNetwork volume "<id>" not found(and a 503 in live mode) instead of re-resolving by name like a fresh checkout does.Fixes SLS-337.
Root cause
NetworkVolume.is_deployed()only checkedid is not None. Both early-return sites —ResourceManager.get_or_deploy_resourceandNetworkVolume._do_deploy— trusted that result, so a stale cached id was never validated against the live API.Changes
is_deployed()now validates the cached id against the live API (list_network_volumes), mirroringServerlessResource.is_deployed(). A stale id reads as not-deployed, so both early-return sites fall back to resolve-by-name + create-if-missing, matching first-run behavior. Transient API errors returnFalse, deferring to the idempotent resolve-by-name path._do_deploy()clears the stale id before recreating (so it isn't sent in the create payload) and raises a clear error for an unrecoverable id-only volume (no name to re-resolve by).Test plan
tests/unit/resources/test_network_volume.py(TDD, written failing-first): stale-id reads as not-deployed, valid-id reads as deployed, no-id short-circuits without an API call,_do_deployrecreates by name when the cached id is stale (asserts the create payload drops the stale id), reuses on a valid id, and raises for a stale id-only volume.make quality-check: ruff format + lint clean, full suite 2568 passed under xdist, coverage 86% (threshold 65%).