fix: actionable error when model weights bucket is inaccessible #842

Open
dm36 wants to merge 3 commits into main from dhruv/s3-checkpoint-access-error

dm36 (Collaborator) commented Jun 22, 2026

Summary

Diagnosed from a real failed deploy: an endpoint create returned the opaque {"error":"Internal error occurred. Our team has been notified.","request_id":...} 500. The logged traceback showed the true cause was an S3 AccessDenied during model-weight discovery:

create_model_endpoint → create_vllm_bundle → load_model_weights_sub_commands_s3
  → S3LLMArtifactGateway.list_files → boto3 ListObjects
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling
ListObjects ... not authorized to perform: s3:ListBucket on "arn:aws:s3:::scale-ml"

list_files let the raw ClientError propagate. Because it isn't a DomainException, it bypassed every per-route exception handler and fell through to the global catch-all 500 — so the deploy reported a generic error and the real cause (a bucket the deployment role can't read) was only visible in logs by request_id.
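The fall-through described above can be illustrated with a minimal sketch. It is not the server's actual routing code: `call_route`, `FakeClientError`, and the two handler functions are hypothetical names, and the exception classes are stand-ins (in particular, `ObjectHasInvalidValueException` being a `DomainException` subclass is an assumption made for illustration).

```python
class DomainException(Exception):
    """Stand-in for the server's domain exception base class (assumed shape)."""


class ObjectHasInvalidValueException(DomainException):
    """Stand-in for the domain error that the route handlers map to 400."""


class FakeClientError(Exception):
    """Stand-in for botocore's ClientError, which is not a DomainException."""


GENERIC_500 = {"error": "Internal error occurred. Our team has been notified."}


def call_route(handler):
    """Hypothetical dispatcher: a per-route mapping first, then a global catch-all."""
    try:
        return 200, handler()
    except ObjectHasInvalidValueException as exc:
        # Per-route handler: known domain errors become actionable 400s.
        return 400, {"error": str(exc)}
    except Exception:
        # Global catch-all: anything else is reported as the opaque 500.
        return 500, dict(GENERIC_500)


def raises_raw_client_error():
    # Before the change: the raw S3 error escapes the gateway untranslated.
    raise FakeClientError("AccessDenied when calling ListObjects")


def raises_domain_error():
    # After the change: the gateway raises a domain error with a path-only message.
    raise ObjectHasInvalidValueException(
        "Could not read model weights at 's3://example-bucket/models/checkpoint'."
    )
```

In this sketch, `call_route(raises_raw_client_error)` yields the generic 500 body, while `call_route(raises_domain_error)` yields a 400 whose message names the path.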

Change

Catch ClientError in S3LLMArtifactGateway.list_files. For access/missing errors (AccessDenied, NoSuchBucket, 403, 404), raise ObjectHasInvalidValueException with a path-only message:

Could not read model weights at '<path>'. Check that the path exists and that the deployment has read access to the bucket.

The create/update endpoint handlers already map ObjectHasInvalidValueException → 400, so this surfaces as an actionable client error with no new wiring. The IAM role ARN from the raw error is never surfaced (only the path); full detail is still logged server-side. Non-access ClientErrors (e.g. throttling/SlowDown) are re-raised unchanged so transient/internal failures still behave as before.
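A minimal sketch of the translation described in this section, not the code in the diff. Assumptions: `list_objects` is a hypothetical injected callable standing in for the boto3 listing call, `ObjectHasInvalidValueException` is a local stand-in for the server's class, the check compares the `Error.Code` string of the response (the real change may inspect the HTTP status instead), and a small `ClientError` stand-in is defined only if botocore is not installed.

```python
try:
    from botocore.exceptions import ClientError
except ImportError:
    class ClientError(Exception):
        """Stand-in mirroring botocore's ClientError(error_response, operation_name)."""

        def __init__(self, error_response, operation_name):
            super().__init__(f"{operation_name} failed")
            self.response = error_response
            self.operation_name = operation_name


class ObjectHasInvalidValueException(ValueError):
    """Stand-in for the server's domain exception (mapped to 400 by the routes)."""


# Codes treated as "path missing or not readable"; everything else passes through.
ACCESS_OR_MISSING_CODES = frozenset({"AccessDenied", "NoSuchBucket", "403", "404"})


def user_message(path):
    """Path-only message: nothing from the raw S3 error is interpolated."""
    return (
        f"Could not read model weights at '{path}'. Check that the path exists "
        "and that the deployment has read access to the bucket."
    )


def list_files(list_objects, path):
    """List keys under `path`, translating access/missing errors only."""
    try:
        return list_objects(path)
    except ClientError as exc:
        code = exc.response.get("Error", {}).get("Code", "")
        if code in ACCESS_OR_MISSING_CODES:
            raise ObjectHasInvalidValueException(user_message(path)) from exc
        # Throttling and other non-access failures keep their original type.
        raise
```

Because the message is built from the path alone, the role ARN in the AWS error text cannot leak into the response, and `raise ... from exc` keeps the original error attached as `__cause__` for server-side debugging.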

Test plan

  • list_files success path (baseline).
  • list_files with S3 AccessDenied raises ObjectHasInvalidValueException, message contains the path and does not contain the IAM role ARN.
  • black / ruff / isort clean on changed files.

Context

Companion to #840 (async-deploy status_reason) and #841 (middleware error surfacing). This one fixes the specific opaque-500 at its source — the most common shape being an inaccessible/incorrect checkpoint bucket.

🤖 Generated with Claude Code

Greptile Summary

  • Converts selected S3 model-weight access and missing-path ClientErrors into sanitized ObjectHasInvalidValueExceptions.
  • Keeps transient or non-access S3 ClientErrors propagating unchanged.
  • Adds unit coverage for successful listing, sanitized access failures, config download failures, and passthrough behavior.

Confidence Score: 5/5

The change is narrowly scoped to S3 listing error translation and preserves existing behavior for successful and transient failure paths.

Unit coverage exercises the intended sanitized client error behavior, unchanged passthrough for non-access S3 errors, and the existing successful listing path.

T-Rex Logs

What T-Rex did

  • Ran the base-commit S3 list-files error contract to capture the command, working directory, full script, and the initial output, where AccessDenied and NoSuchBucket appeared as ClientError.
  • Ran the head-commit S3 list-files error contract to compare against the base run, where AccessDenied and NoSuchBucket appeared as ObjectHasInvalidValueException and referenced the s3://scale-ml/models/checkpoint path.
  • Both executions exited with exit code 0.
  • The head-commit run returned the expected keys ['models/checkpoint/a.bin', 'models/checkpoint/sub/b.json'].

Ran code and verified through T-Rex

Reviews (3): Last reviewed commit: "fix: also map inaccessible model config ..."

When the configured checkpoint path points at an S3 bucket the deployment's
role can't read (or that doesn't exist), `list_files` raised a raw
botocore ClientError. It wasn't a DomainException, so it bypassed every
per-route handler and surfaced as an opaque 500 ("Internal error occurred"),
leaving the actual cause (an S3 AccessDenied on weight discovery) visible
only in logs.

Catch ClientError in S3LLMArtifactGateway.list_files and, for access/missing
errors (AccessDenied / NoSuchBucket / 403 / 404), raise
ObjectHasInvalidValueException with a path-only message — which the
create/update endpoint handlers already map to a 400. The IAM role ARN from
the raw error is never surfaced; full detail is still logged. Other client
errors (e.g. throttling) are re-raised unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread on model-engine/model_engine_server/infra/gateways/s3_llm_artifact_gateway.py (outdated)
dm36 and others added 2 commits June 22, 2026 13:43
Address review: the access-error branch dropped the raw S3 failure detail
(AWS message, request context, IAM role, traceback) that operators rely on.
Log the caught ClientError with exc_info=True before raising the sanitized,
path-only user-facing error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the S3 checkpoint-access handling to get_model_config, which downloads
the checkpoint's config.json and previously let a raw ClientError (e.g.
NoSuchKey for a missing config, or AccessDenied) propagate to the opaque 500.

Extract the translation into a shared _raise_if_checkpoint_inaccessible helper
used by both list_files and get_model_config: it logs the full S3 error
server-side and raises a sanitized, path-only ObjectHasInvalidValueException
(mapped to 400 by the route handlers), re-raising non-access errors unchanged.
Also covers NoSuchKey.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
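
The shared helper described in the last two commits might look roughly like the sketch below. Only the helper name `_raise_if_checkpoint_inaccessible` comes from the commit message; its signature, the logger name, the injected `s3_list` / `s3_read_json` callables, the `config.json` path construction, and the local exception stand-ins are all assumptions for illustration.

```python
import logging

try:
    from botocore.exceptions import ClientError
except ImportError:
    class ClientError(Exception):
        """Stand-in mirroring botocore's ClientError(error_response, operation_name)."""

        def __init__(self, error_response, operation_name):
            super().__init__(f"{operation_name} failed")
            self.response = error_response
            self.operation_name = operation_name

logger = logging.getLogger("checkpoint_access_sketch")  # hypothetical logger name


class ObjectHasInvalidValueException(ValueError):
    """Stand-in for the server's domain exception (mapped to 400 by the routes)."""


INACCESSIBLE_CODES = frozenset(
    {"AccessDenied", "NoSuchBucket", "NoSuchKey", "403", "404"}
)


def _raise_if_checkpoint_inaccessible(exc, path):
    """Raise a sanitized error for access/missing codes; otherwise return."""
    code = exc.response.get("Error", {}).get("Code", "")
    if code not in INACCESSIBLE_CODES:
        return
    # Called from inside an except block, so exc_info=True records the full
    # S3 error and traceback server-side before the detail is dropped.
    logger.warning("S3 error reading checkpoint at %s", path, exc_info=True)
    raise ObjectHasInvalidValueException(
        f"Could not read model weights at '{path}'. Check that the path exists "
        "and that the deployment has read access to the bucket."
    ) from exc


def list_files(s3_list, path):
    try:
        return s3_list(path)
    except ClientError as exc:
        _raise_if_checkpoint_inaccessible(exc, path)
        raise  # non-access errors propagate unchanged


def get_model_config(s3_read_json, path):
    config_path = f"{path.rstrip('/')}/config.json"
    try:
        return s3_read_json(config_path)
    except ClientError as exc:
        _raise_if_checkpoint_inaccessible(exc, config_path)
        raise  # non-access errors propagate unchanged
```

Keeping the classification, logging, and message in one place means both call sites stay two lines long and cannot drift apart in which codes they treat as client errors.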