fix: actionable error when model weights bucket is inaccessible#842
Open
dm36 wants to merge 3 commits into
Open
fix: actionable error when model weights bucket is inaccessible#842dm36 wants to merge 3 commits into
dm36 wants to merge 3 commits into
Conversation
When the configured checkpoint path points at an S3 bucket the deployment's
role can't read (or that doesn't exist), `list_files` raised a raw
botocore ClientError. It wasn't a DomainException, so it bypassed every
per-route handler and surfaced as an opaque 500 ("Internal error occurred")
— the actual cause (an S3 AccessDenied on weight discovery) only visible in
logs.
Catch ClientError in S3LLMArtifactGateway.list_files and, for access/missing
errors (AccessDenied / NoSuchBucket / 403 / 404), raise
ObjectHasInvalidValueException with a path-only message — which the
create/update endpoint handlers already map to a 400. The IAM role ARN from
the raw error is never surfaced; full detail is still logged. Other client
errors (e.g. throttling) are re-raised unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review: the access-error branch dropped the raw S3 failure detail (AWS message, request context, IAM role, traceback) that operators rely on. Log the caught ClientError with exc_info=True before raising the sanitized, path-only user-facing error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the S3 checkpoint-access handling to get_model_config, which downloads the checkpoint's config.json and previously let a raw ClientError (e.g. NoSuchKey for a missing config, or AccessDenied) propagate to the opaque 500. Extract the translation into a shared _raise_if_checkpoint_inaccessible helper used by both list_files and get_model_config: it logs the full S3 error server-side and raises a sanitized, path-only ObjectHasInvalidValueException (mapped to 400 by the route handlers), re-raising non-access errors unchanged. Also covers NoSuchKey. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Diagnosed from a real failed deploy: an endpoint create returned the opaque
{"error":"Internal error occurred. Our team has been notified.","request_id":...}500. The logged traceback showed the true cause was an S3AccessDeniedduring model-weight discovery:list_fileslet the rawClientErrorpropagate. Because it isn't aDomainException, it bypassed every per-route exception handler and fell through to the global catch-all 500 — so the deploy reported a generic error and the real cause (a bucket the deployment role can't read) was only visible in logs byrequest_id.Change
Catch
ClientErrorinS3LLMArtifactGateway.list_files. For access/missing errors (AccessDenied,NoSuchBucket,403,404), raiseObjectHasInvalidValueExceptionwith a path-only message:The create/update endpoint handlers already map
ObjectHasInvalidValueException→ 400, so this surfaces as an actionable client error with no new wiring. The IAM role ARN from the raw error is never surfaced (only the path); full detail is still logged server-side. Non-accessClientErrors (e.g. throttling/SlowDown) are re-raised unchanged so transient/internal failures still behave as before.Test plan
list_filessuccess path (baseline).list_fileswith S3AccessDeniedraisesObjectHasInvalidValueException, message contains the path and does not contain the IAM role ARN.Context
Companion to #840 (async-deploy
status_reason) and #841 (middleware error surfacing). This one fixes the specific opaque-500 at its source — the most common shape being an inaccessible/incorrect checkpoint bucket.🤖 Generated with Claude Code
Greptile Summary
ClientErrors into sanitizedObjectHasInvalidValueExceptions.ClientErrors propagating unchanged.Confidence Score: 5/5
The change is narrowly scoped to S3 listing error translation and preserves existing behavior for successful and transient failure paths.
Unit coverage exercises the intended sanitized client error behavior, unchanged passthrough for non-access S3 errors, and the existing successful listing path.
What T-Rex did
Reviews (3): Last reviewed commit: "fix: also map inaccessible model config ..." | Re-trigger Greptile