feat: record and expose status_reason on model endpoints#839
Conversation
When an endpoint build fails (status UPDATE_FAILED), the cause was only logged and discarded — the GET model-endpoint API exposed status but no reason, so downstream consumers (e.g. SGP) could only show a generic failure message. Thread a status_reason through the stack so the failure cause is persisted and returned by the API: - add nullable status_reason column to the endpoints table (+ Alembic migration) - add status_reason to the ORM model, ModelEndpointRecord entity, the ORM->entity translation, and the create/update repository methods - expose status_reason on GetModelEndpointV1Response and map it in the use case - capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns to READY Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| status_reason=_sanitize_status_reason(str(error)), | ||
| ) |
There was a problem hiding this comment.
Unfiltered
str(error) may leak sensitive infrastructure details
The broad except Exception block catches all errors from infrastructure operations (Kubernetes API calls, AWS SDK calls, database operations, etc.). Passing str(error) directly as status_reason can expose internal connection strings, internal IP addresses, resource names, or database schema details to API consumers via the public GET /model-endpoint response. The _sanitize_status_reason helper only collapses whitespace and caps length — it performs no content filtering. Consider catching specific known exception types and mapping them to safe, user-facing messages, or stripping exception types from a known-internal class hierarchy before persisting the message.
Prompt To Fix With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py
Line: 367-368
Comment:
**Unfiltered `str(error)` may leak sensitive infrastructure details**
The broad `except Exception` block catches all errors from infrastructure operations (Kubernetes API calls, AWS SDK calls, database operations, etc.). Passing `str(error)` directly as `status_reason` can expose internal connection strings, internal IP addresses, resource names, or database schema details to API consumers via the public `GET /model-endpoint` response. The `_sanitize_status_reason` helper only collapses whitespace and caps length — it performs no content filtering. Consider catching specific known exception types and mapping them to safe, user-facing messages, or stripping exception types from a known-internal class hierarchy before persisting the message.
How can I resolve this? If you propose a fix, please make it concise.| # status_reason is handled separately from dict_not_none so an explicit | ||
| # clear (empty string -> NULL) is possible when an endpoint recovers to a | ||
| # healthy state, while omitting it (None) leaves any existing reason intact. | ||
| if status_reason is not None: | ||
| update_kwargs["status_reason"] = status_reason or None |
There was a problem hiding this comment.
Empty-string-as-sentinel is a non-obvious API contract
The update method uses None to mean "leave the existing value unchanged" and "" to mean "clear to NULL". This is documented in a comment but differs from every other nullable parameter in the same method signature (e.g., status, destination, metadata) where None always means "do not update". A future caller who passes status_reason=None intending to clear the field will silently leave a stale failure reason. Consider using a dedicated sentinel (e.g., a module-level CLEAR = object()) or a separate clear_status_reason: bool = False parameter to make the intent explicit.
Prompt To Fix With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/infra/repositories/db_model_endpoint_record_repository.py
Line: 342-346
Comment:
**Empty-string-as-sentinel is a non-obvious API contract**
The update method uses `None` to mean "leave the existing value unchanged" and `""` to mean "clear to NULL". This is documented in a comment but differs from every other nullable parameter in the same method signature (e.g., `status`, `destination`, `metadata`) where `None` always means "do not update". A future caller who passes `status_reason=None` intending to clear the field will silently leave a stale failure reason. Consider using a dedicated sentinel (e.g., a module-level `CLEAR = object()`) or a separate `clear_status_reason: bool = False` parameter to make the intent explicit.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
Superseded by #840, opened from a branch directly on this repo now that I have write access. Closing this fork-based PR. |
Summary
When an endpoint build fails (status
UPDATE_FAILED), the cause is only logged and discarded — the GET model-endpoint API exposesstatusbut no reason. A downstream consumer (Scale's SGP / GenAI platform) polls this endpoint to surface deploy results to users, so today a failed deploy can only show a generic "failed" message with no actionable cause (CUDA OOM, bad checkpoint, image pull failure, etc.).This threads a
status_reasonthrough the stack so the failure cause is persisted and returned by the API:status_reasoncolumn to theendpointstable (+ Alembic migration).status_reasonto the ORM model,ModelEndpointRecordentity, the ORM→entity translation, and the create/update repository methods.status_reasononGetModelEndpointV1Responseand map it in the use case.UPDATE_FAILEDsites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns toREADY.Test plan
status_reasonflows through toGetModelEndpointV1Response.Notes
Opened from a fork while repo write access is pending. Happy to adjust field naming / sanitization length / which call sites capture a reason to match team conventions.
🤖 Generated with Claude Code
Greptile Summary
This PR threads a
status_reasonfield through the full stack — DB column, ORM model, domain entity, repository, and API response DTO — so that human-readable failure causes (e.g., CUDA OOM, image pull errors) are persisted onUPDATE_FAILEDendpoints and returned byGET /model-endpoint.Textcolumnstatus_reasontohosted_model_inference.endpointswith a correct Alembic migration chained to the previous revision.live_endpoint_builder_servicerecordsstr(error)(whitespace-collapsed, 500-char capped) on the generic build-failure path and a static message on image-build failure;live_model_endpoint_servicerecords a static message on infra-delete failure.status_reason=""inupdate_model_endpoint_recordis converted to SQLNULLso the field is cleared when an endpoint returns toREADY; the semantics differ from every other nullable parameter in that signature (None= leave unchanged,""= clear).Confidence Score: 3/5
Safe to merge with the understanding that raw exception messages from infrastructure operations will become visible to endpoint owners via the GET API; review the str(error) handling before exposing to external users.
The data flow is clean and the migration is correctly chained. However, the build-failure path passes str(error) from a broad exception handler directly into a user-visible API field. Exceptions from Kubernetes, AWS, or database layers can contain connection strings, internal IPs, or resource names that should not be surfaced in a public API response. The image-build and delete-failure paths use static strings (safe), but the generic catch-all path is the one most likely to fire in practice.
model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py — the generic exception handler that serialises str(error) into status_reason
Security Review
str(error)(live_endpoint_builder_service.py): The broadexcept Exceptionblock passesstr(error)directly asstatus_reason. This can surface internal infrastructure details — Kubernetes resource names, internal IP addresses, AWS ARNs, database connection strings — to API consumers through the publicGET /model-endpointresponse. Sanitization only normalises whitespace and caps length; no content filtering is applied.Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant Builder as LiveEndpointBuilderService participant Repo as DbModelEndpointRecordRepository participant DB as PostgreSQL (endpoints) participant API as GET /model-endpoint Builder->>Repo: "update_model_endpoint_record(status=READY, status_reason="")" Repo->>DB: "UPDATE SET status_reason=NULL" Note over DB: status_reason cleared on success Builder->>Repo: "update_model_endpoint_record(status=UPDATE_FAILED, status_reason=sanitized_error)" Repo->>DB: "UPDATE SET status_reason='CUDA out of memory…'" Note over DB: reason persisted on failure API->>DB: "SELECT * FROM endpoints WHERE id=?" DB-->>API: "{status: UPDATE_FAILED, status_reason: 'CUDA out of memory…'}" API-->>API: "GetModelEndpointV1Response{status_reason: '…'}"%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant Builder as LiveEndpointBuilderService participant Repo as DbModelEndpointRecordRepository participant DB as PostgreSQL (endpoints) participant API as GET /model-endpoint Builder->>Repo: "update_model_endpoint_record(status=READY, status_reason="")" Repo->>DB: "UPDATE SET status_reason=NULL" Note over DB: status_reason cleared on success Builder->>Repo: "update_model_endpoint_record(status=UPDATE_FAILED, status_reason=sanitized_error)" Repo->>DB: "UPDATE SET status_reason='CUDA out of memory…'" Note over DB: reason persisted on failure API->>DB: "SELECT * FROM endpoints WHERE id=?" DB-->>API: "{status: UPDATE_FAILED, status_reason: 'CUDA out of memory…'}" API-->>API: "GetModelEndpointV1Response{status_reason: '…'}"Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat: record and expose status_reason on..." | Re-trigger Greptile