
feat: record and expose status_reason on model endpoints #839

Closed
dm36 wants to merge 1 commit into scaleapi:main from dm36:dhruv/endpoint-status-reason


dm36 commented Jun 19, 2026

Collaborator

Summary

When an endpoint build fails (status UPDATE_FAILED), the cause is only logged and discarded — the GET model-endpoint API exposes status but no reason. A downstream consumer (Scale's SGP / GenAI platform) polls this endpoint to surface deploy results to users, so today a failed deploy can only show a generic "failed" message with no actionable cause (CUDA OOM, bad checkpoint, image pull failure, etc.).

This threads a status_reason through the stack so the failure cause is persisted and returned by the API:

  • Add nullable status_reason column to the endpoints table (+ Alembic migration).
  • Add status_reason to the ORM model, ModelEndpointRecord entity, the ORM→entity translation, and the create/update repository methods.
  • Expose status_reason on GetModelEndpointV1Response and map it in the use case.
  • Capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns to READY.
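The entity-to-response mapping in the bullets above can be sketched with simplified stand-ins. The two dataclasses are hypothetical reductions of `ModelEndpointRecord` and `GetModelEndpointV1Response` (the real classes carry many more fields), and `to_response` is an illustrative name rather than the actual use-case method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointRecordSketch:
    """Hypothetical stand-in for the ModelEndpointRecord entity."""

    id: str
    status: str
    status_reason: Optional[str] = None  # None when no failure is recorded


@dataclass
class EndpointResponseSketch:
    """Hypothetical stand-in for GetModelEndpointV1Response."""

    id: str
    status: str
    status_reason: Optional[str] = None


def to_response(record: EndpointRecordSketch) -> EndpointResponseSketch:
    # Copy the persisted reason onto the API response unchanged.
    return EndpointResponseSketch(
        id=record.id,
        status=record.status,
        status_reason=record.status_reason,
    )
```

A consumer polling the endpoint would then read `status_reason` alongside `status` and fall back to a generic message when it is `None`.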

Test plan

  • Unit test: a failed endpoint's status_reason flows through to GetModelEndpointV1Response.
  • Maintainer verification: full unit suite + Alembic migration run (the author couldn't install deps locally — the repo's pip registry requires CodeArtifact auth; changes are byte-compiled clean and logic-validated standalone).

Notes

Opened from a fork while repo write access is pending. Happy to adjust field naming / sanitization length / which call sites capture a reason to match team conventions.

🤖 Generated with Claude Code

Greptile Summary

This PR threads a status_reason field through the full stack — DB column, ORM model, domain entity, repository, and API response DTO — so that human-readable failure causes (e.g., CUDA OOM, image pull errors) are persisted on UPDATE_FAILED endpoints and returned by GET /model-endpoint.

  • Schema + migration: adds a nullable Text column status_reason to hosted_model_inference.endpoints with a correct Alembic migration chained to the previous revision.
  • Capture sites: live_endpoint_builder_service records str(error) (whitespace-collapsed, 500-char capped) on the generic build-failure path and a static message on image-build failure; live_model_endpoint_service records a static message on infra-delete failure.
  • Clearing: passing status_reason="" to update_model_endpoint_record is converted to SQL NULL, so the field is cleared when an endpoint returns to READY. Unlike every other nullable parameter in that signature, status_reason has two distinct "empty" inputs: None leaves the stored value unchanged, while "" clears it.
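For reference, the sanitization described above (whitespace collapsing plus a 500-character cap) could look roughly like the following. This is a minimal sketch, not the PR's `_sanitize_status_reason`: the function name, the constant, and the order of operations (collapse first, then truncate) are assumptions.

```python
import re

# The 500-character cap comes from the review summary; the constant name is assumed.
MAX_STATUS_REASON_LENGTH = 500


def sanitize_status_reason_sketch(raw: str, max_length: int = MAX_STATUS_REASON_LENGTH) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    collapsed = re.sub(r"\s+", " ", raw).strip()
    # Cap the length so a long traceback cannot bloat the column or the response.
    return collapsed[:max_length]
```

Note that nothing here inspects the content of the message, which is the gap the security review points at.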

Confidence Score: 3/5

Safe to merge with the understanding that raw exception messages from infrastructure operations will become visible to endpoint owners via the GET API; review the str(error) handling before exposing to external users.

The data flow is clean and the migration is correctly chained. However, the build-failure path passes str(error) from a broad exception handler directly into a user-visible API field. Exceptions from Kubernetes, AWS, or database layers can contain connection strings, internal IPs, or resource names that should not be surfaced in a public API response. The image-build and delete-failure paths use static strings (safe), but the generic catch-all path is the one most likely to fire in practice.

model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py — the generic exception handler that serialises str(error) into status_reason

Security Review

  • Information disclosure via str(error) (live_endpoint_builder_service.py): The broad except Exception block passes str(error) directly as status_reason. This can surface internal infrastructure details — Kubernetes resource names, internal IP addresses, AWS ARNs, database connection strings — to API consumers through the public GET /model-endpoint response. Sanitization only normalises whitespace and caps length; no content filtering is applied.
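One possible shape for the missing content filter is a best-effort redaction pass applied before the reason is persisted. The sketch below is illustrative only: the patterns, placeholders, and function name are assumptions, and a regex deny-list like this cannot be relied on to catch every sensitive value.

```python
import re

# Illustrative patterns only; a real deny-list would need review by the team.
_REDACTIONS = [
    # scheme://... URLs, which covers typical database connection strings
    (re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://\S+"), "[redacted-url]"),
    # AWS ARNs
    (re.compile(r"\barn:aws[a-z-]*:\S+"), "[redacted-arn]"),
    # IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[redacted-ip]"),
]


def redact_infrastructure_details(message: str) -> str:
    # Apply each pattern in order, replacing matches with a fixed placeholder.
    for pattern, placeholder in _REDACTIONS:
        message = pattern.sub(placeholder, message)
    return message
```

Redaction keeps more of the original error text than a static message does, at the cost of being only as good as its pattern list.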

Important Files Changed

| Filename | Overview |
| --- | --- |
| model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py | Captures status_reason at UPDATE_FAILED and READY sites; raw str(error) may expose sensitive infrastructure internals to API consumers |
| model-engine/model_engine_server/infra/repositories/db_model_endpoint_record_repository.py | Adds status_reason to create/update paths with a non-obvious empty-string sentinel for clearing; logic is correct but the API contract is easy to misuse |
| model-engine/model_engine_server/db/migrations/alembic/versions/2026_06_16_1200-c4d5e6f7a8b9_add_status_reason_column.py | Correct nullable Text column addition with matching down_revision pointing to the prior migration |
| model-engine/model_engine_server/common/dtos/model_endpoints.py | Adds optional status_reason field to GetModelEndpointV1Response with clear description |
| model-engine/model_engine_server/domain/entities/model_endpoint_entity.py | Adds nullable status_reason field to ModelEndpointRecord entity with correct default |
| model-engine/model_engine_server/domain/use_cases/model_endpoint_use_cases.py | Threads status_reason from the entity record into the API response DTO |
| model-engine/model_engine_server/infra/repositories/model_endpoint_record_repository.py | Adds status_reason parameter to the abstract repository interface with docstring updates |
| model-engine/model_engine_server/db/models/hosted_model_inference.py | Adds nullable Text status_reason column to the Endpoint ORM model and init signature |
| model-engine/model_engine_server/infra/services/live_model_endpoint_service.py | Sets a static status_reason for the infra-delete failure case; safe message with no dynamic content |
| model-engine/tests/unit/domain/test_model_endpoint_use_cases.py | Adds a focused test verifying status_reason flows through from the entity to the API response |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Builder as LiveEndpointBuilderService
    participant Repo as DbModelEndpointRecordRepository
    participant DB as PostgreSQL (endpoints)
    participant API as GET /model-endpoint

    Builder->>Repo: "update_model_endpoint_record(status=READY, status_reason="")"
    Repo->>DB: "UPDATE SET status_reason=NULL"
    Note over DB: status_reason cleared on success

    Builder->>Repo: "update_model_endpoint_record(status=UPDATE_FAILED, status_reason=sanitized_error)"
    Repo->>DB: "UPDATE SET status_reason='CUDA out of memory…'"
    Note over DB: reason persisted on failure

    API->>DB: "SELECT * FROM endpoints WHERE id=?"
    DB-->>API: "{status: UPDATE_FAILED, status_reason: 'CUDA out of memory…'}"
    API-->>API: "GetModelEndpointV1Response{status_reason: '…'}"


Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py:367-368
**Unfiltered `str(error)` may leak sensitive infrastructure details**

The broad `except Exception` block catches all errors from infrastructure operations (Kubernetes API calls, AWS SDK calls, database operations, etc.). Passing `str(error)` directly as `status_reason` can expose internal connection strings, internal IP addresses, resource names, or database schema details to API consumers via the public `GET /model-endpoint` response. The `_sanitize_status_reason` helper only collapses whitespace and caps length — it performs no content filtering. Consider catching specific known exception types and mapping them to safe, user-facing messages, or stripping exception types from a known-internal class hierarchy before persisting the message.

### Issue 2 of 2
model-engine/model_engine_server/infra/repositories/db_model_endpoint_record_repository.py:342-346
**Empty-string-as-sentinel is a non-obvious API contract**

The update method uses `None` to mean "leave the existing value unchanged" and `""` to mean "clear to NULL". This is documented in a comment but differs from every other nullable parameter in the same method signature (e.g., `status`, `destination`, `metadata`) where `None` always means "do not update". A future caller who passes `status_reason=None` intending to clear the field will silently leave a stale failure reason. Consider using a dedicated sentinel (e.g., a module-level `CLEAR = object()`) or a separate `clear_status_reason: bool = False` parameter to make the intent explicit.

Reviews (1): Last reviewed commit: "feat: record and expose status_reason on..."

Greptile also left 2 inline comments on this PR.

When an endpoint build fails (status UPDATE_FAILED), the cause was only logged
and discarded — the GET model-endpoint API exposed status but no reason, so
downstream consumers (e.g. SGP) could only show a generic failure message.

Thread a status_reason through the stack so the failure cause is persisted and
returned by the API:
- add nullable status_reason column to the endpoints table (+ Alembic migration)
- add status_reason to the ORM model, ModelEndpointRecord entity, the ORM->entity
  translation, and the create/update repository methods
- expose status_reason on GetModelEndpointV1Response and map it in the use case
- capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build
  failure, infra-delete failure) and clear it when an endpoint returns to READY

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +367 to 368
status_reason=_sanitize_status_reason(str(error)),
)


P1 security Unfiltered str(error) may leak sensitive infrastructure details

The broad except Exception block catches all errors from infrastructure operations (Kubernetes API calls, AWS SDK calls, database operations, etc.). Passing str(error) directly as status_reason can expose internal connection strings, internal IP addresses, resource names, or database schema details to API consumers via the public GET /model-endpoint response. The _sanitize_status_reason helper only collapses whitespace and caps length — it performs no content filtering. Consider catching specific known exception types and mapping them to safe, user-facing messages, or stripping exception types from a known-internal class hierarchy before persisting the message.
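The first suggestion, mapping known exception types to safe messages, might be sketched as below. The exception classes, message strings, and function name are all hypothetical; the real builder service would substitute whatever exception types it actually raises.

```python
from typing import List, Tuple, Type


class ImagePullFailure(Exception):
    """Hypothetical exception type, for illustration only."""


class CheckpointLoadFailure(Exception):
    """Hypothetical exception type, for illustration only."""


# Fallback used for anything not explicitly mapped, so raw text never leaks.
GENERIC_FAILURE_REASON = "Endpoint build failed due to an internal error."

_SAFE_REASONS: List[Tuple[Type[Exception], str]] = [
    (ImagePullFailure, "The container image could not be pulled."),
    (CheckpointLoadFailure, "The model checkpoint could not be loaded."),
]


def user_facing_reason(error: Exception) -> str:
    # Return the first mapped message whose type matches; otherwise stay generic.
    for exc_type, reason in _SAFE_REASONS:
        if isinstance(error, exc_type):
            return reason
    return GENERIC_FAILURE_REASON
```

With this shape, the full `str(error)` can still go to the logs while only the mapped message is stored in `status_reason`.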


Comment on lines +342 to +346
# status_reason is handled separately from dict_not_none so an explicit
# clear (empty string -> NULL) is possible when an endpoint recovers to a
# healthy state, while omitting it (None) leaves any existing reason intact.
if status_reason is not None:
    update_kwargs["status_reason"] = status_reason or None
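The three cases this snippet distinguishes can be exercised in isolation. The wrapper function below is a hypothetical extraction for illustration; in the PR the logic lives inline in the repository's update method.

```python
from typing import Any, Dict, Optional


def status_reason_update_kwargs(status_reason: Optional[str]) -> Dict[str, Any]:
    # Hypothetical wrapper around the two quoted lines.
    update_kwargs: Dict[str, Any] = {}
    if status_reason is not None:
        # "" is falsy, so `or None` turns the empty string into NULL.
        update_kwargs["status_reason"] = status_reason or None
    return update_kwargs
```

Passing `None` yields no update for the column, `""` yields an explicit NULL, and any other string is written as-is.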


P2 Empty-string-as-sentinel is a non-obvious API contract

The update method uses None to mean "leave the existing value unchanged" and "" to mean "clear to NULL". This is documented in a comment but differs from every other nullable parameter in the same method signature (e.g., status, destination, metadata) where None always means "do not update". A future caller who passes status_reason=None intending to clear the field will silently leave a stale failure reason. Consider using a dedicated sentinel (e.g., a module-level CLEAR = object()) or a separate clear_status_reason: bool = False parameter to make the intent explicit.
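A dedicated sentinel along the lines suggested here could look like the following sketch. `_Clear`, `CLEAR`, and `build_update_kwargs` are illustrative names, and the function models only two of the method's parameters.

```python
from typing import Any, Dict, Optional, Union


class _Clear:
    """Marker type meaning "set this column to NULL"."""

    def __repr__(self) -> str:
        return "CLEAR"


CLEAR = _Clear()


def build_update_kwargs(
    status: Optional[str] = None,
    status_reason: Union[str, _Clear, None] = None,
) -> Dict[str, Any]:
    kwargs: Dict[str, Any] = {}
    if status is not None:
        kwargs["status"] = status
    if status_reason is CLEAR:
        # Explicit clear: write NULL.
        kwargs["status_reason"] = None
    elif status_reason is not None:
        kwargs["status_reason"] = status_reason
    # None keeps its usual meaning across all parameters: do not update.
    return kwargs
```

Using a small class instead of a bare `object()` gives the sentinel a readable repr and lets type checkers express it in the signature; whether that is worth the extra lines is a style call for the team.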



dm36 commented Jun 20, 2026

Collaborator Author

Superseded by #840, opened from a branch directly on this repo now that I have write access. Closing this fork-based PR.

dm36 closed this Jun 20, 2026