Skip to content

feat: record and expose status_reason on model endpoints#840

Open
dm36 wants to merge 3 commits into
mainfrom
dhruv/endpoint-status-reason
Open

feat: record and expose status_reason on model endpoints#840
dm36 wants to merge 3 commits into
mainfrom
dhruv/endpoint-status-reason

Conversation

@dm36

@dm36 dm36 commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

When an endpoint build fails (status UPDATE_FAILED), the cause is only logged and discarded — the GET model-endpoint API exposes status but no reason. A downstream consumer (Scale's SGP / GenAI platform) polls this endpoint to surface deploy results to users, so today a failed deploy can only show a generic "failed" message with no actionable cause (CUDA OOM, bad checkpoint, image pull failure, etc.).

This threads a status_reason through the stack so the failure cause is persisted and returned by the API:

  • Add nullable status_reason column to the endpoints table (+ Alembic migration).
  • Add status_reason to the ORM model, ModelEndpointRecord entity, the ORM→entity translation, and the create/update repository methods.
  • Expose status_reason on GetModelEndpointV1Response and map it in the use case.
  • Capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns to READY.

Test plan

  • Unit test: a failed endpoint's status_reason flows through to GetModelEndpointV1Response.
  • Reviewer note: please run the full unit suite + Alembic migration; the author couldn't install deps locally (CodeArtifact auth). Changes byte-compile clean and logic was validated standalone.

Notes

Open to adjusting field naming / sanitization length / which call sites capture a reason to match team conventions. Supersedes #839 (opened from a fork before access was granted).

🤖 Generated with Claude Code

Greptile Summary

This PR threads a status_reason field through the full stack — DB column, ORM, domain entity, repository, use-case, and API response — so that the cause of an UPDATE_FAILED deployment is surfaced to consumers instead of being silently discarded.

  • DB + ORM: Adds a nullable Text column status_reason to the endpoints table via a correctly-chained Alembic migration, and mirrors it in the SQLAlchemy model and ModelEndpointRecord entity.
  • Capture sites: A new _status_reason_from_error helper gates what is persisted — only DomainException messages (intentionally user-safe) are surfaced; all other exceptions fall back to a generic string. The previous "dead write" before DockerBuildFailedException was raised has been removed, fixing the overwrite issue flagged in an earlier review.
  • Clear on recovery: clear_status_reason=True is threaded through so the column is NULLed when an endpoint returns to READY.

Confidence Score: 5/5

Safe to merge. The change is purely additive: a nullable column with a correct migration, a new optional response field, and narrowly-scoped writes at existing failure paths.

All failure-capture paths are covered, the dead-write from the prior review is resolved, the migration chains correctly, and the DomainException gate prevents raw infra error text from leaking into the API. The one note is that a prior failure reason persists visibly during a subsequent UPDATE_IN_PROGRESS window, but that is a minor UX concern rather than a correctness defect.

No files require special attention. live_endpoint_builder_service.py is the most logic-dense file but the changes there are straightforward.

Important Files Changed

Filename Overview
model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py Adds _status_reason_from_error helper and wires status_reason into the outer exception handler; removes the now-redundant inner update_model_endpoint_record before DockerBuildFailedException. Clears reason on READY but not on UPDATE_IN_PROGRESS, so a prior failure reason can bleed into a retry window.
model-engine/model_engine_server/db/migrations/alembic/versions/2026_06_16_1200-c4d5e6f7a8b9_add_status_reason_column.py Adds nullable Text column status_reason to the endpoints table with correct down_revision chaining to the prior migration (a1b2c3d4e5f6). Upgrade and downgrade are both present and correct.
model-engine/model_engine_server/infra/repositories/db_model_endpoint_record_repository.py Adds status_reason to create/update paths. Uses dict_not_none to skip None (preserve existing value) and clear_status_reason=True to explicitly NULL out the column — design is sound and consistent with other nullable fields.
model-engine/model_engine_server/domain/entities/model_endpoint_entity.py Adds status_reason: Optional[str] = None to ModelEndpointRecord entity — minimal, correct change.
model-engine/model_engine_server/common/dtos/model_endpoints.py Adds status_reason to GetModelEndpointV1Response; flows through to ListModelEndpointsV1Response automatically since it reuses the same DTO.
model-engine/model_engine_server/domain/use_cases/model_endpoint_use_cases.py Maps model_endpoint.record.status_reason into the response DTO — single-line, correct.
model-engine/model_engine_server/infra/services/live_model_endpoint_service.py Sets a hardcoded user-safe status_reason on infra-delete failure — correct, isolated change.
model-engine/tests/unit/domain/test_model_endpoint_use_cases.py Adds a focused test that verifies status_reason flows through GetModelEndpointByIdV1UseCase into the API response.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Builder as LiveEndpointBuilderService
    participant Repo as DbModelEndpointRecordRepository
    participant DB as endpoints table
    participant API as GET /model-endpoints/{id}

    Builder->>Repo: "update(status=UPDATE_IN_PROGRESS)"
    Repo->>DB: "SET endpoint_status='UPDATE_IN_PROGRESS'"
    Note over DB: status_reason unchanged (old value persists)

    alt Build succeeds
        Builder->>Repo: "update(status=READY, clear_status_reason=True)"
        Repo->>DB: "SET endpoint_status='READY', status_reason=NULL"
    else Build fails
        Builder->>Builder: _status_reason_from_error(error)
        Note over Builder: DomainException → message; else generic string
        Builder->>Repo: "update(status=UPDATE_FAILED, status_reason=reason)"
        Repo->>DB: "SET endpoint_status='UPDATE_FAILED', status_reason=reason"
    end

    API->>DB: SELECT status, status_reason FROM endpoints
    DB-->>API: "{status, status_reason}"
    API-->>API: "GetModelEndpointV1Response(status_reason=...)"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Builder as LiveEndpointBuilderService
    participant Repo as DbModelEndpointRecordRepository
    participant DB as endpoints table
    participant API as GET /model-endpoints/{id}

    Builder->>Repo: "update(status=UPDATE_IN_PROGRESS)"
    Repo->>DB: "SET endpoint_status='UPDATE_IN_PROGRESS'"
    Note over DB: status_reason unchanged (old value persists)

    alt Build succeeds
        Builder->>Repo: "update(status=READY, clear_status_reason=True)"
        Repo->>DB: "SET endpoint_status='READY', status_reason=NULL"
    else Build fails
        Builder->>Builder: _status_reason_from_error(error)
        Note over Builder: DomainException → message; else generic string
        Builder->>Repo: "update(status=UPDATE_FAILED, status_reason=reason)"
        Repo->>DB: "SET endpoint_status='UPDATE_FAILED', status_reason=reason"
    end

    API->>DB: SELECT status, status_reason FROM endpoints
    DB-->>API: "{status, status_reason}"
    API-->>API: "GetModelEndpointV1Response(status_reason=...)"
Loading

Reviews (2): Last reviewed commit: "fix: remove dead status_reason write on ..." | Re-trigger Greptile

When an endpoint build fails (status UPDATE_FAILED), the cause was only logged
and discarded — the GET model-endpoint API exposed status but no reason, so
downstream consumers (e.g. SGP) could only show a generic failure message.

Thread a status_reason through the stack so the failure cause is persisted and
returned by the API:
- add nullable status_reason column to the endpoints table (+ Alembic migration)
- add status_reason to the ORM model, ModelEndpointRecord entity, the ORM->entity
  translation, and the create/update repository methods
- expose status_reason on GetModelEndpointV1Response and map it in the use case
- capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build
  failure, infra-delete failure) and clear it when an endpoint returns to READY

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Don't leak raw infra error text: only surface str(error) when it's a
  DomainException (the codebase's own user-facing errors); other exceptions
  (k8s/AWS/DB) fall back to a generic message. Full detail is still logged.
- Replace the empty-string-as-clear convention with an explicit
  clear_status_reason flag so status_reason=None means "unchanged" like every
  other field in update_model_endpoint_record.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py Outdated
The image-build-failure path wrote a clean status_reason, then raised
DockerBuildFailedException — a DomainException whose raw message (including the
internal endpoint_id) was surfaced by the outer except handler, overwriting the
clean message. Drop the redundant write and instead make the exception message
itself the user-safe guidance (it becomes the status_reason via the outer
handler); the endpoint_id is logged separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant