-
Notifications
You must be signed in to change notification settings - Fork 76
feat: record and expose status_reason on model endpoints #839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| """add status_reason column | ||
|
|
||
| Revision ID: c4d5e6f7a8b9 | ||
| Revises: a1b2c3d4e5f6 | ||
| Create Date: 2026-06-16 12:00:00.000000 | ||
|
|
||
| """ | ||
| import sqlalchemy as sa | ||
| from alembic import op | ||
|
|
||
| # revision identifiers, used by Alembic. | ||
| revision = 'c4d5e6f7a8b9' | ||
| down_revision = 'a1b2c3d4e5f6' | ||
| branch_labels = None | ||
| depends_on = None | ||
|
|
||
|
|
||
| def upgrade() -> None: | ||
| op.add_column( | ||
| 'endpoints', | ||
| sa.Column('status_reason', sa.Text, nullable=True), | ||
| schema='hosted_model_inference', | ||
| ) | ||
|
|
||
|
|
||
| def downgrade() -> None: | ||
| op.drop_column( | ||
| 'endpoints', | ||
| 'status_reason', | ||
| schema='hosted_model_inference', | ||
| ) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -83,6 +83,16 @@ | |
|
|
||
| INITIAL_K8S_CACHE_TTL_SECONDS: int = 180 | ||
| MAX_IMAGE_TAG_LEN = 128 | ||
| # Cap the persisted failure reason so a verbose traceback/string can't blow up the column. | ||
| MAX_STATUS_REASON_LEN = 500 | ||
|
|
||
|
|
||
| def _sanitize_status_reason(message: str) -> str: | ||
| """Collapse whitespace and cap length for a status_reason persisted on an endpoint.""" | ||
| collapsed = " ".join(message.split()) | ||
| if len(collapsed) > MAX_STATUS_REASON_LEN: | ||
| return collapsed[: MAX_STATUS_REASON_LEN - 1].rstrip() + "…" | ||
| return collapsed | ||
|
|
||
| RESTRICTED_ENV_VARS_KEYS = { | ||
| "BASE": [ | ||
|
|
@@ -342,15 +352,19 @@ async def build_endpoint( | |
| model_endpoint_id=endpoint_id, | ||
| destination=create_or_update_response.destination, | ||
| status=ModelEndpointStatus.READY, | ||
| # Clear any reason from a prior failed attempt now that we're healthy. | ||
| status_reason="", | ||
| ) | ||
|
|
||
| except Exception as error: # noqa | ||
| log_error("Failed endpoint build process!") | ||
| # Update status as failed endpoint creation on unhandled error | ||
| # Update status as failed endpoint creation on unhandled error, recording | ||
| # the cause so it can be surfaced to API consumers. | ||
| try: | ||
| await self.model_endpoint_record_repository.update_model_endpoint_record( | ||
| model_endpoint_id=endpoint_id, | ||
| status=ModelEndpointStatus.UPDATE_FAILED, | ||
| status_reason=_sanitize_status_reason(str(error)), | ||
| ) | ||
|
Comment on lines
+367
to
368
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The broad Prompt To Fix With AIThis is a comment left during a code review.
Path: model-engine/model_engine_server/infra/services/live_endpoint_builder_service.py
Line: 367-368
Comment:
**Unfiltered `str(error)` may leak sensitive infrastructure details**
The broad `except Exception` block catches all errors from infrastructure operations (Kubernetes API calls, AWS SDK calls, database operations, etc.). Passing `str(error)` directly as `status_reason` can expose internal connection strings, internal IP addresses, resource names, or database schema details to API consumers via the public `GET /model-endpoint` response. The `_sanitize_status_reason` helper only collapses whitespace and caps length — it performs no content filtering. Consider catching specific known exception types and mapping them to safe, user-facing messages, or stripping exception types from a known-internal class hierarchy before persisting the message.
How can I resolve this? If you propose a fix, please make it concise. |
||
| except Exception as error_update: | ||
| log_error("Failed to update endpoint build status to FAILED") | ||
|
|
@@ -717,6 +731,8 @@ async def _build_image( | |
| await self.model_endpoint_record_repository.update_model_endpoint_record( | ||
| model_endpoint_id=build_endpoint_request.model_endpoint_record.id, | ||
| status=ModelEndpointStatus.UPDATE_FAILED, | ||
| status_reason="Image build failed. Check that the bundle's image " | ||
| "and dependencies are valid and accessible.", | ||
| ) | ||
|
|
||
| if s3_logs_location is not None: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update method uses
Noneto mean "leave the existing value unchanged" and""to mean "clear to NULL". This is documented in a comment but differs from every other nullable parameter in the same method signature (e.g.,status,destination,metadata) whereNonealways means "do not update". A future caller who passesstatus_reason=Noneintending to clear the field will silently leave a stale failure reason. Consider using a dedicated sentinel (e.g., a module-levelCLEAR = object()) or a separateclear_status_reason: bool = Falseparameter to make the intent explicit.Prompt To Fix With AI
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!