
[DO NOT MERGE][MLI-7219] feat(k8s-cache): log + metric on cacher Redis write failure#843

Draft
lorenzo-norcini-scale wants to merge 1 commit into main from lorenzonorcini/mli-7219-surface-cacher-redis-write-failures-instead-of-returning

Conversation

@lorenzo-norcini-scale

Collaborator

Summary

When the k8s cacher fails to write endpoint status to Redis (bad auth, network partition, expired credentials), the failure is currently observable nowhere: no log, no metric. Entries then expire (60s TTL) and the Gateway reports endpoint status as unknown.

This wraps the write loop in ModelEndpointCacheWriteService.execute() to emit a logger.exception (the traceback names the cause) and increment a new scale_launch.k8s_cache.redis_write_failure counter, then re-raises. The goal is visibility, not recovery: crash-loop behavior is unchanged. Only the Redis write is wrapped; the k8s read and image cache stay outside.
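A minimal sketch of the log, count, and re-raise pattern described above. It is not the code in this PR: the function name `write_endpoint_statuses`, the `write_one` callable, and the `CounterStub` metrics client are hypothetical stand-ins, and the real `execute()` may well be async. Only the counter name comes from the description.

```python
import logging

logger = logging.getLogger(__name__)

# Counter name taken from the PR description.
REDIS_WRITE_FAILURE = "scale_launch.k8s_cache.redis_write_failure"


class CounterStub:
    """Hypothetical in-memory stand-in for a metrics client."""

    def __init__(self):
        self.counts = {}

    def increment(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1


def write_endpoint_statuses(statuses, write_one, metrics):
    """Write every endpoint status; on failure, log and count, then re-raise.

    `write_one` is a hypothetical callable standing in for the Redis write.
    Only the write loop sits inside the try block, mirroring the PR's scoping.
    """
    try:
        for endpoint_id, status in statuses.items():
            write_one(endpoint_id, status)
    except Exception:
        # logger.exception records the traceback, which names the cause.
        logger.exception("Failed to write endpoint status to Redis")
        metrics.increment(REDIS_WRITE_FAILURE)
        # Visibility only: the caller still sees the original exception.
        raise
```

The bare `raise` keeps the original exception and traceback intact, so whatever restarts the cacher today behaves the same as before.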

Notes

  • Scoped to the cacher only. The Gateway's read-through fill and the builder's create-time seed are left alone.
  • Because the Gateway falls back to reading k8s on a cache miss, this metric is a cacher-refresh early-warning signal, not a direct measure of user-facing `unknown` statuses.
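To illustrate why a failed cacher write does not map one-to-one onto a user-facing failure, here is a rough sketch of a read-through lookup. The function `get_endpoint_status`, the dict-backed `cache`, and the `read_from_k8s` callable are all hypothetical; the Gateway's actual fill logic is not shown in this PR, and TTL handling is omitted.

```python
from typing import Callable, Dict, Optional


def get_endpoint_status(
    endpoint_id: str,
    cache: Dict[str, str],
    read_from_k8s: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the cached status, falling back to k8s on a cache miss."""
    cached = cache.get(endpoint_id)
    if cached is not None:
        return cached
    # Cache miss (for example, the entry expired because the cacher
    # stopped refreshing it): read the source of truth directly.
    status = read_from_k8s(endpoint_id)
    if status is not None:
        # Read-through fill so the next lookup is served from the cache.
        cache[endpoint_id] = status
    return status
```

Under these assumptions, a lookup can still succeed while the cacher is failing, which is why the counter signals refresh health rather than user impact.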

Testing

New failure-branch test (raising repo → metric increments and exception propagates); happy-path test asserts the counter stays 0. Cache-service + redis-repo unit tests pass.

Resolves MLI-7219.

🤖 Generated with Claude Code

Cacher write failures were observable nowhere: no log, no metric. Entries
then expire and the Gateway reports endpoint status as `unknown`. Wrap the
write loop in execute() to emit a logger.exception + a new
scale_launch.k8s_cache.redis_write_failure counter, then re-raise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lorenzo-norcini-scale lorenzo-norcini-scale force-pushed the lorenzonorcini/mli-7219-surface-cacher-redis-write-failures-instead-of-returning branch from 59fc93c to 7eb3546 Compare June 23, 2026 01:41