fix(gateway-runtime): gracefully drain Router on shutdown #2122
renuka-fernando wants to merge 1 commit into
On SIGTERM the entrypoint sent Envoy SIGTERM directly; Envoy exits its event loop immediately without draining, so in-flight keep-alive connections are reset during rolling restarts / pod evictions. Drain the Router first via the admin /drain_listeners?graceful endpoint (bash /dev/tcp, no curl in image) and wait before terminating.
- Add ROUTER_ADMIN_HOST/PORT and ROUTER_DRAIN_TIME_SECONDS (default 15s, 0 disables); keep below the pod terminationGracePeriodSeconds
- Terminate in dependency order: Router -> Policy Engine -> Python Executor, each waiting for full exit
- Mirror the change in docker-entrypoint-debug.sh
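The drain request described above can be sketched as follows, assuming bash (for `/dev/tcp`). The endpoint, the env var names, and their defaults come from the commit message; the function bodies, the `drain_request` helper, the request headers, and the 2-second read timeout are illustrative assumptions, not the PR's actual code.

```bash
#!/usr/bin/env bash
# Sketch only: endpoint and variable names follow the commit message,
# everything else here is an assumption made for illustration.
ROUTER_ADMIN_HOST="${ROUTER_ADMIN_HOST:-127.0.0.1}"
ROUTER_ADMIN_PORT="${ROUTER_ADMIN_PORT:-9901}"

# Print a minimal HTTP/1.1 request for the admin drain endpoint.
drain_request() {
  printf 'POST /drain_listeners?graceful HTTP/1.1\r\nHost: %s:%s\r\nContent-Length: 0\r\nConnection: close\r\n\r\n' \
    "$ROUTER_ADMIN_HOST" "$ROUTER_ADMIN_PORT"
}

# Best-effort drain: returns non-zero if the admin is unreachable or does
# not answer 200. It runs in a subshell so a failed redirect can never
# take down the calling entrypoint.
drain_router() {
  (
    # Open fd 3 as a TCP connection to the admin listener (bash feature).
    exec 3<>"/dev/tcp/${ROUTER_ADMIN_HOST}/${ROUTER_ADMIN_PORT}" || exit 1
    drain_request >&3
    # Read only the status line, for example "HTTP/1.1 200 OK".
    IFS= read -r -t 2 status_line <&3 || exit 1
    case "$status_line" in
      *" 200 "*) exit 0 ;;
      *) exit 1 ;;
    esac
  ) 2>/dev/null
}
```

A caller would treat a failure as "skip the drain and carry on", for example `drain_router || echo "router admin unreachable, skipping drain"`, so an unreachable admin never blocks shutdown.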
📝 Walkthrough
Overview
This PR implements graceful shutdown handling for the Router (Envoy) in the gateway-runtime entrypoint to prevent connection errors during rolling updates. Previously, SIGTERM was forwarded directly to Envoy, which immediately terminated its event loop without draining in-flight requests, causing client-visible connection resets.
Changes
Updated
Impact
End-to-end testing on Kubernetes (k3s) with Envoy 1.37.1 under sustained load (~3,800 TPS) during rollout showed failed requests dropped from 105 (0.023%) to 0 across ~463k samples, eliminating client-visible connection errors during pod termination.
Notes
Walkthrough
This pull request implements graceful shutdown for the gateway runtime by introducing configurable connection draining before process termination. The changes apply to both the main and debug entrypoint scripts. The shutdown flow now begins with an admin API call to drain active connections from Envoy, waits for the configured drain window to allow in-flight requests to complete, and then terminates processes in dependency order: Envoy (Router), Policy Engine, and Python Executor. Configuration variables define the Router admin endpoint and drain timeout with sensible defaults for container environments.
Sequence Diagram
sequenceDiagram
participant SignalHandler as Signal Handler
participant DrainRouter as drain_router()
participant RouterAdmin as Router Admin API
participant StopProc as stop_proc()
participant Envoy
participant PolicyEngine as Policy Engine
participant PythonExecutor as Python Executor
SignalHandler->>DrainRouter: SIGTERM/SIGINT received
DrainRouter->>RouterAdmin: POST /drain_listeners?graceful
RouterAdmin-->>DrainRouter: drain acknowledged
DrainRouter->>SignalHandler: drain request sent
SignalHandler->>SignalHandler: wait ROUTER_DRAIN_TIME_SECONDS
SignalHandler->>StopProc: stop Envoy PID
StopProc->>Envoy: SIGTERM
Envoy-->>StopProc: exit
StopProc->>SignalHandler: Envoy stopped
SignalHandler->>StopProc: stop Policy Engine PID
StopProc->>PolicyEngine: SIGTERM
PolicyEngine-->>StopProc: exit
StopProc->>SignalHandler: Policy Engine stopped
SignalHandler->>StopProc: stop Python Executor PID
StopProc->>PythonExecutor: SIGTERM
PythonExecutor-->>StopProc: exit
StopProc->>SignalHandler: Python Executor stopped
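The termination half of the sequence above can be sketched in bash as follows. `stop_proc` is named in the diagram, but its body here is an assumption; `stop_all` and its three positional PID arguments are hypothetical, and the real entrypoint may track PIDs differently.

```bash
#!/usr/bin/env bash
# Sketch only: stop_proc appears in the diagram, its body is assumed.

# Send SIGTERM to one child and block until it has exited, so the next
# process in the chain is never stopped while this one is still running.
stop_proc() {
  local name="$1"
  local pid="$2"
  # Nothing to do if no PID was recorded or the process is already gone.
  if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
    return 0
  fi
  kill -TERM "$pid" 2>/dev/null || true
  # wait only works for children of this shell, which holds when the
  # entrypoint itself started the process in the background.
  wait "$pid" 2>/dev/null || true
  echo "${name} stopped"
}

# Dependency order from the PR: Router, then Policy Engine, then
# Python Executor. The argument layout is hypothetical.
stop_all() {
  stop_proc "Router" "$1"
  stop_proc "Policy Engine" "$2"
  stop_proc "Python Executor" "$3"
}
```

In an entrypoint this would typically sit behind a `trap` for `TERM` and `INT`, after the drain request and the drain wait have run.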
Purpose
On `SIGTERM` (Kubernetes rolling restart, pod eviction, or scale-down) the gateway-runtime entrypoint forwarded the signal straight to Envoy, which exits its event loop immediately with no drain, so in-flight keep-alive connections on the terminating pod are reset and clients see connection errors (`NoHttpResponseException` / connection reset) for the duration of the rollout.
This was confirmed against the Envoy source (see the Envoy source analysis under Approach): a plain `SIGTERM` never enters Envoy's drain manager. Graceful draining is only triggered by the admin `POST /drain_listeners?graceful` endpoint, `POST /healthcheck/fail`, or the hot-restart RPC, and no other signal triggers it either.
Resolves #2121
Goals
Make pod termination graceful so a rolling restart is zero-error: in-flight requests complete and keep-alive connections close cleanly (`Connection: close`) instead of being reset, with no client-visible failures.
Approach
- On `SIGTERM`, the entrypoint now first drains the Router (Envoy) via the admin `POST /drain_listeners?graceful` endpoint before terminating anything.
- The drain request goes over a bash `/dev/tcp` socket because the runtime image ships no `curl`/`wget`; it is best-effort and never blocks shutdown if the admin is unreachable.
- It then waits `ROUTER_DRAIN_TIME_SECONDS` (default `15`) for in-flight requests to finish, then terminates processes in dependency order Router → Policy Engine → Python Executor, each waiting for the previous to fully exit so a dependency is never killed while something still needs it.
- New configuration: `ROUTER_ADMIN_HOST` (default `127.0.0.1`), `ROUTER_ADMIN_PORT` (default `9901`), `ROUTER_DRAIN_TIME_SECONDS` (default `15`, set `0` to disable); keep the drain time below the pod `terminationGracePeriodSeconds` (k8s default `30s`) or the container is `SIGKILL`ed mid-drain.
- Applied to both `docker-entrypoint.sh` and `docker-entrypoint-debug.sh` (kept in sync).
Envoy source analysis (why the admin drain is required):
- `source/server/server.cc`: `SIGTERM`/`SIGINT` call `instance.shutdown()` → `dispatcher_->exit()`, which breaks the libevent loop immediately with no drain-manager involvement (Envoy logs "main dispatch loop exited" ~3 ms after `SIGTERM`).
- `DrainManager::startDrainSequence` is only reachable from the admin `/drain_listeners` handler, the hot-restart `kDrainListeners` RPC (a domain-socket message, not a signal), and LDS listener add/remove.
- `--drain-time-s` / `--drain-strategy` apply only to hot-restart / LDS drains and have no effect on plain `SIGTERM`.
- No other signal triggers a drain (`SIGUSR1` reopens access logs, `SIGHUP` is ignored, `SIGINT` is an immediate shutdown like `SIGTERM`).
User stories
N/A
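The drain-window settings listed under Approach can be sanity-checked with a small helper. This is a sketch under assumptions: the variable name, the default of `15`, and the Kubernetes default grace period of 30 seconds come from the description, while `drain_seconds`, `drain_fits_grace_period`, and the fallback to the default on a non-numeric value are hypothetical and not taken from the PR.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the invalid-value fallback are assumptions.

# Print the drain window in seconds: the configured value if it is a
# non-negative integer (0 means "do not wait"), otherwise the default 15.
drain_seconds() {
  local value="${ROUTER_DRAIN_TIME_SECONDS:-15}"
  case "$value" in
    ''|*[!0-9]*) echo 15 ;;
    *) echo "$value" ;;
  esac
}

# Succeed only if the drain window is shorter than the pod's
# terminationGracePeriodSeconds, passed as $1 (defaults to 30).
drain_fits_grace_period() {
  local grace="${1:-30}"
  [ "$(drain_seconds)" -lt "$grace" ]
}
```

For example, `ROUTER_DRAIN_TIME_SECONDS=45` against the default 30-second grace period fails the check, which is the case where the container would be `SIGKILL`ed mid-drain.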
Documentation
N/A: internal shutdown behaviour. The new env vars are documented inline in `docker-entrypoint.sh`.
Automation tests
`set-headers` response policy), under JMeter load (100 threads, ~3,800 TPS) while issuing `kubectl rollout restart` on the runtime Deployment.
Test results (rolling restart under load):
`NoHttpResponseException` connection resets
With the fix, success stays at 100% throughout the rollout; the only cost is a slightly longer rollout (each pod drains ~15 s before exiting). No `503`/`404` and no dropped response headers in either case.
Security checks
Samples
N/A
Related PRs
N/A
Test environment
- Kubernetes (k3s) `v1.33` (colima), 1 controller + 2 runtime replicas, capped resources.
- Envoy `1.37.1` (the version shipped in gateway-runtime).