fix(gateway-runtime): gracefully drain Router on shutdown#2122

Open
renuka-fernando wants to merge 1 commit into
wso2:mainfrom
renuka-fernando:gateway-runtime-graceful-shutdown

@renuka-fernando

Contributor

Purpose

On SIGTERM (Kubernetes rolling restart, pod eviction, or scale-down), the gateway-runtime entrypoint forwarded the signal straight to Envoy, which exits its event loop immediately without draining. In-flight keep-alive connections on the terminating pod are therefore reset, and clients see connection errors (NoHttpResponseException / connection reset) for the duration of the rollout.
This was confirmed against the Envoy source (see the Envoy analysis under Approach): a plain SIGTERM never enters Envoy's drain manager — graceful draining is only triggered by the admin POST /drain_listeners?graceful endpoint, POST /healthcheck/fail, or the hot-restart RPC, and no other signal triggers it either.

Resolves #2121

Goals

Make pod termination graceful so a rolling restart is zero-error: in-flight requests complete and keep-alive connections close cleanly (Connection: close) instead of being reset, with no client-visible failures.

Approach

  • On SIGTERM, the entrypoint now first drains the Router (Envoy) via the admin POST /drain_listeners?graceful endpoint before terminating anything.
  • The admin call uses a bash /dev/tcp socket because the runtime image ships no curl/wget; it is best-effort and never blocks shutdown if the admin is unreachable.
  • It then waits ROUTER_DRAIN_TIME_SECONDS (default 15) for in-flight requests to finish and terminates processes in dependency order: Router → Policy Engine → Python Executor. Each process is signalled only after the previous one has fully exited, so a dependency is never killed while something still needs it.
  • New env vars: ROUTER_ADMIN_HOST (default 127.0.0.1), ROUTER_ADMIN_PORT (default 9901), ROUTER_DRAIN_TIME_SECONDS (default 15; set to 0 to disable). Keep the drain time below the pod terminationGracePeriodSeconds (k8s default 30s); otherwise the container is SIGKILLed mid-drain.
  • Applied to both docker-entrypoint.sh and docker-entrypoint-debug.sh (kept in sync).
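
The admin call described above can be sketched roughly as follows. This is an illustration, not the code from the PR: the function name drain_router comes from the review's sequence diagram, but the body, the log messages, the 2-second read timeout, and the status-line check are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the actual entrypoint may be structured differently.
ROUTER_ADMIN_HOST="${ROUTER_ADMIN_HOST:-127.0.0.1}"
ROUTER_ADMIN_PORT="${ROUTER_ADMIN_PORT:-9901}"

# Ask Envoy to start a graceful listener drain through its admin listener.
# Uses bash's built-in /dev/tcp pseudo-device, so no curl/wget is required.
# Best-effort: always returns 0 so a failed request never blocks shutdown.
drain_router() {
    local host="$ROUTER_ADMIN_HOST" port="$ROUTER_ADMIN_PORT"
    if (
        # Run in a subshell so fd 3 is closed automatically on exit and a
        # failed connect cannot affect the calling shell.
        exec 3<>"/dev/tcp/${host}/${port}" || exit 1
        printf 'POST /drain_listeners?graceful HTTP/1.1\r\nHost: %s:%s\r\nContent-Length: 0\r\nConnection: close\r\n\r\n' \
            "$host" "$port" >&3 || exit 1
        # Read only the HTTP status line (assumed timeout: 2 seconds).
        IFS= read -r -t 2 status_line <&3 || exit 1
        case "$status_line" in
            "HTTP/1."*" 200"*) exit 0 ;;
            *) exit 1 ;;
        esac
    ) 2>/dev/null; then
        echo "Router drain requested" >&2
    else
        echo "Router drain request failed; continuing shutdown" >&2
    fi
    return 0
}
```

Running the socket work in a subshell keeps the failure path simple: whether the connect is refused or the response is unexpected, the caller only sees a log line and proceeds with shutdown.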

Envoy source analysis (why the admin drain is required):

  • source/server/server.cc: SIGTERM/SIGINT call instance.shutdown() → dispatcher_->exit(), which breaks the libevent loop immediately with no drain-manager involvement (Envoy logs "main dispatch loop exited" ~3 ms after SIGTERM).
  • DrainManager::startDrainSequence is only reachable from the admin /drain_listeners handler, the hot-restart kDrainListeners RPC (a domain-socket message, not a signal), and LDS listener add/remove.
  • --drain-time-s / --drain-strategy apply only to hot-restart / LDS drains and have no effect on plain SIGTERM.
  • No signal triggers a graceful drain (SIGUSR1 reopens access logs, SIGHUP is ignored, SIGINT is an immediate shutdown like SIGTERM).

User stories

N/A

Documentation

N/A — internal shutdown behaviour. The new env vars are documented inline in docker-entrypoint.sh.

Automation tests

  • Unit tests: N/A (shell entrypoint script).
  • Integration tests: validated end-to-end on Kubernetes (Helm chart 1.1.3, gateway-runtime built with this change, 2 runtime replicas, a 1000-resource REST API with a set-headers response policy), under JMeter load (100 threads, ~3,800 TPS) while issuing kubectl rollout restart on the runtime Deployment.

Test results — rolling restart under load:

| Metric | Stock runtime (no fix) | With this fix |
| --- | --- | --- |
| Failed requests | 105 (0.023%), NoHttpResponseException connection resets | 0 (0.000%) |
| Samples | 452,389 | 463,349 |
| Response codes | 200 + 105 conn-resets | all 200 |

With the fix, success stays at 100% throughout the rollout; the only cost is a slightly longer rollout (each pod drains ~15 s before exiting). No 503/404 and no dropped response headers in either case.
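
The failure percentages in the results can be recomputed from the raw counts. The helper below is a hypothetical convenience (failure_rate is not part of the PR) and assumes awk is available on the machine doing the analysis.

```bash
#!/usr/bin/env bash
# Print failed/total as a percentage with three decimals.
# failure_rate is a hypothetical helper name used only for this illustration.
failure_rate() {
    awk -v failed="$1" -v total="$2" 'BEGIN { printf "%.3f%%\n", (failed / total) * 100 }'
}

failure_rate 105 452389   # stock runtime: prints 0.023%
failure_rate 0 463349     # with the fix:  prints 0.000%
```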

Security checks

  • Followed secure coding standards? yes
  • Ran FindSecurityBugs plugin and verified report? N/A (shell script, no Java).
  • Confirmed that this PR doesn't commit any keys, passwords, tokens, usernames, or other secrets? yes

Samples

N/A

Related PRs

N/A

Test environment

  • Kubernetes: k3s v1.33 (colima), 1 controller + 2 runtime replicas, capped resources.
  • Envoy 1.37.1 (the version shipped in gateway-runtime).
  • Load: Apache JMeter (100 threads, 1000 resources round-robin, keep-alive).

On SIGTERM the entrypoint sent Envoy SIGTERM directly; Envoy exits its
event loop immediately without draining, so in-flight keep-alive
connections are reset during rolling restarts / pod evictions. Drain the
Router first via the admin /drain_listeners?graceful endpoint (bash
/dev/tcp, no curl in image) and wait before terminating.

- Add ROUTER_ADMIN_HOST/PORT and ROUTER_DRAIN_TIME_SECONDS (default 15s,
  0 disables); keep below the pod terminationGracePeriodSeconds
- Terminate in dependency order: Router -> Policy Engine -> Python
  Executor, each waiting for full exit
- Mirror the change in docker-entrypoint-debug.sh
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da5b9b09-44e3-4086-ad8f-119dffde3a37

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0caa3 and 271cc4c.

📒 Files selected for processing (2)
  • gateway/gateway-runtime/docker-entrypoint-debug.sh
  • gateway/gateway-runtime/docker-entrypoint.sh

📝 Walkthrough

Overview

This PR implements graceful shutdown handling for the Router (Envoy) in the gateway-runtime entrypoint to prevent connection errors during rolling updates. Previously, SIGTERM was forwarded directly to Envoy, which immediately terminated its event loop without draining in-flight requests, causing client-visible connection resets.

Changes

Updated docker-entrypoint.sh and docker-entrypoint-debug.sh to introduce a staged shutdown sequence:

  • Graceful Router drain: On SIGTERM, calls the Envoy admin endpoint POST /drain_listeners?graceful using a best-effort mechanism (bash /dev/tcp socket, no external dependencies required) to signal Envoy's drain manager.

  • Configurable drain period: Added ROUTER_DRAIN_TIME_SECONDS environment variable (default 15 seconds) to allow in-flight requests to complete before process termination. Can be disabled by setting to 0.

  • Process termination in dependency order: After draining, terminates Router → Policy Engine → Python Executor in sequence, waiting for each to fully exit before signaling the next.

  • New configuration variables:

    • ROUTER_ADMIN_HOST (default 127.0.0.1)
    • ROUTER_ADMIN_PORT (default 9901)
    • ROUTER_DRAIN_TIME_SECONDS (default 15)
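
The dependency-ordered termination might look something like the sketch below. stop_proc is named in the review's sequence diagram, but its body here is an assumption, and shutdown_in_order plus the PID arguments are hypothetical names introduced for illustration. It also assumes all three processes are direct children of the entrypoint shell, since wait only works on children.

```bash
#!/usr/bin/env bash
# Send SIGTERM to one process and block until it has fully exited.
# Assumes the PID is a direct child of this shell.
stop_proc() {
    local name="$1" pid="$2"
    [ -n "$pid" ] || return 0
    # If the process is already gone there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    # wait returns the child's exit status (143 after SIGTERM); ignore it.
    wait "$pid" 2>/dev/null || true
    echo "$name stopped" >&2
}

# Stop Router, then Policy Engine, then Python Executor, so nothing is
# terminated while a process that depends on it is still running.
# shutdown_in_order is a hypothetical wrapper, not a name from the PR.
shutdown_in_order() {
    local router_pid="$1" policy_engine_pid="$2" python_executor_pid="$3"
    stop_proc "Router" "$router_pid"
    stop_proc "Policy Engine" "$policy_engine_pid"
    stop_proc "Python Executor" "$python_executor_pid"
}
```

In an entrypoint, a trap handler would typically call the drain step, sleep for the drain window, and then invoke something like shutdown_in_order with the PIDs captured from `$!` when each process was started.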

Impact

End-to-end testing on Kubernetes (k3s) with Envoy 1.37.1 under sustained load (~3,800 TPS) during rollout showed failed requests dropping from 105 of 452,389 samples (0.023%) on the stock runtime to 0 of 463,349 samples with the fix, eliminating client-visible connection errors during pod termination.

Notes

  • The drain endpoint call is best-effort and does not block shutdown if unreachable
  • Drain time should be kept below the Kubernetes pod terminationGracePeriodSeconds to avoid SIGKILL during drain
  • Resolves issue #2121
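
The timing constraint in the notes is simple arithmetic: the drain window, plus whatever time the processes need to exit afterwards, has to fit inside terminationGracePeriodSeconds. A minimal sketch of such a check follows; check_drain_budget and the 5-second headroom are assumptions for illustration, since the PR only states that the drain time must stay below the grace period.

```bash
#!/usr/bin/env bash
# Succeed only if drain time plus headroom fits within the grace period.
# The default headroom of 5 seconds is an assumed allowance for process exit.
check_drain_budget() {
    local drain="$1" grace="$2" headroom="${3:-5}"
    [ $((drain + headroom)) -le "$grace" ]
}

# Defaults from the PR: 15 s drain against the Kubernetes default of 30 s.
if check_drain_budget 15 30; then
    echo "drain window fits inside the grace period"
else
    echo "drain window too long; raise terminationGracePeriodSeconds or lower ROUTER_DRAIN_TIME_SECONDS"
fi
```

With the defaults, 15 + 5 = 20 seconds is within the 30-second grace period, so the check passes.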

Walkthrough

This pull request implements graceful shutdown for the gateway runtime by introducing configurable connection draining before process termination. The changes apply to both the main and debug entrypoint scripts. The shutdown flow now begins with an admin API call to drain active connections from Envoy, waits for the configured drain window to allow in-flight requests to complete, and then terminates processes in dependency order: Envoy (Router), Policy Engine, and Python Executor. Three environment variables (ROUTER_ADMIN_HOST, ROUTER_ADMIN_PORT, ROUTER_DRAIN_TIME_SECONDS) define the Router admin endpoint and the drain window, defaulting to 127.0.0.1, 9901, and 15 seconds.
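
A drain window with a default and a "0 disables" switch could be handled along these lines. resolve_drain_seconds and wait_for_drain are hypothetical helper names, and falling back to the default on a non-numeric value is an assumption rather than documented behaviour of the entrypoint.

```bash
#!/usr/bin/env bash
# Normalise the configured drain time: empty or non-numeric input falls back
# to the default of 15 seconds (the fallback is an assumption of this sketch).
resolve_drain_seconds() {
    local value="${1:-15}"
    case "$value" in
        *[!0-9]*) echo 15 ;;
        *) echo "$value" ;;
    esac
}

# Sleep for the drain window; a value of 0 skips the wait entirely.
wait_for_drain() {
    local seconds
    seconds="$(resolve_drain_seconds "${ROUTER_DRAIN_TIME_SECONDS:-}")"
    if [ "$seconds" -gt 0 ]; then
        echo "Waiting ${seconds}s for in-flight requests to finish" >&2
        sleep "$seconds"
    fi
}
```

Validating the value before passing it to sleep avoids a shutdown handler that aborts on a typo in the environment.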

Sequence Diagram

sequenceDiagram
  participant SignalHandler as Signal Handler
  participant DrainRouter as drain_router()
  participant RouterAdmin as Router Admin API
  participant StopProc as stop_proc()
  participant Envoy
  participant PolicyEngine as Policy Engine
  participant PythonExecutor as Python Executor
  
  SignalHandler->>DrainRouter: SIGTERM/SIGINT received
  DrainRouter->>RouterAdmin: POST /drain_listeners?graceful
  RouterAdmin-->>DrainRouter: drain acknowledged
  DrainRouter->>SignalHandler: drain request sent
  SignalHandler->>SignalHandler: wait ROUTER_DRAIN_TIME_SECONDS
  SignalHandler->>StopProc: stop Envoy PID
  StopProc->>Envoy: SIGTERM
  Envoy-->>StopProc: exit
  StopProc->>SignalHandler: Envoy stopped
  SignalHandler->>StopProc: stop Policy Engine PID
  StopProc->>PolicyEngine: SIGTERM
  PolicyEngine-->>StopProc: exit
  StopProc->>SignalHandler: Policy Engine stopped
  SignalHandler->>StopProc: stop Python Executor PID
  StopProc->>PythonExecutor: SIGTERM
  PythonExecutor-->>StopProc: exit
  StopProc->>SignalHandler: Python Executor stopped
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding graceful Router shutdown draining to the gateway-runtime entrypoint. |
| Description check | ✅ Passed | The description comprehensively addresses all template sections with clear purpose, goals, approach, test results, security confirmation, and environment details. |
| Linked Issues check | ✅ Passed | All changes directly address issue #2121: graceful Router drain via admin endpoint on SIGTERM, configurable drain time, dependency-ordered process termination, and zero-error validation under load. |
| Out of Scope Changes check | ✅ Passed | All code changes are scoped to graceful shutdown handling in docker-entrypoint scripts; no unrelated modifications present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway runtime resets in-flight connections on rolling restart
