fix(watch): guard against nil http_code on gRPC stream errors (fixes #222) #223
pubyun wants to merge 1 commit into
A watch over a gRPC stream can report an error with no HTTP status, e.g.
`{"error":{"grpc_code":14,"message":"...EOF"}}` on a transport / stream
error. The check `body.error.http_code >= 500` then evaluated `nil >= 500`,
raising "attempt to compare nil with number" inside the watch coroutine. That
crashes the watch read, skips cancel_watch, and leaks the watcher on the etcd
side; a caller that immediately restarts the watch with no backoff (e.g. APISIX
config_etcd run_watch) turns this into a tight watch-recreate loop that can
drive etcd mvcc_watcher_total into the millions and OOM it.
Parse http_code defensively and treat a missing code (transport / stream error)
or any 5xx as an endpoint failure: report_failure and return a graceful error so
the connection is closed and rebuilt and the watcher is cancelled properly.
Errors that do carry an http_code (e.g. 4xx) keep falling through as before.
Fixes api7#222.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peng Yong <ppyy@netegn.com>
Problem
A watch over a gRPC stream can surface an error object that has no HTTP status.
On a transport / stream error etcd's grpc-gateway returns something like:
`{"error":{"grpc_code":14,"message":"error reading from server: EOF"}}`
The watch read loop checks `body.error.http_code >= 500`.
When `http_code` is `nil` this evaluates `nil >= 500`, raising `attempt to compare nil with number` inside the watch coroutine.
Impact
The error aborts the watch read mid-flight, so `cancel_watch` is skipped and the
watcher leaks on the etcd side. A caller that restarts the watch immediately
with no backoff (for example APISIX `config_etcd.lua`'s `run_watch` via
`ngx.timer.at(0, ...)`) then spins in a tight watch-recreate loop. In a production
APISIX 3.x cluster this drove etcd `mvcc_watcher_total` into the millions and
OOM-crashlooped the etcd pods.
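The failure described under Problem can be reproduced in plain Lua without etcd. This is a minimal sketch: the `body` table is a hand-written stand-in for the decoded error-only chunk, and `pcall` is used only so the error can be inspected instead of aborting the script.

```lua
-- Hand-written stand-in for the decoded error-only watch chunk:
-- grpc_code and message are present, http_code is absent.
local body = {
    error = { grpc_code = 14, message = "error reading from server: EOF" },
}

-- The unguarded comparison, wrapped in pcall so the runtime error is
-- returned as a value rather than propagating.
local ok, err = pcall(function()
    return body.error.http_code >= 500
end)

print(ok)   -- false
print(err)  -- an "attempt to compare ..." error; exact wording depends on the Lua runtime
```

Inside the watch coroutine there is no such `pcall` around the check, which is why the error aborts the read before `cancel_watch` runs.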
Fix
Parse `http_code` defensively and treat a missing code (transport / stream
error) or any 5xx as an endpoint failure: `report_failure` + return a graceful
error so the connection is closed, rebuilt, and the watcher cancelled properly.
Errors that do carry an `http_code` (e.g. a 4xx) keep falling through to the
caller exactly as before.
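The classification rule above might look roughly like the following. This is a sketch under assumptions, not the code in this PR: `is_endpoint_failure` is a hypothetical helper name, and the three error tables are made-up inputs for illustration.

```lua
-- Hypothetical helper (not the library's actual function): decide whether
-- a decoded watch error should be treated as an endpoint failure.
local function is_endpoint_failure(err_obj)
    -- tonumber(nil) returns nil, so a missing http_code never reaches `>=`.
    local http_code = tonumber(err_obj.http_code)
    return http_code == nil or http_code >= 500
end

-- Made-up inputs covering the three cases named in the description.
local eof_err    = { grpc_code = 14, message = "error reading from server: EOF" }
local server_err = { http_code = 503, message = "unavailable" }
local client_err = { http_code = 400, message = "bad request" }

print(is_endpoint_failure(eof_err))     -- true  (no http_code: transport / stream error)
print(is_endpoint_failure(server_err))  -- true  (5xx)
print(is_endpoint_failure(client_err))  -- false (4xx falls through to the caller)
```

Routing the missing-code case through the same branch as 5xx means a stream EOF takes the existing failure path instead of needing a separate one.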
Testing
`compare nil with number` crashes dropped to 0, stream EOFs take the
graceful-error path, etcd watcher count fell back to ≈ number of streams
(single digits), and etcd memory went from hitting the limit back to ~40Mi.
stream EOF mid-watch, which the `Test::Nginx` suite (talking to a real etcd)
can't reproduce deterministically. Happy to add one if maintainers can point at
an existing pattern for injecting a malformed / error-only watch chunk.
Fixes #222.