Skip to content

fix(watch): guard against nil http_code on gRPC stream errors (fixes #222)#223

Open
pubyun wants to merge 1 commit into
api7:masterfrom
pubyun:fix-watch-nil-http-code
Open

fix(watch): guard against nil http_code on gRPC stream errors (fixes #222)#223
pubyun wants to merge 1 commit into
api7:masterfrom
pubyun:fix-watch-nil-http-code

Conversation

@pubyun

@pubyun pubyun commented Jun 29, 2026

Copy link
Copy Markdown

Problem

A watch over a gRPC stream can surface an error object that has no HTTP status.
On a transport / stream error etcd's grpc-gateway returns something like:

{"error":{"grpc_code":14,"message":"error reading from server: EOF"}}

The watch read loop checks:

elseif body.error and body.error.http_code >= 500 then

When http_code is nil this evaluates nil >= 500, raising
attempt to compare nil with number inside the watch coroutine.

Impact

The error aborts the watch read mid-flight, so cancel_watch is skipped and the
watcher leaks on the etcd side. A caller that restarts the watch immediately
with no backoff (for example APISIX config_etcd.lua's run_watch via
ngx.timer.at(0, ...)) then spins in a tight watch-recreate loop. In a production
APISIX 3.x cluster this drove etcd mvcc_watcher_total into the millions and
OOM-crashlooped the etcd pods.

Fix

Parse http_code defensively and treat a missing code (transport / stream
error) or any 5xx as an endpoint failure — report_failure + return a graceful
error so the connection is closed, rebuilt, and the watcher cancelled properly.
Errors that do carry an http_code (e.g. a 4xx) keep falling through to the
caller exactly as before.

local raw = body.error.http_code
local http_code = (type(raw) == "number" and raw)
                  or (type(raw) == "string" and tonumber(raw)) or nil
if http_code == nil or http_code >= 500 then
    health_check.report_failure(endpoint.http_host)
    return nil, endpoint.http_host .. ": "
                .. (body.error.http_status or body.error.message or "watch stream error")
end

Testing

  • Verified in production: with this patch applied (on 1.10.6) the
    compare nil with number crashes dropped to 0, stream EOFs take the
    graceful-error path, etcd watcher count fell back to ≈ number of streams
    (single digits), and etcd memory went from hitting the limit back to ~40Mi.
  • I did not add an automated test. The trigger is a transport-level gRPC
    stream EOF mid-watch, which the Test::Nginx suite (talking to a real etcd)
    can't reproduce deterministically. Happy to add one if maintainers can point at
    an existing pattern for injecting a malformed / error-only watch chunk.

Fixes #222.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of watch stream errors so missing or non-numeric status codes no longer cause failures.
    • Server-side errors are now reported more reliably, with clearer fallback messages when status details are incomplete.
    • Endpoint health is marked more accurately when a watch request fails with a serious error.

A watch over a gRPC stream can report an error with no HTTP status, e.g.
`{"error":{"grpc_code":14,"message":"...EOF"}}` on a transport / stream
error. The check `body.error.http_code >= 500` then evaluated `nil >= 500`,
raising "attempt to compare nil with number" inside the watch coroutine. That
crashes the watch read, skips cancel_watch, and leaks the watcher on the etcd
side; a caller that immediately restarts the watch with no backoff (e.g. APISIX
config_etcd run_watch) turns this into a tight watch-recreate loop that can
drive etcd mvcc_watcher_total into the millions and OOM it.

Parse http_code defensively and treat a missing code (transport / stream error)
or any 5xx as an endpoint failure: report_failure and return a graceful error so
the connection is closed and rebuilt and the watcher is cancelled properly.
Errors that do carry an http_code (e.g. 4xx) keep falling through as before.

Fixes api7#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peng Yong <ppyy@netegn.com>
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 267b59e9-9370-450b-94ca-69a7eeb918ab

📥 Commits

Reviewing files that changed from the base of the PR and between bbcaf8a and fb71360.

📒 Files selected for processing (1)
  • lib/resty/etcd/v3.lua

📝 Walkthrough

Walkthrough

In lib/resty/etcd/v3.lua, the read_watch function's error-handling block is updated to defensively parse body.error.http_code as a number or string-to-number, treating nil or values >= 500 as endpoint failures reported via health_check.report_failure, with a fallback to body.error.message when http_status is absent.

Watch stream error nil-safety fix

Layer / File(s) Summary
Nil-safe http_code parsing and error reporting
lib/resty/etcd/v3.lua
Replaces the direct body.error.http_code >= 500 comparison with defensive type parsing (number, string-to-number, or nil). Missing or 5xx codes trigger health_check.report_failure and return a formatted error using http_status or message as fallback, preventing the attempt to compare nil with number crash on gRPC stream errors like EOF.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
E2e Test Quality Review ⚠️ Warning No E2E regression test covers the nil http_code watch-error path; existing watch tests only cover normal events/timeout, so the fix isn’t validated end-to-end. Add an integration/E2E watch test that forces a stream error chunk without http_code (or equivalent fault injection) and asserts graceful error handling.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the nil http_code guard added for watch stream errors and matches the main fix.
Linked Issues check ✅ Passed The patch addresses the linked nil-compare crash and keeps stream errors on the normal failure path.
Out of Scope Changes check ✅ Passed The changes stay confined to the watch error-handling fix in lib/resty/etcd/v3.lua.
Security Check ✅ Passed Changed code only hardens watch error handling; it adds no secret logging, storage, auth bypass, or TLS/security config changes.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

watch: "attempt to compare nil with number" in v3.lua on etcd gRPC stream error (EOF) — body.error.http_code is nil

2 participants