[configure] "\"rejected connection\" EOF Warnings in etcd Pod Logs Are TLS-Probe Noise" #381
Open
jing2uo wants to merge 1 commit into main from kb/2026-02/rejected-connection-eof-warnings-in-etcd
133 changes: 133 additions & 0 deletions
...utions/rejected_connection_EOF_Warnings_in_etcd_Pod_Logs_Are_TLS_Probe_Noise.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| --- | ||
| kind: | ||
| - Information | ||
| products: | ||
| - Alauda Container Platform | ||
| ProductsVersion: | ||
| - 4.1.0,4.2.x | ||
| --- | ||
| ## Overview | ||
|
|
||
| The etcd pods in ACP's control plane periodically log `rejected connection` warnings with an `"error":"EOF"` tail and a `remote-addr` that points at another control-plane node. The warning repeats at irregular intervals — every few seconds to a few minutes — and can add up to a visible fraction of the etcd log volume on a busy cluster. | ||
|
|
||
| A typical line looks like this: | ||
|
|
||
| ```text | ||
| {"level":"warn","ts":"2026-02-05T14:37:27.018363Z", | ||
| "caller":"embed/config_logging.go:169", | ||
| "msg":"rejected connection", | ||
| "remote-addr":"10.128.0.250:52864", | ||
| "server-name":"","error":"EOF"} | ||
| ``` | ||
|
|
||
| The observation is harmless on its own. Nothing in cluster health degrades, etcd quorum remains intact, request latencies do not rise, and no alert fires. The question operators ask is whether the line is a symptom of something that will fail later, or noise that can be filtered. | ||
|
|
||
| ## Root Cause | ||
|
|
||
| The log entry is emitted by etcd's TLS listener when a client closes the TCP connection before the TLS handshake completes; etcd reports this as a handshake failure. Concretely: | ||
|
|
||
| 1. The peer opens a TCP connection to etcd's serving port. | ||
| 2. etcd waits for the TLS handshake, but the peer sends no ClientHello or abandons the handshake partway. | ||
| 3. The peer closes the TCP connection, and the server's handshake read returns `EOF`. | ||
|
|
||
| From etcd's point of view, the connection never became a usable TLS session. The `server-name` field is empty because no SNI value was received, and the `error` field records the `EOF` that the server read while waiting for handshake data. | ||
|
|
||
| The most likely sources are TCP-level liveness/readiness checks against the etcd endpoint: the prober opens a connection to confirm that the port is accepting traffic and closes it without issuing a request. This is a cheap check that consumes no API quota and writes nothing to the raft log. `kube-apiserver` is sometimes suspected, but its etcd health checks are generally request-level, and `kube-controller-manager` does not normally connect to etcd directly, so attribute the warning to a specific component only after matching the `remote-addr` values to a process. | ||
|
|
||
| The same pattern can also arise from: | ||
|
|
||
| - Node-level health probes (e.g. a kubelet readiness probe against the etcd static pod's probe endpoint). | ||
| - External monitoring tools that port-scan the control plane. | ||
| - etcd's own peer-to-peer connections when a member re-establishes a peer connection during raft leader elections. | ||
|
|
||
| None of these represent a data-plane fault. The etcd server side is simply reporting that a peer connected and then hung up before a TLS session was established. | ||
|
|
||
| ## Resolution | ||
|
|
||
| No corrective action is needed on a healthy cluster. The warnings are informational and do not indicate a broken TLS chain, an authentication failure, or a peer partition. Confirm cluster health once, then either ignore the warnings or filter them at the log-collection layer if they create noise in downstream tooling. | ||
|
|
||
| ### Confirm cluster health | ||
|
|
||
| Run the following checks and proceed only if all three pass: | ||
|
|
||
| ```bash | ||
| # 1. etcd endpoint health: every member should be reported as healthy. | ||
| # etcdctl runs inside the etcd pod and authenticates with the peer | ||
| # certificate that the static pod mounts under /etc/kubernetes/pki/etcd/. | ||
| POD=$(kubectl -n kube-system get pod -l component=etcd \ | ||
| -o jsonpath='{.items[0].metadata.name}') | ||
| kubectl -n kube-system exec "$POD" -- etcdctl \ | ||
| --endpoints https://127.0.0.1:2379 \ | ||
| --cacert /etc/kubernetes/pki/etcd/ca.crt \ | ||
| --cert /etc/kubernetes/pki/etcd/peer.crt \ | ||
| --key /etc/kubernetes/pki/etcd/peer.key \ | ||
| endpoint health --cluster | ||
|
|
||
| # 2. API server readyz: every check should be listed as '[+]<name> ok' and | ||
| # the final line should read 'readyz check passed'. | ||
| kubectl get --raw='/readyz?verbose' | ||
|
|
||
| # 3. No Degraded/Progressing conditions on the cluster's etcd operator | ||
| # (or equivalent control-plane component managed by the platform). | ||
| kubectl get co etcd -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' 2>/dev/null \ | ||
| || kubectl -n kube-system get pod -l component=etcd \ | ||
| -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' | ||
| ``` | ||
|
|
||
| If all three return healthy state, the `rejected connection` lines are safe to ignore. | ||
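As a rough aid for check 2, the verbose `/readyz` output can be reduced to just the failing gates. This is a minimal sketch: `failed_readyz_checks` is a hypothetical helper name, and it assumes the usual verbose format in which failing checks are prefixed with `[-]`.

```bash
# Hypothetical helper: read verbose /readyz output on stdin and print only
# the checks marked as failed ("[-]" prefix). Prints nothing when all pass.
failed_readyz_checks() {
  grep '^\[-\]'
}

# Intended use against a live cluster (not executed here):
#   kubectl get --raw='/readyz?verbose' | failed_readyz_checks
```

Empty output means no gate failed. A line such as `[-]etcd failed` means the API server itself cannot reach etcd, and the warning should no longer be treated as noise.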
|
|
||
| ### Filter the warning at the collection layer | ||
|
|
||
| When the log volume from these lines is a problem for downstream indexing or alerting, drop them at ingest rather than editing etcd's verbosity (which also suppresses genuinely useful lines). A field-based filter is the simplest form: | ||
|
|
||
| ```yaml | ||
| # Example filter applied by the cluster's log forwarder / collector stack. | ||
| # Drops etcd pod entries whose message is exactly the probe-close warning. | ||
| - drop: | ||
|     match: | ||
|       kubernetes.container_name: etcd | ||
|       message: 'rejected connection' | ||
|       error: 'EOF' | ||
| ``` | ||
|
|
||
| Keep the filter conditional on `error=EOF` so that genuine TLS errors (expired certificate, unknown CA, version mismatch) still reach the log system — those produce a different `error` string and are **not** safe to ignore. | ||
|
|
||
| ### When the warning is not noise | ||
|
|
||
| A cluster that is actually unhealthy will show the same line **together** with one of the following, and this is the case that needs investigation: | ||
|
|
||
| - `etcdctl endpoint health` reports any member as unhealthy or times out. | ||
| - API server `/readyz` fails the `etcd` gate (`[-]etcd failed`). | ||
| - The frequency of `rejected connection` jumps by an order of magnitude after a control-plane event (certificate rotation, member restart, network flap) and does not recede. | ||
| - The error field in the log line is not `EOF` — values like `tls: bad certificate`, `remote error: tls: handshake failure`, or `x509: certificate has expired` indicate a real TLS fault and need the corresponding certificate or trust-bundle fix. | ||
|
|
||
| These signatures point at a different root cause and should be triaged independently; do not attribute them to the probe-noise pattern this note describes. | ||
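The last signature in the list above can be checked mechanically. The sketch below assumes etcd writes one JSON object per log line with the `msg` and `error` fields shown in the Overview; `non_eof_rejections` is a hypothetical helper name, not part of any tool.

```bash
# Hypothetical helper: read etcd JSON logs on stdin and print only the
# "rejected connection" entries whose error is something other than EOF.
# Assumes one JSON object per line, formatted as in the Overview sample.
non_eof_rejections() {
  grep '"msg":"rejected connection"' | grep -v '"error":"EOF"'
}

# Intended use against a live cluster (not executed here):
#   kubectl -n kube-system logs --since=1h "$POD" | non_eof_rejections
```

Any line this prints carries a real TLS error string and should be triaged as a certificate or trust-bundle problem rather than filtered.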
|
|
||
| ## Diagnostic Steps | ||
|
|
||
| Count the warnings over a bounded window to distinguish baseline probe noise from a real burst: | ||
|
|
||
| ```bash | ||
| POD=$(kubectl -n kube-system get pod -l component=etcd \ | ||
| -o jsonpath='{.items[0].metadata.name}') | ||
| kubectl -n kube-system logs --since=10m "$POD" \ | ||
| | grep -c '"msg":"rejected connection"' | ||
| ``` | ||
|
|
||
| On a steady-state cluster the count is proportional to the number of probers times the number of etcd members; tens to low hundreds of entries over a ten-minute window is typical. A count an order of magnitude or more above that baseline, or a sudden change between consecutive windows, is the signal to investigate further. | ||
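To compare consecutive windows without rerunning the count by hand, the same log stream can be bucketed by minute. This is a sketch under the assumption that each entry is a single JSON line with an RFC 3339 `ts` field, as in the Overview sample; `rejections_per_minute` is a hypothetical helper name.

```bash
# Hypothetical helper: read etcd JSON logs on stdin and print
# "<minute> <count>" for every minute that has rejected-connection warnings.
rejections_per_minute() {
  grep '"msg":"rejected connection"' \
    | sed -n 's/.*"ts":"\([0-9-]*T[0-9][0-9]:[0-9][0-9]\).*/\1/p' \
    | sort | uniq -c \
    | awk '{print $2, $1}'
}

# Intended use against a live cluster (not executed here):
#   kubectl -n kube-system logs --since=1h "$POD" | rejections_per_minute
```

A flat series is baseline probe traffic. A step change that lines up with a certificate rotation or member restart and does not recede is the pattern worth investigating.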
|
|
||
| Identify which peers are probing. The `remote-addr` field in the warning carries the source IP: | ||
|
|
||
| ```bash | ||
| kubectl -n kube-system logs --since=10m "$POD" \ | ||
| | grep '"msg":"rejected connection"' \ | ||
| | sed -n 's/.*"remote-addr":"\([^"]*\)".*/\1/p' \ | ||
| | awk -F: '{print $1}' | sort | uniq -c | sort -rn | ||
| ``` | ||
|
|
||
| Each IP should resolve to a control-plane node or a known monitoring endpoint: | ||
|
|
||
| ```bash | ||
| kubectl get node -o wide | awk '{print $1, $6}' | ||
| ``` | ||
|
|
||
| If an IP cannot be identified, it may be an external scanner that happens to reach the etcd serving port — tighten the network policy guarding the control-plane nodes, since etcd's serving port should never be exposed outside the cluster's management network. | ||
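The comparison between probing IPs and known addresses can be scripted. The sketch below is one way to do it, assuming a plain-text file with one known IP per line; the helper name `unknown_sources` and the file name `known_ips.txt` are illustrative only.

```bash
# Hypothetical helper: read etcd JSON logs on stdin and print every source IP
# from "remote-addr" that is NOT listed in the file given as $1
# (one known IP per line). IPv4 addresses only.
unknown_sources() {
  sed -n 's/.*"remote-addr":"\([^"]*\)".*/\1/p' \
    | awk -F: '{print $1}' \
    | sort -u \
    | grep -vxFf "$1"
}

# Intended use against a live cluster (not executed here):
#   kubectl get node -o wide | awk 'NR>1 {print $6}' > known_ips.txt
#   kubectl -n kube-system logs --since=10m "$POD" \
#     | grep '"msg":"rejected connection"' | unknown_sources known_ips.txt
```

Add the addresses of known monitoring endpoints to the file before drawing conclusions, since they will not appear in the node list.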
Overstated source attribution for the probe pattern
This section is too definitive and likely inaccurate as written.
`kube-controller-manager` is not typically a direct etcd TCP/TLS prober in standard control-plane setups, and `kube-apiserver` etcd interactions are generally request-level rather than “handshake-only then close.” Please soften this to “possible sources” unless you have packet/process evidence.