Skip to content

[superlog] Fix KafkaSendError leaving connected=true, causing 65s lock-timeout spiral#467

Open
superlog-app[bot] wants to merge 1 commit into
mainfrom
superlog/fix-kafka-send-connected-state-reset
Open

[superlog] Fix KafkaSendError leaving connected=true, causing 65s lock-timeout spiral#467
superlog-app[bot] wants to merge 1 commit into
mainfrom
superlog/fix-kafka-send-connected-state-reset

Conversation

@superlog-app

@superlog-app superlog-app Bot commented Jun 11, 2026

Copy link
Copy Markdown

Summary

When the Redpanda (Kafka) broker rejects a produce request, the basket analytics ingestion service (/batch, /vitals) was blocking every subsequent request for ~65 seconds before falling back to ClickHouse. During the incident window, the KafkaJS internal producer lock queue grew from 180 to 690+ waiting requests, each blocking for the full lock-timeout period.

Root cause

In sendViaKafka, the KafkaSendError catch handler set connectionFailed: true but left connected: true and never updated lastRetry. The connect() effect short-circuits on s.connected === true — it never reaches the reconnection-cooldown check. So every subsequent request called kafka.send(), queued on the KafkaJS internal lock, and waited up to 65 seconds before timing out.

Fix

Set connected: false and lastRetry: Date.now() in the KafkaSendError state update. After a send failure, the next call to connect() will see connected === false, hit the cooldown guard (connectionFailed && Date.now() - lastRetry < reconnectCooldown), and immediately return false — routing all subsequent requests to the ClickHouse fallback without touching the Kafka lock.

Alternative approaches considered: (1) adding a circuit-breaker that counts lock-timeout errors and disables Kafka above a threshold — viable but more complex than correcting the state machine; (2) setting maxInFlightRequests: 1 on the idempotent producer — reduces lock contention under load but doesn't fix the root state-machine bug.

Incident on Superlog


Was this PR helpful? Leave feedback — goes straight to the Superlog team.


Summary by cubic

Fixes a state bug where send failures left connected=true, causing ~65s lock-timeouts and request pileups. We now mark the producer disconnected and trigger cooldown, so traffic immediately falls back to ClickHouse during broker rejects.

  • Bug Fixes
    • On KafkaSendError, set connected: false and lastRetry: Date.now().
    • connect() now hits the cooldown path after a failed send, avoiding producer lock buildup and timeouts on /batch and /vitals.

Written for commit 1275541. Summary will update on new commits.

Review in cubic

@vercel vercel Bot temporarily deployed to Preview – documentation June 11, 2026 14:09 Inactive
@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
databuddy-status Ready Ready Preview, Comment Jun 11, 2026 2:10pm
2 Skipped Deployments
Project Deployment Actions Updated (UTC)
dashboard Skipped Skipped Jun 11, 2026 2:10pm
documentation Skipped Skipped Jun 11, 2026 2:10pm

@unkey-deploy

unkey-deploy Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Unkey Deploy

Name Status Preview Inspect Updated (UTC)
api (preview) Ready Visit Preview Inspect Jun 11, 2026 2:10pm

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a state-machine bug in apps/basket/src/lib/producer.ts where a KafkaSendError left connected: true in ProducerState, causing every subsequent request to queue on the KafkaJS internal lock and wait up to 65 seconds before falling back to ClickHouse. The fix correctly sets connected: false and lastRetry: Date.now() on send failure, so the connect() cooldown guard immediately routes all subsequent requests to the ClickHouse buffer.

  • The hot-path fix is correct and well-targeted: the connect() short-circuit on s.connected === true is now properly bypassed, and the connectionFailed + cooldown guard takes over as intended.
  • A regression is introduced in the shutdown path: shutDown gates kafka.disconnect() on post.connected, which is now false after any send failure. If the service shuts down while in the failed state, the underlying KafkaJS TCP socket is not explicitly closed.

Confidence Score: 3/5

The send-path fix is correct and directly addresses the incident, but the shutdown path now has a gap: if the service exits while Kafka is in the failed state, the underlying TCP connection is never closed.

The shutDown function checks post.connected before calling kafka.disconnect(). Because this PR now sets connected: false on a send failure, any graceful shutdown that occurs after a KafkaSendError will silently skip the disconnect call, leaving an open socket behind. The fix is correct and necessary for the hot-path, but it introduces this secondary gap in the shutdown code path that should be addressed before merging to production.

apps/basket/src/lib/producer.ts — specifically the shutDown effect around line 419, where the kafka.disconnect() guard depends on post.connected.

Important Files Changed

Filename Overview
apps/basket/src/lib/producer.ts State machine fix for KafkaSendError correctly sets connected=false and lastRetry=now, preventing the 65s lock-timeout spiral. However, the shutDown function guards kafka.disconnect() on post.connected, which is now false after a send failure — the underlying KafkaJS TCP connection is left open at process shutdown.

Sequence Diagram

sequenceDiagram
    participant Req as Incoming Request
    participant SvK as sendViaKafka()
    participant Conn as connect()
    participant Ref as ProducerState (Ref)
    participant KJS as KafkaJS Producer
    participant CH as ClickHouse Buffer

    Note over Req,CH: After fix — KafkaSendError path

    Req->>SvK: sendViaKafka(topic, messages)
    SvK->>Conn: connect()
    Conn->>Ref: get state
    Ref-->>Conn: "{connected: false, connectionFailed: true, lastRetry: T}"
    Note over Conn: connectionFailed && now - T < 60s → cooldown
    Conn-->>SvK: false (no lock acquired)
    SvK->>CH: bufferAll(fallbackEvents)

    Note over Req,CH: First failure that triggers state change

    Req->>SvK: sendViaKafka(topic, messages)
    SvK->>Conn: connect()
    Conn->>Ref: get state
    Ref-->>Conn: "{connected: true, connectionFailed: false}"
    Conn-->>SvK: true
    SvK->>KJS: kafka.send(...)
    KJS-->>SvK: KafkaSendError
    SvK->>Ref: "update {connectionFailed: true, connected: false, lastRetry: now}"
    SvK->>CH: bufferAll(fallbackEvents)
Loading

Comments Outside Diff (1)

  1. apps/basket/src/lib/producer.ts, line 419-438 (link)

    P1 Shutdown skips kafka.disconnect() after a send failure

    shutDown gates the kafka.disconnect() call on post.connected. Before this fix, connected remained true after a KafkaSendError, so disconnect always ran on shutdown. After this fix, connected is false after any send error — meaning if the service receives a graceful-shutdown signal while Kafka is in the failed state, the underlying KafkaJS TCP connection is never explicitly closed. The same connected flag is now serving two different roles: "should we attempt sends" (fixed correctly) and "does the transport need cleanup" (broken by this change). A separate tcpConnected boolean — set true on a successful kafka.connect() and only cleared to false in the disconnect() Effect — would decouple the two concerns and let both paths work correctly.

Reviews (1): Last reviewed commit: "[superlog] Fix KafkaSendError leaving co..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants