Skip to content

[superlog] Fix 65s basket hang by resetting connected state on Redpanda send failure#469

Open
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/fix-basket-kafka-send-connected-reset
Open

[superlog] Fix 65s basket hang by resetting connected state on Redpanda send failure#469
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/fix-basket-kafka-send-connected-reset

Conversation

@superlog-app

@superlog-app superlog-app Bot commented Jun 11, 2026

Copy link
Copy Markdown

Summary

When Redpanda becomes unavailable, the basket analytics ingestion service was blocking every /batch, /vitals, and /track request for ~65 seconds before falling back to ClickHouse. This affected all customer websites using the analytics SDK.

The root cause is in apps/basket/src/lib/producer.ts. After a KafkaSendError, the producer state update only set connectionFailed: true but left connected: true and lastRetry: 0. The connect() helper short-circuits to return true when connected is set, so every subsequent concurrent request bypassed the reconnect cooldown and attempted another kafka.send(). With idempotent: true on the KafkaJS producer, these sends queue behind an internal lock. KafkaJS's lock timeout is hardcoded at ~65,536 ms, so 200–1700+ queued requests each waited the full lock timeout before receiving a KafkaJSLockTimeout and falling back to ClickHouse.

The KafkaConnectionError path already correctly set connected: false and lastRetry: Date.now() — this fix mirrors that same pattern in the KafkaSendError path. After the first failed send, all subsequent requests immediately skip Kafka and buffer to ClickHouse without any lock contention.

Alternatively, disabling idempotent: true on the KafkaJS producer would also eliminate the lock mechanism entirely (analytics is tolerant of rare duplicates), which could be a simpler long-term fix.

Incident on Superlog


Was this PR helpful? Leave feedback — goes straight to the Superlog team.


Summary by cubic

Fixes 65s request hangs in basket analytics when Redpanda is down by resetting the producer’s connected state after a send failure. Requests now skip Kafka immediately and buffer to ClickHouse.

  • Bug Fixes
    • On KafkaSendError, set connected: false and lastRetry = Date.now() (mirrors KafkaConnectionError).
    • Stops connect() short-circuit and KafkaJS idempotent send lock buildup, removing the ~65s wait.
    • Impact: /batch, /vitals, and /track no longer block; events buffer to ClickHouse until Redpanda recovers.

Written for commit 01906e6. Summary will update on new commits.

Review in cubic

@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
databuddy-status Ready Ready Preview, Comment Jun 11, 2026 10:14pm
2 Skipped Deployments
Project Deployment Actions Updated (UTC)
dashboard Skipped Skipped Jun 11, 2026 10:14pm
documentation Skipped Skipped Jun 11, 2026 10:14pm

@vercel vercel Bot temporarily deployed to Preview – dashboard June 11, 2026 22:13 Inactive
@vercel vercel Bot temporarily deployed to Preview – documentation June 11, 2026 22:13 Inactive
@unkey-deploy

unkey-deploy Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Unkey Deploy

Name Status Preview Inspect Updated (UTC)
api (preview) Ready Visit Preview Inspect Jun 11, 2026 10:14pm

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a production hang where Redpanda send failures caused all subsequent analytics ingestion requests to block for ~65 seconds before falling back to ClickHouse. The root cause was that KafkaSendError left connected: true in producer state, causing connect() to short-circuit and send requests to queue behind KafkaJS's internal idempotent-producer lock (timeout ~65 536 ms).

  • The fix adds connected: false and lastRetry: Date.now() to the KafkaSendError state update, exactly mirroring the existing KafkaConnectionError handling so all subsequent requests immediately respect the 60-second reconnect cooldown and fall through to ClickHouse.
  • One side-effect: shutDown now skips kafka.disconnect() when connected is false after a send failure, since it gates the disconnect call on post.connected. The underlying KafkaJS producer will remain open until process exit in that scenario.

Confidence Score: 4/5

The change is safe to merge — the two added lines directly fix the documented production hang and introduce no new failure modes on the hot path.

The core fix is correct and well-scoped: connected: false + lastRetry: Date.now() in the KafkaSendError handler breaks the tight loop that caused lock-queue buildup. The only trade-off introduced is that shutDown now skips kafka.disconnect() when a send failure preceded graceful shutdown, because the disconnect gate reads post.connected. This is a narrow, low-impact path (process exit cleans up OS connections anyway), but it is a real behavioral change worth revisiting if the shutdown sequence ever needs to be more precise.

apps/basket/src/lib/producer.ts — specifically the shutDown function and its post.connected guard.

Important Files Changed

Filename Overview
apps/basket/src/lib/producer.ts Adds connected: false and lastRetry: Date.now() to the KafkaSendError handler, mirroring the KafkaConnectionError path and preventing the 65s lock-timeout hang. Minor regression: shutDown now skips kafka.disconnect() when a send error precedes graceful shutdown.

Sequence Diagram

sequenceDiagram
    participant R as Request
    participant SV as sendViaKafka
    participant C as connect()
    participant K as KafkaJS Producer
    participant CH as ClickHouse Buffer

    Note over R,CH: After fix — KafkaSendError path
    R->>SV: sendViaKafka(topic, messages)
    SV->>C: connect()
    C->>C: "s.connected == false, check cooldown"
    C-->>SV: return false (within 60s cooldown)
    SV->>CH: bufferAll(fallbackEvents)

    Note over R,CH: Before fix — KafkaSendError path (hang)
    R->>SV: sendViaKafka(topic, messages)
    SV->>C: connect()
    C->>C: "s.connected == true, return true immediately"
    SV->>K: kafka.send() queues behind idempotent lock
    K-->>SV: KafkaJSLockTimeout after ~65s
    SV->>CH: bufferAll(fallbackEvents)
Loading

Comments Outside Diff (1)

  1. apps/basket/src/lib/producer.ts, line 423-437 (link)

    P2 kafka.disconnect() silently skipped after send failure

    shutDown reads post.connected to decide whether to disconnect the underlying KafkaJS producer. Before this PR, a KafkaSendError left connected: true, so shutdown correctly called kafka.disconnect(). Now that the error handler sets connected: false, a graceful shutdown triggered while Redpanda is down (or during the 60-second cooldown window) will skip the disconnect entirely, leaving the KafkaJS producer's transport-level connection open until the process exits. The fix for the hang is correct, but the shutdown path should also account for whether the KafkaJS producer was ever successfully connected — a kafkaEverConnected flag or checking connectionFailed in addition to connected — so the disconnect call isn't gated solely on the application-level connected state flag.

Reviews (1): Last reviewed commit: "[superlog] Fix 65s basket hang by resett..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants