[superlog] Fix 65s basket hang by resetting connected state on Redpanda send failure by superlog-app[bot] · Pull Request #469 · databuddy-analytics/Databuddy

superlog-app · 2026-06-11T22:13:25Z

Summary

When Redpanda becomes unavailable, the basket analytics ingestion service was blocking every /batch, /vitals, and /track request for ~65 seconds before falling back to ClickHouse. This affected all customer websites using the analytics SDK.

The root cause is in apps/basket/src/lib/producer.ts. After a KafkaSendError, the producer state update only set connectionFailed: true but left connected: true and lastRetry: 0. The connect() helper short-circuits to return true when connected is set, so every subsequent concurrent request bypassed the reconnect cooldown and attempted another kafka.send(). With idempotent: true on the KafkaJS producer, these sends queue behind an internal lock. KafkaJS's lock timeout is hardcoded at ~65,536 ms, so 200–1700+ queued requests each waited the full lock timeout before receiving a KafkaJSLockTimeout and falling back to ClickHouse.
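The short-circuit can be illustrated with a minimal TypeScript sketch. Only the field names `connected`, `connectionFailed`, and `lastRetry` come from the description above; the `ProducerState` type, the `canAttemptSend` helper, and the cooldown constant are assumptions for illustration and not the actual code in producer.ts.

```typescript
// Hypothetical model of the producer state; only the field names
// (connected, connectionFailed, lastRetry) come from the PR description.
interface ProducerState {
  connected: boolean;
  connectionFailed: boolean;
  lastRetry: number; // epoch milliseconds of the last reconnect attempt
}

// Assumed cooldown length for this sketch.
const RECONNECT_COOLDOWN_MS = 60 * 1000;

// Simplified stand-in for the gate inside connect(): returns true when a
// Kafka send would be attempted, false when the request should fall back.
function canAttemptSend(state: ProducerState, nowMs: number): boolean {
  if (state.connected) {
    return true; // short-circuit: the cooldown below is never consulted
  }
  if (state.connectionFailed && nowMs - state.lastRetry < RECONNECT_COOLDOWN_MS) {
    return false; // still cooling down, skip Kafka
  }
  return true; // cooldown elapsed, a reconnect would be tried here
}

const failureTimeMs = 1000000;

// State left behind by the old KafkaSendError handler.
const afterOldSendError: ProducerState = {
  connected: true,
  connectionFailed: true,
  lastRetry: 0,
};

// State left behind by the KafkaConnectionError handler.
const afterConnectionError: ProducerState = {
  connected: false,
  connectionFailed: true,
  lastRetry: failureTimeMs,
};

// A request arriving 500 ms after the failure.
const oldPathAttempts = canAttemptSend(afterOldSendError, failureTimeMs + 500);
const connectionPathAttempts = canAttemptSend(afterConnectionError, failureTimeMs + 500);
console.log({ oldPathAttempts, connectionPathAttempts });
```

In this model the old send-error state still passes the gate, so every concurrent request would go on to call kafka.send(), while the connection-error state is turned away during the cooldown.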

The KafkaConnectionError path already correctly set connected: false and lastRetry: Date.now() — this fix mirrors that same pattern in the KafkaSendError path. After the first failed send, all subsequent requests immediately skip Kafka and buffer to ClickHouse without any lock contention.
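As a rough sketch of the pattern, the two error handlers can be thought of as producing the same state update. The helper names `onConnectionError`, `onSendErrorBefore`, and `onSendErrorAfter` and the `ProducerState` type are hypothetical; the real handlers in producer.ts may be structured differently.

```typescript
// Hypothetical state shape; field names are taken from the PR description.
interface ProducerState {
  connected: boolean;
  connectionFailed: boolean;
  lastRetry: number; // epoch milliseconds
}

// State update described for the KafkaConnectionError path.
function onConnectionError(state: ProducerState, nowMs: number): ProducerState {
  return { ...state, connected: false, connectionFailed: true, lastRetry: nowMs };
}

// KafkaSendError path before the fix: only connectionFailed was set.
function onSendErrorBefore(state: ProducerState): ProducerState {
  return { ...state, connectionFailed: true };
}

// KafkaSendError path after the fix: same update as the connection error.
function onSendErrorAfter(state: ProducerState, nowMs: number): ProducerState {
  return { ...state, connected: false, connectionFailed: true, lastRetry: nowMs };
}

const healthy: ProducerState = { connected: true, connectionFailed: false, lastRetry: 0 };
const failedAtMs = 1234; // fixed timestamp so the example is deterministic

const stateBeforeFix = onSendErrorBefore(healthy);
const stateAfterFix = onSendErrorAfter(healthy, failedAtMs);
const stateConnectionError = onConnectionError(healthy, failedAtMs);

console.log(stateBeforeFix, stateAfterFix, stateConnectionError);
```

In production code the timestamp would come from `Date.now()`, as the description states; a fixed value is used here only to keep the sketch reproducible.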

Alternatively, disabling idempotent: true on the KafkaJS producer would also eliminate the lock mechanism entirely (analytics is tolerant of rare duplicates), which could be a simpler long-term fix.

Incident on Superlog

Summary by cubic

Fixes 65s request hangs in basket analytics when Redpanda is down by resetting the producer’s connected state after a send failure. Requests now skip Kafka immediately and buffer to ClickHouse.

Bug Fixes
- On KafkaSendError, set connected: false and lastRetry: Date.now() (mirrors KafkaConnectionError).
- Stops connect() short-circuit and KafkaJS idempotent send lock buildup, removing the ~65s wait.
- Impact: /batch, /vitals, and /track no longer block; events buffer to ClickHouse until Redpanda recovers.

Written for commit 01906e6. Summary will update on new commits.

vercel · 2026-06-11T22:13:30Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
databuddy-status	Ready	Preview, Comment	Jun 11, 2026 10:14pm

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
dashboard	Skipped		Jun 11, 2026 10:14pm
documentation	Skipped		Jun 11, 2026 10:14pm

unkey-deploy · 2026-06-11T22:13:56Z

The latest updates on your projects. Learn more about Unkey Deploy

Name	Status	Preview	Inspect	Updated (UTC)
api (preview)	Ready	Visit Preview	Inspect	Jun 11, 2026 10:14pm

greptile-apps · 2026-06-11T22:16:00Z

Greptile Summary

This PR fixes a production hang where Redpanda send failures caused all subsequent analytics ingestion requests to block for ~65 seconds before falling back to ClickHouse. The root cause was that the KafkaSendError handler left connected: true in the producer state, so connect() short-circuited and subsequent sends queued behind KafkaJS's internal idempotent-producer lock (timeout ~65,536 ms).

The fix adds connected: false and lastRetry: Date.now() to the KafkaSendError state update, exactly mirroring the existing KafkaConnectionError handling so all subsequent requests immediately respect the 60-second reconnect cooldown and fall through to ClickHouse.
One side-effect: shutDown now skips kafka.disconnect() when connected is false after a send failure, since it gates the disconnect call on post.connected. The underlying KafkaJS producer will remain open until process exit in that scenario.

Confidence Score: 4/5

The change is safe to merge — the two added lines directly fix the documented production hang and introduce no new failure modes on the hot path.

The core fix is correct and well-scoped: connected: false + lastRetry: Date.now() in the KafkaSendError handler breaks the tight loop that caused lock-queue buildup. The only trade-off introduced is that shutDown now skips kafka.disconnect() when a send failure preceded graceful shutdown, because the disconnect gate reads post.connected. This is a narrow, low-impact path (process exit cleans up OS connections anyway), but it is a real behavioral change worth revisiting if the shutdown sequence ever needs to be more precise.

apps/basket/src/lib/producer.ts — specifically the shutDown function and its post.connected guard.

Important Files Changed

Filename	Overview
apps/basket/src/lib/producer.ts	Adds connected: false and lastRetry: Date.now() to the KafkaSendError handler, mirroring the KafkaConnectionError path and preventing the 65s lock-timeout hang. Minor regression: shutDown now skips kafka.disconnect() when a send error precedes graceful shutdown.

Sequence Diagram

sequenceDiagram
    participant R as Request
    participant SV as sendViaKafka
    participant C as connect()
    participant K as KafkaJS Producer
    participant CH as ClickHouse Buffer

    Note over R,CH: After fix — KafkaSendError path
    R->>SV: sendViaKafka(topic, messages)
    SV->>C: connect()
    C->>C: "s.connected == false, check cooldown"
    C-->>SV: return false (within 60s cooldown)
    SV->>CH: bufferAll(fallbackEvents)

    Note over R,CH: Before fix — KafkaSendError path (hang)
    R->>SV: sendViaKafka(topic, messages)
    SV->>C: connect()
    C->>C: "s.connected == true, return true immediately"
    SV->>K: kafka.send() queues behind idempotent lock
    K-->>SV: KafkaJSLockTimeout after ~65s
    SV->>CH: bufferAll(fallbackEvents)
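
The two flows in the diagram can be walked through with a simplified, synchronous model. The names `sendViaKafka`, `connect`, and `bufferAll` come from the diagram; the `Deps` interface, the `AnalyticsEvent` type, the return values, the topic name, and the synchronous signatures are assumptions made to keep the sketch small (the real functions are presumably asynchronous).

```typescript
// Hypothetical event shape for the fallback buffer.
type AnalyticsEvent = { topic: string; payload: string };

// Injected dependencies so the flow can be exercised without a broker.
interface Deps {
  connect: () => boolean; // false while inside the reconnect cooldown
  send: (topic: string, messages: string[]) => void; // throws on failure
  bufferAll: (events: AnalyticsEvent[]) => void; // fallback toward ClickHouse
}

function sendViaKafka(topic: string, messages: string[], deps: Deps): "kafka" | "buffered" {
  const fallbackEvents = messages.map((payload) => ({ topic, payload }));

  if (!deps.connect()) {
    // After-fix path: connect() reports the cooldown, no send is attempted.
    deps.bufferAll(fallbackEvents);
    return "buffered";
  }

  try {
    deps.send(topic, messages);
    return "kafka";
  } catch {
    // Before-fix path: reached only once the send itself fails,
    // which in the incident meant waiting for the lock timeout.
    deps.bufferAll(fallbackEvents);
    return "buffered";
  }
}

const bufferedEvents: AnalyticsEvent[] = [];
let sendCalls = 0;

// Dependencies as they would behave during the cooldown window.
const duringCooldown: Deps = {
  connect: () => false,
  send: () => {
    sendCalls += 1;
  },
  bufferAll: (events) => {
    bufferedEvents.push(...events);
  },
};

// "analytics-events" is a made-up topic name.
const route = sendViaKafka("analytics-events", ["a", "b"], duringCooldown);
console.log(route, sendCalls, bufferedEvents.length);
```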

Comments Outside Diff (1)

apps/basket/src/lib/producer.ts, lines 423-437

kafka.disconnect() silently skipped after send failure

shutDown reads post.connected to decide whether to disconnect the underlying KafkaJS producer. Before this PR, a KafkaSendError left connected: true, so shutdown correctly called kafka.disconnect(). Now that the error handler sets connected: false, a graceful shutdown triggered while Redpanda is down (or during the 60-second cooldown window) will skip the disconnect entirely, leaving the KafkaJS producer's transport-level connection open until the process exits. The fix for the hang is correct, but the shutdown path should also account for whether the KafkaJS producer was ever successfully connected — a kafkaEverConnected flag or checking connectionFailed in addition to connected — so the disconnect call isn't gated solely on the application-level connected state flag.
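
One possible shape for that suggestion, sketched under assumptions: the `kafkaEverConnected` flag name comes from the comment above, while `ShutdownState`, `shouldDisconnect`, and `shutDownSketch` are hypothetical names, and the real `shutDown` function may look quite different.

```typescript
// Hypothetical subset of producer state relevant to shutdown.
interface ShutdownState {
  connected: boolean; // application-level flag, false after a send failure
  kafkaEverConnected: boolean; // set on the first successful connect, never reset
}

// Disconnect if the producer is connected now or was ever connected,
// so a send failure before shutdown does not skip the cleanup.
function shouldDisconnect(state: ShutdownState): boolean {
  return state.connected || state.kafkaEverConnected;
}

// Returns true when disconnect() was invoked.
function shutDownSketch(state: ShutdownState, disconnect: () => void): boolean {
  if (!shouldDisconnect(state)) {
    return false;
  }
  disconnect();
  return true;
}

let disconnectCalls = 0;
const countDisconnect = (): void => {
  disconnectCalls += 1;
};

// Shutdown while Redpanda is down, after an earlier successful connection.
const afterSendFailure: ShutdownState = { connected: false, kafkaEverConnected: true };
// Shutdown when the producer never managed to connect at all.
const neverConnected: ShutdownState = { connected: false, kafkaEverConnected: false };

const disconnectedAfterFailure = shutDownSketch(afterSendFailure, countDisconnect);
const disconnectedNeverConnected = shutDownSketch(neverConnected, countDisconnect);
console.log({ disconnectedAfterFailure, disconnectedNeverConnected, disconnectCalls });
```

Whether calling disconnect on a producer whose broker is unreachable completes promptly is not covered by this sketch and would need to be checked against KafkaJS behavior.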

Reviews (1): Last reviewed commit: "[superlog] Fix 65s basket hang by resett..."

Commit 01906e6: [superlog] Fix 65s basket hang by resetting connected state on Redpanda send failure

vercel Bot temporarily deployed to Preview – dashboard June 11, 2026 22:13 Inactive

vercel Bot temporarily deployed to Preview – documentation June 11, 2026 22:13 Inactive

unkey-deploy Bot deployed to links - preview June 11, 2026 22:13 View deployment

vercel Bot deployed to Preview – databuddy-status June 11, 2026 22:14 View deployment

unkey-deploy Bot deployed to api - preview June 11, 2026 22:14 View deployment

superlog-app[bot] wants to merge 1 commit into staging from superlog/fix-basket-kafka-send-connected-reset
