You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When Redpanda becomes unavailable, the basket analytics ingestion service was blocking every /batch, /vitals, and /track request for ~65 seconds before falling back to ClickHouse. This affected all customer websites using the analytics SDK.
The root cause is in apps/basket/src/lib/producer.ts. After a KafkaSendError, the producer state update only set connectionFailed: true but left connected: true and lastRetry: 0. The connect() helper short-circuits to return true when connected is set, so every subsequent concurrent request bypassed the reconnect cooldown and attempted another kafka.send(). With idempotent: true on the KafkaJS producer, these sends queue behind an internal lock. KafkaJS's lock timeout is hardcoded at ~65,536 ms, so 200–1700+ queued requests each waited the full lock timeout before receiving a KafkaJSLockTimeout and falling back to ClickHouse.
The KafkaConnectionError path already correctly set connected: false and lastRetry: Date.now() — this fix mirrors that same pattern in the KafkaSendError path. After the first failed send, all subsequent requests immediately skip Kafka and buffer to ClickHouse without any lock contention.
Alternatively, disabling idempotent: true on the KafkaJS producer would also eliminate the lock mechanism entirely (analytics is tolerant of rare duplicates), which could be a simpler long-term fix.
Was this PR helpful? Leave feedback — goes straight to the Superlog team.
Summary by cubic
Fixes 65s request hangs in basket analytics when Redpanda is down by resetting the producer’s connected state after a send failure. Requests now skip Kafka immediately and buffer to ClickHouse.
Bug Fixes
On KafkaSendError, set connected: false and lastRetry = Date.now() (mirrors KafkaConnectionError).
Stops connect() short-circuit and KafkaJS idempotent send lock buildup, removing the ~65s wait.
Impact: /batch, /vitals, and /track no longer block; events buffer to ClickHouse until Redpanda recovers.
Written for commit 01906e6. Summary will update on new commits.
This PR fixes a production hang where Redpanda send failures caused all subsequent analytics ingestion requests to block for ~65 seconds before falling back to ClickHouse. The root cause was that KafkaSendError left connected: true in producer state, causing connect() to short-circuit and send requests to queue behind KafkaJS's internal idempotent-producer lock (timeout ~65 536 ms).
The fix adds connected: false and lastRetry: Date.now() to the KafkaSendError state update, exactly mirroring the existing KafkaConnectionError handling so all subsequent requests immediately respect the 60-second reconnect cooldown and fall through to ClickHouse.
One side-effect: shutDown now skips kafka.disconnect() when connected is false after a send failure, since it gates the disconnect call on post.connected. The underlying KafkaJS producer will remain open until process exit in that scenario.
Confidence Score: 4/5
The change is safe to merge — the two added lines directly fix the documented production hang and introduce no new failure modes on the hot path.
The core fix is correct and well-scoped: connected: false + lastRetry: Date.now() in the KafkaSendError handler breaks the tight loop that caused lock-queue buildup. The only trade-off introduced is that shutDown now skips kafka.disconnect() when a send failure preceded graceful shutdown, because the disconnect gate reads post.connected. This is a narrow, low-impact path (process exit cleans up OS connections anyway), but it is a real behavioral change worth revisiting if the shutdown sequence ever needs to be more precise.
apps/basket/src/lib/producer.ts — specifically the shutDown function and its post.connected guard.
Important Files Changed
Filename
Overview
apps/basket/src/lib/producer.ts
Adds connected: false and lastRetry: Date.now() to the KafkaSendError handler, mirroring the KafkaConnectionError path and preventing the 65s lock-timeout hang. Minor regression: shutDown now skips kafka.disconnect() when a send error precedes graceful shutdown.
Sequence Diagram
sequenceDiagram
participant R as Request
participant SV as sendViaKafka
participant C as connect()
participant K as KafkaJS Producer
participant CH as ClickHouse Buffer
Note over R,CH: After fix — KafkaSendError path
R->>SV: sendViaKafka(topic, messages)
SV->>C: connect()
C->>C: "s.connected == false, check cooldown"
C-->>SV: return false (within 60s cooldown)
SV->>CH: bufferAll(fallbackEvents)
Note over R,CH: Before fix — KafkaSendError path (hang)
R->>SV: sendViaKafka(topic, messages)
SV->>C: connect()
C->>C: "s.connected == true, return true immediately"
SV->>K: kafka.send() queues behind idempotent lock
K-->>SV: KafkaJSLockTimeout after ~65s
SV->>CH: bufferAll(fallbackEvents)
Loading
Comments Outside Diff (1)
apps/basket/src/lib/producer.ts, line 423-437 (link)
kafka.disconnect() silently skipped after send failure
shutDown reads post.connected to decide whether to disconnect the underlying KafkaJS producer. Before this PR, a KafkaSendError left connected: true, so shutdown correctly called kafka.disconnect(). Now that the error handler sets connected: false, a graceful shutdown triggered while Redpanda is down (or during the 60-second cooldown window) will skip the disconnect entirely, leaving the KafkaJS producer's transport-level connection open until the process exits. The fix for the hang is correct, but the shutdown path should also account for whether the KafkaJS producer was ever successfully connected — a kafkaEverConnected flag or checking connectionFailed in addition to connected — so the disconnect call isn't gated solely on the application-level connected state flag.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When Redpanda becomes unavailable, the basket analytics ingestion service was blocking every
/batch,/vitals, and/trackrequest for ~65 seconds before falling back to ClickHouse. This affected all customer websites using the analytics SDK.The root cause is in
apps/basket/src/lib/producer.ts. After aKafkaSendError, the producer state update only setconnectionFailed: truebut leftconnected: trueandlastRetry: 0. Theconnect()helper short-circuits toreturn truewhenconnectedis set, so every subsequent concurrent request bypassed the reconnect cooldown and attempted anotherkafka.send(). Withidempotent: trueon the KafkaJS producer, these sends queue behind an internal lock. KafkaJS's lock timeout is hardcoded at ~65,536 ms, so 200–1700+ queued requests each waited the full lock timeout before receiving aKafkaJSLockTimeoutand falling back to ClickHouse.The
KafkaConnectionErrorpath already correctly setconnected: falseandlastRetry: Date.now()— this fix mirrors that same pattern in theKafkaSendErrorpath. After the first failed send, all subsequent requests immediately skip Kafka and buffer to ClickHouse without any lock contention.Alternatively, disabling
idempotent: trueon the KafkaJS producer would also eliminate the lock mechanism entirely (analytics is tolerant of rare duplicates), which could be a simpler long-term fix.Incident on Superlog
Was this PR helpful? Leave feedback — goes straight to the Superlog team.
Summary by cubic
Fixes 65s request hangs in basket analytics when Redpanda is down by resetting the producer’s connected state after a send failure. Requests now skip Kafka immediately and buffer to ClickHouse.
KafkaSendError, setconnected: falseandlastRetry = Date.now()(mirrorsKafkaConnectionError).connect()short-circuit and KafkaJS idempotent send lock buildup, removing the ~65s wait./batch,/vitals, and/trackno longer block; events buffer to ClickHouse until Redpanda recovers.Written for commit 01906e6. Summary will update on new commits.