Skip to content

fix(billing): retry refund correlation backfill 3x + dead-letter#3690

Merged
PierreBrisorgueil merged 7 commits into
masterfrom
fix/billing-refund-correlation-retry
May 22, 2026
Merged

fix(billing): retry refund correlation backfill 3x + dead-letter#3690
PierreBrisorgueil merged 7 commits into
masterfrom
fix/billing-refund-correlation-retry

Conversation

@PierreBrisorgueil
Copy link
Copy Markdown
Contributor

@PierreBrisorgueil PierreBrisorgueil commented May 21, 2026

Summary

  • Wraps PaymentIntent metadata backfill in checkout.session.completed handler with retryWithBackoff (3 attempts, 200ms + 400ms exp delays)
  • On exhaustion: log ERROR + write to new BillingFailedBackfill dead-letter collection (via minimal repository abstraction)
  • Inner try/catch around dead-letter write — never silent
  • RUNBOOK section 6 with mongo + Stripe CLI commands for manual reconciliation
  • Closes silent refund correlation loss path identified in audit P1

Context — audit P1 (2026-05-21)

Devkit audit flagged billing.webhook.service.js:314-330 as P1 : the metadata backfill failed silently with logger.warn + continue. Later refund webhook fell into "unresolved" path → customer credit never applied, runbook-only recovery. This PR adds defense in depth.

Strategy : Option B (new model + repository, no existing dead-letter mechanism for this concern). PR #3603's ProcessedStripeEvent dead-letter covers event-delivery failures (exhausted webhook retries), which is a different class.

Reviews completed before push :

  • Spec review : ✅ compliant + 1 cosmetic (RUNBOOK section ordering, fixed) + 1 followup filed as issue test(billing): use fake timers in checkout webhook unit test (slow regression) #3689 (slow-test pre-existing in billing.webhook.checkout.unit.tests.js:286)
  • Code-quality review : 3 IMPORTANT (JSDoc delay math, lazy mongoose.model layer break, sparse-index no-op) + 3 MINOR → fixes in c39b9c3f
  • DeepSeek pre-push gate : OK with nits — 2 low (err.stack dropped, no standalone retry tests) + 2 nits (failedAt triple-default, attempts<=0 unguarded) ; all deferrable

Test plan

  • NODE_ENV=devkit npm run lint — clean
  • NODE_ENV=test npm run test:unit -- billing.refund-correlation — 3/3 pass
  • NODE_ENV=test npm run test:unit -- billing.webhook — 1530/1530 pass
  • No silent .catch — inner DLQ catch logs at ERROR
  • CI green
  • CodeRabbit clean

Followup

Plan

docs/superpowers/plans/2026-05-21-devkit-audit-p0-p1-fixes.md Phase 2.

Summary by CodeRabbit

  • New Features

    • Improved billing reliability with automatic retry and dead-letter recording for failed refund correlation backfills.
  • Documentation

    • Added an operational runbook detailing triage and recovery steps for refund backfill failures and when to escalate.
  • Tests

    • Added unit tests covering retry behavior, dead-letter recording, and related error-handling scenarios.

Review Change Stack

PaymentIntent metadata backfill on checkout.session.completed now retries
3 times with exponential backoff (200ms, 400ms, 800ms) before escalating.
On exhaustion, a persistent BillingFailedBackfill dead-letter record is
written so operators can manually patch missing PI metadata and preserve
refund correlation.

Strategy chosen: Option B — new BillingFailedBackfill model (no prior
mechanism for this specific failure path). The existing ProcessedStripeEvent
dead-letter covers Stripe event delivery failures; this addresses a
different failure class (Stripe API write failure on a side-effect within
a successfully-delivered event handler).

BillingFailedBackfill model auto-registers via assets.js glob:
modules/*/models/*.mongoose.js — no billing.init.js change required.

Closes audit P1 (2026-05-21) — silent warn-log path on refund correlation.
- IMPORTANT: JSDoc delay sequence corrected (200 → 400, not 200 → 400 → 800
  for default attempts=3, baseMs=200 — only 2 delays fire)
- IMPORTANT: BillingFailedBackfill access moved to billing.failedBackfill.repository.js
  (Option B) — lazy mongoose.model() removed from service layer; test mocks updated
  to stub the repository boundary instead of mongoose.model()
- IMPORTANT: sparse index → partial index on resolvedAt (sparse no-op because field
  has default: null, all docs include it — partial targets only unresolved docs)
- MINOR: RUNBOOK section 6 symptom includes [billing.webhook] prefix
Copilot AI review requested due to automatic review settings May 21, 2026 20:20
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Caution

Review failed

Failed to post review comments

Walkthrough

Adds an exponential backoff retry helper, a Mongoose dead-letter model and repository for PaymentIntent metadata backfill failures, integrates retries+dead-letter writes into the billing webhook, adds unit tests for retry/failed-write scenarios, and documents runbook triage steps.

Changes

PaymentIntent Backfill Resilience

Layer / File(s) Summary
Exponential backoff retry utility
modules/billing/lib/billing.retry.js
retryWithBackoff(fn, { attempts = 3, baseMs = 200 }) with input validation, exponential backoff, returns on first success or throws last error after attempts.
Dead-letter data model and repository
modules/billing/models/billing.failedBackfill.model.mongoose.js, modules/billing/repositories/billing.failedBackfill.repository.js
BillingFailedBackfill Mongoose schema for unresolved backfill failures (identifiers, error, timestamps, resolution metadata) and record() repository to persist failures with defaulted fields.
PaymentIntent metadata backfill with retry and dead-letter
modules/billing/services/billing.webhook.service.js
Replaces direct stripe.paymentIntents.update with retryWithBackoff in checkout.session.completed; on final failure logs an error and persists a failed-backfill record, logging again if persistence fails.
Refund correlation backfill retry scenario tests
modules/billing/tests/billing.refund-correlation.unit.tests.js
Jest tests (fake timers) covering transient failures that succeed, persistent failures that record dead-letters and error logs, dead-letter write failures, and retryWithBackoff input guards.
Billing runbook for backfill failures
modules/billing/RUNBOOKS.md
Appends Runbook #7 describing failure symptom, triage steps (query unresolved, patch PaymentIntent metadata, mark resolved, replay dead-letter refunds), and an escalation rule.

Sequence Diagram

sequenceDiagram
  participant CheckoutHandler as handleCheckoutPaymentCompleted
  participant RetryHelper as retryWithBackoff
  participant StripeAPI as stripe.paymentIntents.update
  participant DeadLetterRepo as BillingFailedBackfillRepository
  participant Logger as logger

  CheckoutHandler->>RetryHelper: call retryWithBackoff(() => StripeAPI(update metadata))
  RetryHelper->>StripeAPI: paymentIntents.update(...) attempt
  alt succeeds within attempts
    StripeAPI-->>RetryHelper: success
    RetryHelper-->>CheckoutHandler: resolve
  else all attempts fail
    StripeAPI-->>RetryHelper: error
    RetryHelper-->>CheckoutHandler: reject (last error)
    CheckoutHandler->>Logger: error("backfill failed after retries", paymentIntentId)
    CheckoutHandler->>DeadLetterRepo: record({ paymentIntentId, stripeSessionId, error })
    alt dead-letter write fails
      DeadLetterRepo-->>CheckoutHandler: error
      CheckoutHandler->>Logger: error("failed to persist backfill record", paymentIntentId)
    end
    CheckoutHandler-->>CheckoutHandler: handler resolves (no throw)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • pierreb-devkit/Node#3601: Related billing webhook changes around PaymentIntent stripeSessionId backfill and refund-correlation handling.
  • pierreb-devkit/Node#3537: Related webhook edits around checkout payment handling and idempotent dispatch surrounding backfill logic.
  • pierreb-devkit/Node#3623: Related edits to handleCheckoutPaymentCompleted around backfill and retry/idempotency handling.

Suggested labels

Fix

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding retry logic with exponential backoff (3x) and dead-letter recording for refund correlation backfill, which is the core purpose of this PR.
Description check ✅ Passed The PR description covers all required template sections: summary of changes, context with audit details, scope (billing module, low risk), validation (tests passing, linting clean), guardrails (no secrets, tests added), and notes for reviewers addressing security and mergeability.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/billing-refund-correlation-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.70%. Comparing base (d397764) to head (da51b15).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3690      +/-   ##
==========================================
+ Coverage   89.66%   89.70%   +0.04%     
==========================================
  Files         140      142       +2     
  Lines        4759     4780      +21     
  Branches     1491     1498       +7     
==========================================
+ Hits         4267     4288      +21     
  Misses        385      385              
  Partials      107      107              
Flag Coverage Δ
integration 59.49% <8.33%> (-0.23%) ⬇️
unit 66.15% <100.00%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d397764...da51b15. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds retry + dead-letter handling for the Stripe PaymentIntent metadata backfill performed during checkout.session.completed, to prevent silent failures that later break refund correlation and require manual recovery.

Changes:

  • Wrap PaymentIntent metadata backfill in retryWithBackoff (3 attempts, exponential backoff) and escalate on exhaustion.
  • Persist exhausted backfill failures to a new BillingFailedBackfill dead-letter collection (repository + model) and document manual reconciliation steps in the billing runbook.
  • Add targeted unit tests covering retry success, retry exhaustion → dead-letter, and dead-letter write failure handling.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
modules/billing/services/billing.webhook.service.js Adds retry/backoff around PI metadata patch and records exhausted failures to a backfill DLQ with error logging.
modules/billing/lib/billing.retry.js Introduces a shared retryWithBackoff helper used by the webhook backfill.
modules/billing/repositories/billing.failedBackfill.repository.js Adds a minimal repository abstraction for recording backfill dead-letter entries.
modules/billing/models/billing.failedBackfill.model.mongoose.js Defines the BillingFailedBackfill Mongoose model and unresolved-only index for ops queries.
modules/billing/tests/billing.refund-correlation.unit.tests.js Adds unit coverage for retry behavior and DLQ escalation/robustness.
modules/billing/RUNBOOKS.md Adds a new runbook section for manual remediation of backfill failures.

Comment thread modules/billing/lib/billing.retry.js
Comment thread modules/billing/services/billing.webhook.service.js
Comment thread modules/billing/services/billing.webhook.service.js
Comment thread modules/billing/tests/billing.refund-correlation.unit.tests.js
@PierreBrisorgueil
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot]
coderabbitai Bot previously requested changes May 22, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/billing/lib/billing.retry.js`:
- Around line 17-30: The retryWithBackoff function should validate its options
before the loop: check that attempts is a finite positive integer (>=1) and
baseMs is a finite non-negative number; if not, throw a clear TypeError
describing the invalid option(s). Add this validation at the start of
retryWithBackoff (before using attempts or baseMs) so you never enter the loop
with attempts <= 0 or non-integer values and avoid throwing undefined via
lastErr; reference the retryWithBackoff function and the attempts/baseMs
parameters when implementing the checks.

In `@modules/billing/tests/billing.refund-correlation.unit.tests.js`:
- Around line 26-31: Add a JSDoc header above the standalone makeSession helper:
include a one-line description, a `@returns` tag describing the returned session
object shape (id, payment_status, payment_intent, metadata with organizationId,
packId, kind) and types (e.g., `@returns` {Object}), and if you prefer to be
explicit add an empty `@param` comment noting no parameters; attach this header
immediately above the makeSession function declaration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ed6dfd15-0206-4beb-806d-e58f3fd851ce

📥 Commits

Reviewing files that changed from the base of the PR and between da02947 and c39b9c3.

📒 Files selected for processing (6)
  • modules/billing/RUNBOOKS.md
  • modules/billing/lib/billing.retry.js
  • modules/billing/models/billing.failedBackfill.model.mongoose.js
  • modules/billing/repositories/billing.failedBackfill.repository.js
  • modules/billing/services/billing.webhook.service.js
  • modules/billing/tests/billing.refund-correlation.unit.tests.js

Comment thread modules/billing/lib/billing.retry.js
Comment thread modules/billing/tests/billing.refund-correlation.unit.tests.js
- guard attempts (positive integer) + baseMs (non-negative finite) with
  TypeError — prevents entering loop with attempts<=0 (would throw undefined)
- JSDoc header on makeSession test helper

Addresses CodeRabbit re-review on PR #3690.
@PierreBrisorgueil PierreBrisorgueil dismissed coderabbitai[bot]’s stale review May 22, 2026 07:05

Stale: this CHANGES_REQUESTED targets c39b9c3 (06:51:16Z). Both findings — (1) retryWithBackoff attempts/baseMs validation, (2) makeSession JSDoc — fixed in dbb9c25 (06:53:34Z, postdates this review) + 2 guard tests added, 1484/1484 unit tests green. CR's incremental engine did not auto-re-review the new commit. Dismissing per stale-review convention.

…ill logs

Fixes two logger.error calls (lines 332 and 344 of billing.webhook.service.js):
- rename sessionId key to stripeSessionId for consistency with the rest of the file
- add stack: err?.stack to both meta objects, matching surrounding logger.error style

Also updates billing.refund-correlation unit test header comment to reference
BillingFailedBackfillRepository.record instead of the stale BillingFailedBackfill.create.
…correlation-retry

# Conflicts:
#	modules/billing/RUNBOOKS.md
@PierreBrisorgueil PierreBrisorgueil merged commit 1b9f7a6 into master May 22, 2026
8 checks passed
@PierreBrisorgueil PierreBrisorgueil deleted the fix/billing-refund-correlation-retry branch May 22, 2026 09:19
PierreBrisorgueil added a commit that referenced this pull request May 22, 2026
…errors (#3697)

* fix(billing): short-circuit retryWithBackoff on non-transient Stripe errors

retryWithBackoff retried every error type, wasting ~600ms of backoff on
deterministic StripeInvalidRequestError before dead-lettering. Add an
optional shouldRetry(err) predicate (default: always retry) and pass a
non-transient classifier at the PI-backfill call site.

The err.stack finding from the issue is already satisfied by #3690
(both logger.error calls include stack) — no change needed.

Closes #3691

* test(billing): cover StripeInvalidRequestError class-name branch in retry short-circuit

The shouldRetry predicate guards both Stripe error type spellings
(StripeInvalidRequestError class name + invalid_request_error raw type,
which vary by stripe-node version). Add a parallel short-circuit test for
the class-name value and align the existing test's predicate with the
production one. Closes a coverage gap flagged by the pre-push gate.

* fix(billing): address reviewer nits — narrow comment + improve log wording

- Narrow shouldRetry JSDoc comment to only mention invalid_request_error
  (not auth errors) — matches what the predicate actually checks
  (Copilot thread PRRT_kwDOBss37M6EM5HL)
- Change failure log from "failed after retries" to "failed (retries
  exhausted or skipped)" to reflect short-circuit path
  (CodeRabbit thread PRRT_kwDOBss37M6EM5c1)
- Update billing.refund-correlation tests to match new log wording
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants