feat(billing): distributed lock for multi-pod crons#3688
Conversation
Mongo-backed TTL doc lock with auto-expiry. Acquired via findOneAndUpdate upsert with stale-or-absent predicate; released by holder match. Addresses devkit audit P1 (2026-05-21) — billing crons race when K8s concurrencyPolicy bypass occurs (e.g. crash post-jitter pre-finalize).
weeklyReset, dunningSweep, extrasExpiration now acquire a distributed lock before mutating subscriptions. Skip-on-contention with info log; release in finally for clean recovery. Also fixes Mongoose 8 deprecation: `new: true` → `returnDocument: 'after'` in findOneAndUpdate. Closes audit P1 — multi-pod race on crash post-jitter pre-finalize.
- Move distributedLock unit test to canonical lib location (or document deviation) - Add releaseLock throw-path test + non-fatal comment in 3 crons' finally - Harmonize log style in skip-on-contention path - Rename + document integration test (in-process concurrency clarification) - Add findOne verification step in RUNBOOKS before deleteOne Addresses code-quality review I1, I2, M1, M3, M4.
- HIGH: restore (errors || desyncErrors) condition in dunningSweep exit code (regression introduced when wrapping with lock — would make K8s alerts silent) - MEDIUM: wrap releaseLock in try/catch within finally to preserve original work error if both throw - LOW: guard acquireLock against invalid ttlMs (must be positive finite) - NIT: drop redundant setDefaultsOnInsert (all fields explicitly $set) Addresses gate iteration 1 BLOCK.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (7)
WalkthroughThis PR introduces MongoDB-backed distributed locking to prevent concurrent execution of billing cron jobs across multiple Kubernetes pods. Three cron scripts now acquire a lock before executing, skip work if the lock is unavailable, and release the lock via TTL expiry or explicit deletion. ChangesDistributed lock for billing cron concurrency
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3688 +/- ##
==========================================
+ Coverage 89.62% 89.66% +0.03%
==========================================
Files 139 140 +1
Lines 4744 4759 +15
Branches 1488 1491 +3
==========================================
+ Hits 4252 4267 +15
Misses 385 385
Partials 107 107
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a MongoDB-backed distributed lock primitive to prevent multi-pod billing CronJobs from running concurrently, integrates it into the key billing crons, and adds documentation + tests to validate the locking contract end-to-end.
Changes:
- Add
lib/distributedLock.js(Mongo TTL-doc based lock) with unit + integration tests. - Wrap
weeklyReset,dunningSweep, andextrasExpirationbilling crons with acquire/skip/release lock behavior. - Document operational handling for stuck locks in the billing runbooks and cron README.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| modules/billing/tests/billing.cron-lock.integration.tests.js | New integration coverage for lock acquire/release/expiry behavior against real MongoDB. |
| modules/billing/RUNBOOKS.md | Adds a runbook entry for diagnosing and clearing stuck cron locks. |
| modules/billing/crons/README.md | Documents cron lock purpose, TTLs, and expected “skipping” logs. |
| modules/billing/crons/billing.weeklyReset.js | Adds distributed lock acquisition/release around weekly reset execution. |
| modules/billing/crons/billing.extrasExpiration.js | Adds distributed lock acquisition/release around extras expiration sweep. |
| modules/billing/crons/billing.dunningSweep.js | Adds distributed lock acquisition/release around dunning sweep and restores exit-code behavior on desync errors. |
| lib/tests/distributedLock.unit.tests.js | New unit tests with mocked Mongoose for lock behavior and error handling. |
| lib/distributedLock.js | New lock primitive (CronLock model + acquireLock/releaseLock). |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/distributedLock.js`:
- Around line 52-69: The current acquireLock in distributedLock.js only sets
lockedUntil once (in acquireLock) which allows the lease to expire during long
runs; add a renewal/heartbeat mechanism and a safe renew API: implement a
renewLock({name, holder, ttlMs}) that performs a conditional findOneAndUpdate
matching {_id: name, holder} and sets lockedUntil to now+ttlMs (return true only
if holder matches), call that renew periodically from the long-running worker
while it holds the lock (e.g., setInterval cleared on finish) and ensure
renewLock validates ttlMs like acquireLock; alternatively implement a fencing
pattern by embedding a monotonic epoch/token (holder already used) and always
require the holder match when extending or releasing so a stale process cannot
steal the lock.
- Around line 19-35: The new shared distributed lock (LockSchema and exported
CronLock) must live under the approved shared code locations; either move this
module into lib/helpers/ or lib/services/ and update all imports that reference
CronLock to the new path, or if there is a deliberate exception, add a
top-of-file comment justifying why it must remain at lib/distributedLock.js
(include owner, rationale, and link to the exemption approval) and run a
repo-wide search to update/verify imports; ensure the exported symbol CronLock
and the schema name LockSchema remain unchanged so dependent modules still
resolve correctly.
In `@modules/billing/crons/billing.dunningSweep.js`:
- Around line 119-122: The logger.error call in the dunningSweep lock-release
failure currently passes metadata first and message second; change it to pass
the message string first and the metadata object second (i.e., flip the
arguments for logger.error) so it matches the rest of the file and other billing
crons—update the call that references LOCK_NAME and releaseErr in the
dunningSweep lock-release error handling to use logger.error(message, { err:
releaseErr, cron: LOCK_NAME }).
In `@modules/billing/crons/billing.extrasExpiration.js`:
- Around line 89-92: The logger.error call that currently passes meta first
should be changed to message-first ordering to match other calls: call
logger.error with the string '[cron.extrasExpiration] failed to release lock —
will auto-expire on TTL' as the first argument and the metadata object { err:
releaseErr, cron: LOCK_NAME } as the second; update the logger.error invocation
that references releaseErr and LOCK_NAME in the extrasExpiration cron to use
this ordering so breadcrumb format is consistent.
In `@modules/billing/crons/README.md`:
- Around line 136-139: Update the runbook section reference in the README:
replace the text `## Cron lock stuck` with the exact header `## 6 — Cron lock
stuck` so the link/mention matches the actual section title in the billing
runbook; edit the sentence in modules/billing/crons/README.md that currently
reads "See the runbook entry `## Cron lock stuck` in
`modules/billing/RUNBOOKS.md` for manual resolution." to use the exact header
`## 6 — Cron lock stuck`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 845b6d9b-4d37-45e6-a5d9-3131a62e9fa2
📒 Files selected for processing (8)
lib/distributedLock.jslib/tests/distributedLock.unit.tests.jsmodules/billing/RUNBOOKS.mdmodules/billing/crons/README.mdmodules/billing/crons/billing.dunningSweep.jsmodules/billing/crons/billing.extrasExpiration.jsmodules/billing/crons/billing.weeklyReset.jsmodules/billing/tests/billing.cron-lock.integration.tests.js
- Move lib/distributedLock.js → lib/services/distributedLock.js (aligns with existing convention for service-level primitives) - Move lib/tests/distributedLock.unit.tests.js → lib/services/tests/ - Update import paths in 3 billing crons + integration tests - Flip logger.error arg order in dunningSweep + extrasExpiration finally (match file convention: (message, meta) not (meta, message)) - Same fix applied to weeklyReset (same pattern found, not flagged by CR) - Update modules/billing/crons/README.md → reference '## 6 — Cron lock stuck' (exact match with RUNBOOKS section header) Heartbeat/renewLock suggestion intentionally NOT added — see PR reply.
|
@coderabbitai Thanks for the thorough review. Quick context on the 5 findings: #2-#5 addressed in commit
#1 (renewLock heartbeat) — intentionally not added. Rationale:
Happy to revisit if you have a specific scenario in mind that the current TTL margin does not cover. |
|
Tip For best results, initiate chat on the files or code changes.
The design is intentional and documented, which is exactly the right place to leave it. No further action needed on 🐇 ✨ ✏️ Learnings added
|
… clarify lock test name Resolves Copilot MissingSchemaError finding: dunningSweep and extrasExpiration called mongooseService.connect() without loadModels(), risking MissingSchemaError on repository top-level mongoose.model() calls. Matches weeklyReset/reconcile pattern. Also renames test 'ttlMs < 0' → 'existing lock has expired (lockedUntil < now)' to accurately describe the scenario being exercised.
Summary
lib/distributedLock.js— Mongo TTL-doc lock primitive (acquireLock/releaseLock, holder-bound release, app-level expiry check decoupled from Mongo's 60s TTL monitor)weeklyReset/dunningSweep/extrasExpiration) with skip-on-contention behavior(errors > 0 || desyncErrors > 0)condition indunningSweepexit code (pre-existing regression — K8s alerts were silent on Org.plan desync failures)Context — audit P1 (2026-05-21)
Devkit audit identified billing crons race risk when K8s
concurrencyPolicy: Forbidis bypassed (e.g. pod crash post-jitter pre-finalize). Distributed lock provides defense-in-depth.Reviews completed before push :
d3264b8bb971b2ccDoc-sync ack
## 6 — Cron lock stuck; OK since master section count was 5## Concurrency controlsection ; OKTest plan
NODE_ENV=test npm run test:unit— 1493 tests passNODE_ENV=test npm run test:integration -- billing.cron-lock— 6 tests passNODE_ENV=devkit npm run lint— clean\$sortArrayintegration failures unrelated (also failing on master)Plan
docs/superpowers/plans/2026-05-21-devkit-audit-p0-p1-fixes.mdPhase 1.Summary by CodeRabbit
New Features
Documentation
Tests