fix(execution): fix isolated-vm memory leak and add worker recycling #4108
waleedlatif1 merged 5 commits into staging
Conversation
PR Summary (Cursor Bugbot, Medium Risk): Adds worker recycling in the pool. Reviewed by Cursor Bugbot for commit e05a36c.
Greptile Summary: This PR fixes three root causes of SIGSEGV crashes (exit code 139) in ECS app tasks.

Confidence Score: 5/5 — Safe to merge. All three root causes of the SIGSEGV crash are correctly addressed and the previous P1 pool-sizing concern is resolved. No P0 or P1 findings remain. The handle-release ordering (children before `isolate.dispose()`), the pool-size accounting fix, and the `--no-node-snapshot` flag are all implemented correctly. The `lifetimeExecutions` counter is correctly incremented on the normal-result and timeout paths with no double-counting risk. The retiring-worker drain-then-cleanup flow is sound. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Execution request]) --> B[acquireWorker]
    B --> C{Non-retiring worker\nwith capacity?}
    C -- Yes --> D[dispatchToWorker]
    C -- No, pool not full\nexcluding retiring --> E[spawnWorker]
    E --> D
    C -- Pool at capacity --> F[enqueueExecution]
    D --> G[[Worker isolate executes]]
    G --> H{Complete / Timeout / SendFail}
    H -- Normal result --> I[lifetimeExecutions++]
    H -- Timeout +1s --> J[lifetimeExecutions++]
    H -- SendFail --> K[resolve error]
    I --> L{>= MAX_EXECUTIONS_PER_WORKER?}
    J --> L
    L -- Yes --> M[retiring = true]
    L -- No --> N{retiring &&\nactiveExecutions == 0?}
    M --> N
    N -- Yes --> O[cleanupWorker]
    N -- No --> P[resetWorkerIdleTimeout]
    K --> P
    O --> Q[resolve result / error]
    P --> Q
    Q --> R[drainQueue]
    R --> S{Queue items?\npool below POOL_SIZE?}
    S -- Yes --> E
    S -- No --> T([Done])
```
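The pool flow in the chart above can be sketched in TypeScript. This is a minimal illustration of the flowchart's logic only, not the PR's actual code — `PoolWorker`, `MAX_CONCURRENT_PER_WORKER`, and the exact function bodies are assumptions:

```typescript
// Illustrative stand-ins for the pool types and constants (not the PR's code).
interface PoolWorker {
  activeExecutions: number;
  lifetimeExecutions: number;
  retiring: boolean;
}

const POOL_SIZE = 4;                   // assumed pool size
const MAX_CONCURRENT_PER_WORKER = 8;   // assumed per-worker capacity
const MAX_EXECUTIONS_PER_WORKER = 500; // default; tunable via IVM_MAX_EXECUTIONS_PER_WORKER

const pool: PoolWorker[] = [];

function spawnWorker(): PoolWorker {
  const w: PoolWorker = { activeExecutions: 0, lifetimeExecutions: 0, retiring: false };
  pool.push(w);
  return w;
}

// Retiring workers get no new dispatches and are excluded from the pool-size
// count, so a replacement can spawn while they drain.
function acquireWorker(): PoolWorker | null {
  const candidate = pool.find(
    (w) => !w.retiring && w.activeExecutions < MAX_CONCURRENT_PER_WORKER,
  );
  if (candidate) return candidate;
  const live = pool.filter((w) => !w.retiring).length;
  return live < POOL_SIZE ? spawnWorker() : null; // null => enqueueExecution
}

// Runs on the normal-result and timeout paths (send failures skip the counter).
function onExecutionSettled(w: PoolWorker): void {
  w.activeExecutions -= 1;
  w.lifetimeExecutions += 1;
  if (w.lifetimeExecutions >= MAX_EXECUTIONS_PER_WORKER) w.retiring = true;
  if (w.retiring && w.activeExecutions === 0) {
    pool.splice(pool.indexOf(w), 1); // cleanupWorker; drainQueue spawns a replacement
  }
}
```

Because the counter is bumped on both the normal and timeout paths but never on send failure, a worker can never be double-counted for a single execution.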
✅ Bugbot reviewed your changes and found no new issues!
Summary
Root cause: ECS app tasks were crashing with exit code 139 (SIGSEGV) after ~87 minutes of steady memory growth (~200 MB/min). Three bugs:

- `isolate.dispose()` was called in the `finally` block. Per isolated-vm issue #198, disposing an isolate while child handles still exist causes stuck-GC states and native memory accumulation outside the V8 heap. Per issue #42, unreleased References hold their parent context alive indefinitely.
- `--no-node-snapshot` was missing from worker spawns. The isolated-vm v6 README explicitly requires this flag for Node.js 20+ due to a shared V8 snapshot heap incompatibility (issue #377).

Fix 1 (`isolated-vm-worker.cjs`): Hoist all ivm handle declarations before the `try` block so `finally` can reach them. Release children (scripts, references, external copies, context) before calling `isolate.dispose()` — the correct order per the library's issue tracker. All releases are wrapped in `try/catch` so a bad isolate state can't crash the worker.

Fix 2 (`isolated-vm.ts`): Add graceful worker recycling after N lifetime executions (default 500, tunable via `IVM_MAX_EXECUTIONS_PER_WORKER`). Workers are marked `retiring` — no new dispatches, in-flight executions drain normally, then the worker is cleaned up and a replacement spawns automatically. Zero user-visible impact.

Fix 3 (`isolated-vm.ts`): Add the `--no-node-snapshot` flag to worker spawns — required by isolated-vm v6 for Node.js 20+.
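The hoist-then-release ordering of Fix 1 can be shown in miniature. Since isolated-vm is a native module, plain objects stand in for the ivm handles here; `makeHandle` and `order` are illustrative, not part of the library:

```javascript
// Mock handles record the order in which they are torn down.
const order = [];
const makeHandle = (name) => ({ release: () => order.push(name) });

// Hoist handle declarations before `try` so `finally` can reach them
// even if the try block throws partway through setup.
let script, reference, context, isolate;
try {
  isolate = { dispose: () => order.push('isolate') };
  context = makeHandle('context');
  script = makeHandle('script');
  reference = makeHandle('reference');
  // ... compile and run user code here ...
} finally {
  // Release children BEFORE disposing the isolate; each release is wrapped
  // in try/catch so a bad isolate state cannot crash the worker.
  for (const h of [script, reference, context]) {
    try { h?.release(); } catch {}
  }
  try { isolate?.dispose(); } catch {}
}
console.log(order.join(' -> ')); // script -> reference -> context -> isolate
```

The key invariant is simply that `isolate` appears last: no child handle outlives its isolate's disposal.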
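For Fix 3, Node-specific flags for a forked worker go in `execArgv`. A sketch under assumptions — the worker path is hypothetical, and the `spawnSync` call at the end merely demonstrates that the current Node binary accepts the flag:

```javascript
const { fork, spawnSync } = require('node:child_process');

// Hypothetical worker spawn: execArgv carries Node flags, not script argv.
function spawnWorker(workerPath) {
  return fork(workerPath, [], {
    // Required by isolated-vm v6 on Node.js 20+ (shared V8 snapshot heap
    // incompatibility, isolated-vm issue #377).
    execArgv: ['--no-node-snapshot'],
  });
}

// Sanity check: evaluate a trivial expression with the flag present.
const res = spawnSync(process.execPath, ['--no-node-snapshot', '-p', '1 + 1'], {
  encoding: 'utf8',
});
console.log(res.status, res.stdout.trim()); // 0 2
```

Without the flag, worker processes inherit the default Node snapshot behavior, which isolated-vm v6 documents as incompatible on Node.js 20+.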
Type of Change
Testing
Tested manually — build passes clean
Checklist