fix: guard against episode storm stalling foreground sessions by de1tydev · Pull Request #1844 · MemTensor/MemOS

de1tydev · 2026-05-31T14:53:55Z

Problem

Large merged episodes trigger a cascade of expensive post-processing (capture → reward → L2 induction → L3 abstraction → skill crystallization) that can stall OpenClaw and Hermes Agent foreground sessions. This is especially common in long development workflows where relation.classify consistently returns revision/follow_up, allowing a single episode to accumulate dozens or hundreds of turns.

Fixes #1755

Root Causes

No episode turn limit — episodes grow unbounded; the full L1→L2→L3→skill chain hits all at once when the topic finally ends
Synchronous classify in before_prompt_build — relation.classify() is an LLM call that blocks foreground prompt construction with no timeout
Unlimited background LLM concurrency — capture/reward/L2/L3/skill subscribers fire unlimited parallel LLM calls, starving the event loop

Changes

Fix 1: Episode turn hard limit (`maxTurnsPerEpisode`)

New config: algorithm.session.maxTurnsPerEpisode (default 30, range 5–200)
When an open episode reaches this turn count, the next turn forces a topic boundary regardless of relation classification
Also applies when reopening recovered episodes
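The guard amounts to a check that runs before classification. The sketch below is hypothetical. `Relation`, `EpisodeState`, and `resolveRelation` are illustrative names, not the code in `core/pipeline/orchestrator.ts`. It also assumes something the PR does not state: that the classifier is skipped entirely once the limit is reached. Because the check only reads `turnCount`, it applies equally to an episode that is still open and to a recovered episode being reopened.

```typescript
// Hypothetical sketch of the turn-limit guard; names are illustrative, not MemOS APIs.
type Relation = "revision" | "follow_up" | "new_task";

interface EpisodeState {
  turnCount: number; // turns already recorded in the open (or recovered) episode
}

// True once the episode already holds maxTurnsPerEpisode turns,
// meaning the incoming turn must start a new episode.
function atTurnLimit(episode: EpisodeState, maxTurnsPerEpisode: number): boolean {
  return episode.turnCount >= maxTurnsPerEpisode;
}

// Decide the relation for an incoming turn. At the limit we return "new_task"
// without calling the classifier (an assumption of this sketch), so the
// forced boundary also saves one foreground LLM call.
async function resolveRelation(
  episode: EpisodeState | null,
  maxTurnsPerEpisode: number,
  classify: () => Promise<Relation>,
): Promise<Relation> {
  if (episode === null) return "new_task"; // nothing open: always a fresh episode
  if (atTurnLimit(episode, maxTurnsPerEpisode)) return "new_task"; // forced boundary
  return classify(); // normal path: let relation classification decide
}
```

Skipping `classify()` at the limit is a design choice in this sketch. The boundary is forced either way, so the LLM call would only add latency to prompt construction.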

Fix 2: Relation classify timeout (`classifyTimeoutMs`)

New config: algorithm.session.classifyTimeoutMs (default 5000ms, range 1000–30000)
relation.classify() calls are wrapped with Promise.race against the timeout
On timeout, the relation defaults to new_task (a safe, conservative boundary)
Prevents foreground prompt construction from blocking indefinitely
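Here is a minimal sketch of the timeout wrapper. It assumes `classify()` returns a promise of the relation label, and `classifyWithTimeout` is a hypothetical name. The timer resolves to `new_task` instead of rejecting, so when classification is slow, `Promise.race` settles on the conservative boundary.

```typescript
// Hypothetical helper; assumes classify() resolves to one of these labels.
type Relation = "revision" | "follow_up" | "new_task";

async function classifyWithTimeout(
  classify: () => Promise<Relation>,
  timeoutMs: number,
): Promise<Relation> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // Resolves (does not reject) with the safe fallback once timeoutMs elapses.
  const fallback = new Promise<Relation>((resolve) => {
    timer = setTimeout(() => resolve("new_task"), timeoutMs);
  });
  try {
    // Whichever settles first wins: the real classification or the fallback.
    return await Promise.race([classify(), fallback]);
  } finally {
    // Clear the timer so a fast classify() does not leave a pending timeout behind.
    clearTimeout(timer);
  }
}
```

`Promise.race` does not cancel the losing `classify()` call. That call keeps running in the background, and its result is discarded. An error thrown by `classify()` before the deadline still propagates to the caller, because the PR only describes a timeout fallback.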

Fix 3: Background LLM concurrency semaphore (`bgLlmConcurrency`)

New config: algorithm.session.bgLlmConcurrency (default 2, range 1–8)
Shared semaphore gates all LLM calls from capture, reward, L2, L3, skill, and feedback subscribers
Prevents event-loop starvation from concurrent background processing
Capture's existing llmConcurrency setting (per-step α scoring) is unaffected; the semaphore applies only to the shared LLM client used by post-capture processing
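A shared semaphore of this kind can be very small. The following is an illustrative sketch, not the contents of `core/util/semaphore.ts`. A released permit goes directly to the next waiter in FIFO order, and `run()` releases its permit even when the wrapped call throws.

```typescript
// Illustrative async semaphore; the PR's actual implementation may differ.
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (!Number.isInteger(permits) || permits < 1) {
      throw new RangeError("permits must be a positive integer");
    }
    this.available = permits;
  }

  // Take a free permit immediately, otherwise wait in FIFO order.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hand the permit straight to the next waiter, or return it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) next();
    else this.available++;
  }

  // Run fn while holding a permit; the permit is released even if fn throws.
  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}
```

For the cap to hold across capture, reward, L2, L3, skill, and feedback work combined, every background subscriber would need to share one instance, e.g. `new Semaphore(bgLlmConcurrency)`, rather than each subscriber creating its own.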

New Files

core/util/semaphore.ts — lightweight async semaphore
core/util/rate-limited-llm.ts — transparent LLM client wrapper that acquires a semaphore permit per call
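The wrapper can be sketched as follows. It assumes a hypothetical `LlmClient` interface with a single `complete(prompt)` method, since the real client type is not shown in the PR. `createLimiter` is a compact stand-in for the semaphore so the block runs on its own. The wrapped client exposes the same interface as the raw one, which is what makes it transparent to subscribers.

```typescript
// Hypothetical client interface; MemOS's real LLM client type is not shown in the PR.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

type Limit = <T>(fn: () => Promise<T>) => Promise<T>;

// Tiny FIFO limiter: at most `max` wrapped calls are in flight at once.
function createLimiter(max: number): Limit {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = () => {
    if (active < max && queue.length > 0) {
      active++;
      queue.shift()!();
    }
  };
  return <T>(fn: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      queue.push(() => {
        // Wrapping in new Promise turns a synchronous throw in fn into a rejection.
        new Promise<T>((r) => r(fn()))
          .then(resolve, reject)
          .finally(() => {
            active--;
            next();
          });
      });
      next();
    });
}

// Same interface as the raw client; every call waits for a permit first.
function rateLimited(client: LlmClient, limit: Limit): LlmClient {
  return { complete: (prompt) => limit(() => client.complete(prompt)) };
}
```

In this sketch, wiring would mean building one limiter from `bgLlmConcurrency` and passing the wrapped client to the post-capture subscribers. Capture's own `llmConcurrency` path would keep using the unwrapped client.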

Files Modified

core/config/schema.ts — 3 new config fields with JSDoc
core/config/defaults.ts — defaults: maxTurns=30, classifyTimeout=5s, bgConcurrency=2
core/pipeline/types.ts — SessionRoutingConfig extended
core/pipeline/deps.ts — config extraction + semaphore wiring
core/pipeline/orchestrator.ts — turn-limit guard + classify timeout wrapper
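The three new fields, their defaults, and their ranges can be summarized in a sketch. `SessionGuardConfig` and `resolveSessionConfig` are hypothetical names. Clamping out-of-range values is an assumption: the actual schema in `core/config/schema.ts` may reject such values instead.

```typescript
// Hypothetical shape covering only the three new algorithm.session fields.
interface SessionGuardConfig {
  maxTurnsPerEpisode: number;
  classifyTimeoutMs: number;
  bgLlmConcurrency: number;
}

// Defaults as listed for core/config/defaults.ts.
const SESSION_DEFAULTS: SessionGuardConfig = {
  maxTurnsPerEpisode: 30,
  classifyTimeoutMs: 5000,
  bgLlmConcurrency: 2,
};

// Inclusive [min, max] ranges as stated in the PR description.
const RANGES: Record<keyof SessionGuardConfig, [number, number]> = {
  maxTurnsPerEpisode: [5, 200],
  classifyTimeoutMs: [1000, 30000],
  bgLlmConcurrency: [1, 8],
};

// Fill missing fields from defaults and clamp out-of-range values (assumed behavior).
function resolveSessionConfig(user: Partial<SessionGuardConfig> = {}): SessionGuardConfig {
  const out = { ...SESSION_DEFAULTS };
  for (const key of Object.keys(RANGES) as Array<keyof SessionGuardConfig>) {
    const raw = user[key];
    if (raw === undefined || !Number.isFinite(raw)) continue; // keep the default
    const [min, max] = RANGES[key];
    out[key] = Math.min(max, Math.max(min, Math.round(raw)));
  }
  return out;
}
```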

Testing

tsc --noEmit passes (no type errors)
All new config values have defaults chosen so that users who don't change them see the same behavior as before, with one exception: the turn limit. Episodes that would previously have grown past 30 turns are now split.


de1tydev · 2026-05-31T14:53:59Z

Linked to #1755 — detailed root-cause analysis is in the issue comments.

fix: guard against episode storm stalling foreground sessions (MemTen… …sor#1755) · 2e8c3dd

de1tydev wants to merge 1 commit into MemTensor:main from de1tydev:fix/episode-storm-guard
