fix: guard against episode storm stalling foreground sessions #1844

Open
de1tydev wants to merge 1 commit into MemTensor:main from de1tydev:fix/episode-storm-guard

@de1tydev

Problem

Large merged episodes trigger a cascade of expensive post-processing (capture → reward → L2 induction → L3 abstraction → skill crystallization) that can stall OpenClaw and Hermes Agent foreground sessions. This is especially common in long development workflows where relation.classify consistently returns revision/follow_up, allowing a single episode to accumulate dozens or hundreds of turns.

Fixes #1755

Root Causes

  1. No episode turn limit: episodes grow unbounded, so the full L1→L2→L3→skill chain hits all at once when the topic finally ends
  2. Synchronous classify in before_prompt_build: relation.classify() is an LLM call that blocks foreground prompt construction with no timeout
  3. Unlimited background LLM concurrency: capture/reward/L2/L3/skill subscribers fire unlimited parallel LLM calls, starving the event loop

Changes

Fix 1: Episode turn hard limit (maxTurnsPerEpisode)

  • New config: algorithm.session.maxTurnsPerEpisode (default 30, range 5–200)
  • When an open episode reaches this turn count, the next turn forces a topic boundary regardless of relation classification
  • Also applies when reopening recovered episodes
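
A minimal sketch of the turn-limit guard described above, for reviewers who want the intended precedence spelled out. The names `EpisodeState`, `BoundaryDecision`, and `decideBoundary` are hypothetical and are not the identifiers used in core/pipeline/orchestrator.ts; only `maxTurnsPerEpisode` and its default of 30 come from this PR.

```typescript
// Hypothetical types for illustration only.
type Relation = "revision" | "follow_up" | "new_task";

interface EpisodeState {
  turnCount: number; // turns already recorded in the open (or recovered) episode
}

interface BoundaryDecision {
  startNewEpisode: boolean;
  reason: "turn_limit" | "classified_new_task" | "continue";
}

function decideBoundary(
  episode: EpisodeState,
  relation: Relation,
  maxTurnsPerEpisode = 30, // default stated in this PR
): BoundaryDecision {
  // The hard limit takes precedence over the classifier: once the episode
  // has reached the limit, the incoming turn starts a new episode even if
  // the relation is revision or follow_up.
  if (episode.turnCount >= maxTurnsPerEpisode) {
    return { startNewEpisode: true, reason: "turn_limit" };
  }
  if (relation === "new_task") {
    return { startNewEpisode: true, reason: "classified_new_task" };
  }
  return { startNewEpisode: false, reason: "continue" };
}
```

Because the check only looks at the stored turn count, the same function would cover a recovered episode that is reopened at or above the limit.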

Fix 2: Relation classify timeout (classifyTimeoutMs)

  • New config: algorithm.session.classifyTimeoutMs (default 5000ms, range 1000–30000)
  • relation.classify() calls are wrapped with Promise.race against the timeout
  • On timeout, defaults to new_task (safe conservative boundary)
  • Prevents foreground prompt construction from blocking indefinitely
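
The timeout wrapper can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: `classifyWithTimeout` is a hypothetical name, and the `classify` callback stands in for the real relation.classify() call, whose signature is not shown in this description.

```typescript
type Relation = "revision" | "follow_up" | "new_task";

// Hypothetical helper: races the classifier against a timer and falls back
// to "new_task" when the timer wins.
async function classifyWithTimeout(
  classify: () => Promise<Relation>,
  timeoutMs = 5000, // default stated in this PR
): Promise<Relation> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const fallback = new Promise<Relation>((resolve) => {
    timer = setTimeout(() => resolve("new_task"), timeoutMs);
  });
  try {
    return await Promise.race([classify(), fallback]);
  } finally {
    // Clear the timer so a fast classification does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that Promise.race does not cancel the losing promise: in this sketch the underlying LLM request keeps running after a timeout and its result is simply ignored. A rejection from `classify` still propagates to the caller.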

Fix 3: Background LLM concurrency semaphore (bgLlmConcurrency)

  • New config: algorithm.session.bgLlmConcurrency (default 2, range 1–8)
  • Shared semaphore gates all LLM calls from capture, reward, L2, L3, skill, and feedback subscribers
  • Prevents event-loop starvation from concurrent background processing
  • Capture's existing llmConcurrency (per-step α scoring) is unaffected — the semaphore only applies to the shared LLM client used by post-capture processing
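
To make the gating concrete, here is a small sketch of an async semaphore and a client wrapper that takes one permit per call. It is a sketch under assumptions: the `LlmClient` interface with a single `complete` method and the `rateLimitedLlm` function name are hypothetical, and the actual core/util/semaphore.ts and core/util/rate-limited-llm.ts may be structured differently.

```typescript
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (!Number.isInteger(permits) || permits < 1) {
      throw new RangeError("permits must be a positive integer");
    }
    this.available = permits;
  }

  private acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return Promise.resolve();
    }
    // No permit free: queue until a running task releases one.
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the permit directly to the oldest waiter (FIFO)
    } else {
      this.available += 1;
    }
  }

  // Runs the task while holding a permit; the permit is released even if
  // the task rejects.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Hypothetical minimal client shape, for illustration only.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// Wraps a client so every call waits for a permit from the shared semaphore.
function rateLimitedLlm(inner: LlmClient, semaphore: Semaphore): LlmClient {
  return {
    complete: (prompt) => semaphore.run(() => inner.complete(prompt)),
  };
}
```

Passing the same `Semaphore` instance to every background subscriber is what makes the limit global; with `new Semaphore(2)`, at most two wrapped calls are in flight at any time and the rest wait in arrival order.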

New Files

  • core/util/semaphore.ts — lightweight async semaphore
  • core/util/rate-limited-llm.ts — transparent LLM client wrapper that acquires a semaphore permit per call

Files Modified

  • core/config/schema.ts — 3 new config fields with JSDoc
  • core/config/defaults.ts — defaults: maxTurns=30, classifyTimeout=5s, bgConcurrency=2
  • core/pipeline/types.ts — SessionRoutingConfig extended
  • core/pipeline/deps.ts — config extraction + semaphore wiring
  • core/pipeline/orchestrator.ts — turn-limit guard + classify timeout wrapper
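
For reference, the three fields with their defaults and ranges can be summarized in a sketch like the one below. The interface and function names are hypothetical, and rejecting out-of-range values with an error is an assumption: this description does not say whether the real schema rejects or clamps them.

```typescript
// Hypothetical shape mirroring the three algorithm.session.* fields.
interface SessionGuardConfig {
  maxTurnsPerEpisode: number;
  classifyTimeoutMs: number;
  bgLlmConcurrency: number;
}

// Defaults and ranges as stated in this PR.
const DEFAULTS: SessionGuardConfig = {
  maxTurnsPerEpisode: 30,
  classifyTimeoutMs: 5000,
  bgLlmConcurrency: 2,
};

const RANGES: Record<keyof SessionGuardConfig, readonly [number, number]> = {
  maxTurnsPerEpisode: [5, 200],
  classifyTimeoutMs: [1000, 30000],
  bgLlmConcurrency: [1, 8],
};

// Assumption: out-of-range or non-integer values are rejected, not clamped.
function resolveSessionGuardConfig(
  overrides: Partial<SessionGuardConfig> = {},
): SessionGuardConfig {
  const merged: SessionGuardConfig = { ...DEFAULTS, ...overrides };
  for (const key of Object.keys(RANGES) as Array<keyof SessionGuardConfig>) {
    const [min, max] = RANGES[key];
    const value = merged[key];
    if (!Number.isInteger(value) || value < min || value > max) {
      throw new RangeError(`${key} must be an integer in [${min}, ${max}]`);
    }
  }
  return merged;
}
```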

Testing

  • tsc --noEmit passes (no type errors)
  • All new config values have sensible defaults that preserve existing behavior for users who don't change them (the turn limit is the only behavior change: episodes that would have grown past 30 turns now get split)

@de1tydev
Author

Linked to #1755 — detailed root-cause analysis is in the issue comments.

Development

Successfully merging this pull request may close these issues.

memos-local-plugin: large merged episodes can trigger L2/L3/skill-evolution storm and stall OpenClaw sessions