Skip to content

fix: loop safety — max iteration guard and malformed config detection#254

Merged
pinodeca merged 1 commit into
microsoft:mainfrom
iemejia:fix/loop-safety
Jun 18, 2026
Merged

fix: loop safety — max iteration guard and malformed config detection#254
pinodeca merged 1 commit into
microsoft:mainfrom
iemejia:fix/loop-safety

Conversation

@iemejia

@iemejia iemejia commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Split from #221 to ease review.

M7: Loops now track iteration count across continue_as_new generations via FunctionInput.loop_iteration. After 100,000 iterations (~27 hours at the minimum 1-second rate limit), the loop fails with a clear error directing users to df.break() or workflow restructuring.

M8: If a LOOP node's condition config is unparseable JSON, the code now returns an error immediately instead of silently falling through to infinite looping.

Files: src/orchestrations/execute_function_graph.rs, src/types.rs, src/dsl.rs, src/lib.rs
Tests: 3 pg_tests

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with Opus 4.8

Issues / discussion points

  1. No test exercises the M7 runtime guard itself. All three tests cover serialization and DSL validation. The actual MAX_LOOP_ITERATIONS trip (the core behavior) is untested — triggering it needs 100k iterations, which is impractical in a pg_test. Consider making the limit injectable (e.g. a const overridable in tests, or a small helper that takes the max as a parameter) so a unit test can assert the failure fires at the boundary. As written, a regression in the counter math could pass CI silently.

  2. Behavioral change for intentionally long-running loops. A daemon-style loop (e.g. poll-a-queue-forever with df.break) will now hard-fail after ~27h rather than run indefinitely. That's a deliberate safety tradeoff, but it ends the instance in failed state. Two things worth confirming with the author:

    • Is failing (vs. gracefully completing with the last body result) the desired terminal state? The PR description says "fails with a clear error," so this seems intentional — but it's a semantic the USER_GUIDE should document.
    • Should the limit be a GUC rather than a hard-coded const, given different deployments may want different ceilings?
  3. Loops nested inside subtrees aren't covered by the guard. execute_function_graph.rs hardcodes loop_iteration: 0 and its custom input schema doesn't carry the counter, so a loop in a parallel/JOIN branch resets every generation. This appears to be pre-existing (subtree continue_as_new semantics already differ), not introduced here — but a one-line comment noting the guard only applies to top-level loops would prevent a false sense of safety.

  4. Minor: error message off-by-one wording. The guard fires when next_iteration >= MAX_LOOP_ITERATIONS, so the body executes 100,000 times and the message says "exceeded maximum iteration count of 100000" — technically it's failing at the limit, not after exceeding it. Cosmetic; ignore if you prefer.

Verdict

Approve-with-nits. Nothing blocks merge functionally; the changes are safe and backward-compatible. I'd ask for (1) a boundary test for the counter and (2) a doc/comment on the new fail-after-27h behavior before merge.

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

M7: Loops now track iteration count across continue_as_new generations
via FunctionInput.loop_iteration. After 100,000 iterations (~27 hours at
the minimum 1-second rate limit), the loop fails with a clear error
directing users to df.break() or workflow restructuring.

M8: If a LOOP node's condition config is unparseable JSON, the code now
returns an error immediately instead of silently falling through to
infinite looping.
@pinodeca pinodeca merged commit a26cab9 into microsoft:main Jun 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants