feat(workflows): add continue_on_error step field for non-halting failures#2663
feat(workflows): add continue_on_error step field for non-halting failures#2663doquanghuy wants to merge 2 commits into
Conversation
f34ab4c to
da8ed4d
Compare
| By default, a non-zero exit code from any step halts the entire run. | ||
| Set `continue_on_error: true` on a step to record its result and | ||
| continue to the next sibling step instead. The exit code remains | ||
| available on `steps.<id>.output.exit_code` so downstream `if`, | ||
| `switch`, or `gate` steps can branch on it: |
| assert not any( | ||
| "continue_on_error" in e for e in errors | ||
| ), errors |
| monkeypatch.setattr(gate_module.sys.stdin, "isatty", lambda: True) | ||
| monkeypatch.setattr( |
|
@mnriem — addressed all three Copilot findings in 1. README wording. Rewrote the "Error Handling" intro in terms of 2. Validation test contract tightened. 3. Gate stdin patching made runner-robust. In Full suite still passes (continue_on_error: 7/7), no regressions. Branch is AI disclosure: drafted with Claude Opus, human-reviewed. |
Closes github#2591. Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (exit_code, stderr, status) into `steps.<id>.output` and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path. This composes with primitives that already exist (the exit code is already captured, the expression engine already resolves it, and `if`/`switch`/`gate` are already available) — the only gap was that a non-zero exit hard-stopped the pipeline before any downstream step could evaluate it. ### Engine `WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`: - Gate aborts (`output.aborted`) always halt the run — operator decisions take precedence over the flag. - Otherwise, if `continue_on_error: true`, log a `step_continue_on_error` event and proceed to the next sibling. - Otherwise, behave as before: set `RunStatus.FAILED` and return. ### Validation `_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics. ### Default behaviour preserved When `continue_on_error` is omitted, every code path is byte-equivalent to before this change. Existing workflows see no difference. ### Tests New `TestContinueOnError` class in `tests/test_workflows.py` covers all four scenarios from the issue's acceptance criteria plus two extras: - undeclared (default) failure halts the run. - declared-and-fired continues past the failure. - declared-but-step-succeeded is a no-op (flag only matters on FAILED). - if-branch end-to-end exercising the canonical recovery pattern from the issue discussion. - gate abort still halts even with `continue_on_error: true` set. - validation rejects non-bool values; accepts both `true` and `false` cleanly. ### Docs Adds an "Error Handling" section to `workflows/README.md` documenting the field, the gate-abort precedence rule, and the canonical recovery pattern. ### Follow-on Auto-retry-on-transient (e.g. retry a 429 at 3 AM without operator attendance) is intentionally out of scope. The current proposal covers the **skip** and **abort** verdicts from the original discussion; the **retry** verdict still pauses for an operator at the gate step. A future loop/retry-count primitive or an auto-approving gate could close that gap on top of this mechanism without further engine changes.
- Reword README "Error Handling" intro in terms of `StepStatus.FAILED` halting by default, with non-zero shell/command exit as one common cause. Avoids implying only exit codes can halt a run (gate aborts and validation failures also do, just via different mechanisms). - Tighten `test_validation_accepts_bool_continue_on_error` to assert `errors == []` instead of "no error mentions continue_on_error", so unrelated validation regressions on the same minimal YAML can no longer slip past this test. - In `test_gate_abort_still_halts_with_continue_on_error`, swap `sys.stdin` itself for a stub `_TTYStdin` instead of patching `sys.stdin.isatty`. Method-on-instance assignment is unreliable on real `io.TextIOWrapper` objects (e.g. under pytest with capture disabled), so replacing the whole stdin object is more robust across runners. All 2967 tests still pass.
e9a4871 to
a0e78ee
Compare
Description
Closes #2591.
Adds an optional
continue_on_error: boolfield on every step.When set to
trueand the step fails, the engine records theresult (exit_code, stderr, status) into
steps.<id>.outputandcontinues to the next sibling step instead of halting the run.
Downstream
if,switch, orgatesteps can then branch on{{ steps.<id>.output.exit_code }}to route the recovery path.This is the shape @mnriem proposed in the issue discussion —
it composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and
if/switch/gateare already available). The only gapwas that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.
Canonical usage
Engine
WorkflowEngine._execute_stepsnow consults the step config whena step returns
StepStatus.FAILED:output.aborted) always halt the run — operatordecisions take precedence over the flag.
continue_on_error: true, log astep_continue_on_errorevent and proceed to the next sibling.step_failed, setRunStatus.FAILED, and return.Exactly one event per failure-resolution path is logged so the
log timeline is unambiguous: either the run continued past the
failure or it halted.
Validation
_validate_stepsrejects non-bool values forcontinue_on_error.Coerced strings like
"true"are not accepted so authoringmistakes surface at validation time rather than silently
changing run semantics.
Default behaviour preserved
When
continue_on_erroris omitted, every code path isbyte-equivalent to before this change. Existing workflows see no
difference.
Verdict coverage (from the issue discussion)
continue_on_error: true+ifbranches around the failurecontinue_on_error: true+gate→ operator approves →resumere-runs from gateFully unattended retry-on-transient (e.g. retry a 429 at 3 AM
without operator attendance) is intentionally out of scope here.
The skip and abort verdicts work without a human; the
retry verdict still pauses for one at the gate. A future
loop/retry-count primitive or an auto-approving gate type could
close that gap on top of this mechanism without further engine
changes — happy to follow up on that in a separate issue if
useful.
Testing
uv run specify --helpuv sync && uv run pytest→ 2967 passed, 35 skipped (was 2960 before; +7 new
tests added in this PR).
the middle step exits non-zero. Without
continue_on_error, run halts at the failing step (asbefore). With
continue_on_error: true, the failing steprecords
exit_codeand the third step executes. Adownstream
ifbranching on{{ steps.flaky.output.exit_code != 0 }}routes into arecovery
gatecleanly.New test coverage
TestContinueOnErrorintests/test_workflows.py:test_undeclared_failure_halts_runtest_declared_and_fired_continues_runtest_declared_but_step_succeeded_is_nooptest_if_branch_routes_around_failuretest_gate_abort_still_halts_with_continue_on_errortest_validation_rejects_non_bool_continue_on_error"true"(string) fails validation.test_validation_accepts_bool_continue_on_errortrueandfalsepass validation cleanly.AI Disclosure
Used Claude Opus to draft the engine change, the test suite, the
docs section, and this PR body. The shape (
continue_on_errorproposed by @mnriem on the issue thread; this PR implements that
proposal. Code, tests, and design decisions were human-reviewed
before submission.