feat(workflows): add continue_on_error step field for non-halting failures by doquanghuy · Pull Request #2663 · github/spec-kit

doquanghuy · 2026-05-21T14:34:47Z

Description

Closes #2591.

Adds an optional continue_on_error: bool field on every step.
When set to true and the step fails, the engine records the
result (exit_code, stderr, status) into steps.<id>.output and
continues to the next sibling step instead of halting the run.
Downstream if, switch, or gate steps can then branch on
{{ steps.<id>.output.exit_code }} to route the recovery path.

This is the shape @mnriem proposed in the issue discussion —
it composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and if/switch/gate are already available). The only gap
was that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.

Canonical usage

- id: heavy-thing
  type: command
  integration: claude
  command: speckit.heavy-thing
  continue_on_error: true

- id: check-result
  type: if
  condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
  then:
    - id: review
      type: gate
      message: "Step failed (exit {{ steps.heavy-thing.output.exit_code }}). Retry or skip?"
      on_reject: skip
  else:
    - id: next-thing
      command: speckit.next-thing
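
To make the data flow in the workflow above concrete, here is a minimal Python sketch of how the `if` condition could resolve against the recorded step output. The `resolve` and `pick_branch` helpers and the shape of the context dict are illustrative assumptions; the PR does not show how the real expression engine is implemented.

```python
from typing import Any


def resolve(path: str, context: dict) -> Any:
    """Walk a dotted path such as 'steps.heavy-thing.output.exit_code'.

    Hypothetical helper: it only models simple key lookups.
    """
    value: Any = context
    for key in path.split("."):
        value = value[key]
    return value


def pick_branch(context: dict) -> str:
    """Mirror the `check-result` step: 'then' on failure, 'else' otherwise."""
    exit_code = resolve("steps.heavy-thing.output.exit_code", context)
    return "then" if exit_code != 0 else "else"


# Assumed shape of what the engine records for a failed step.
failed_context = {
    "steps": {
        "heavy-thing": {
            "output": {"exit_code": 2, "stderr": "boom", "status": "failed"}
        }
    }
}
```

With `failed_context`, the sketch selects the `then` branch (the review gate); an `exit_code` of 0 would select `else`.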

Engine

WorkflowEngine._execute_steps now consults the step config when
a step returns StepStatus.FAILED:

- Gate aborts (output.aborted) always halt the run; operator
  decisions take precedence over the flag.
- Otherwise, if continue_on_error: true, log a
  step_continue_on_error event and proceed to the next sibling.
- Otherwise, behave as before: log step_failed, set
  RunStatus.FAILED, and return.

Exactly one event per failure-resolution path is logged so the
log timeline is unambiguous: either the run continued past the
failure or it halted.
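
The three-way decision described above can be sketched roughly as follows. This is a simplified model, not the actual `WorkflowEngine._execute_steps` code: the `StepResult` dataclass, the `run_step` callback, the `events` list, and the `fake_run_step` helper are assumptions made for illustration. The PR also does not state which event or status a gate abort produces, so the sketch reuses the ordinary failure path there.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class StepStatus(Enum):
    COMPLETED = "completed"
    FAILED = "failed"


class RunStatus(Enum):
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class StepResult:
    status: StepStatus
    output: dict = field(default_factory=dict)


def execute_steps(
    steps: list[dict],
    run_step: Callable[[dict], StepResult],
    events: list[tuple[str, str]],
    outputs: dict,
) -> RunStatus:
    """Run sibling steps in order and return the resulting run status."""
    for step in steps:
        result = run_step(step)
        # Record the result whether or not the step succeeded.
        outputs[step["id"]] = {"output": result.output}
        if result.status is not StepStatus.FAILED:
            continue
        if result.output.get("aborted"):
            # Gate abort: the operator's decision wins over the flag.
            # (Assumption: logged like an ordinary failure.)
            events.append(("step_failed", step["id"]))
            return RunStatus.FAILED
        if step.get("continue_on_error") is True:
            # Exactly one event for this failure, then move on.
            events.append(("step_continue_on_error", step["id"]))
            continue
        events.append(("step_failed", step["id"]))
        return RunStatus.FAILED
    return RunStatus.COMPLETED


def fake_run_step(step: dict) -> StepResult:
    """Hypothetical runner: fails when the step carries a non-zero exit_code."""
    code = step.get("exit_code", 0)
    status = StepStatus.FAILED if code != 0 else StepStatus.COMPLETED
    return StepResult(status, {"exit_code": code})
```

Each failure appends exactly one event, which matches the "one event per failure-resolution path" property. Returning `RunStatus.COMPLETED` after a continued failure is also an assumption of this sketch.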

Validation

_validate_steps rejects non-bool values for continue_on_error.
Coerced strings like "true" are not accepted so authoring
mistakes surface at validation time rather than silently
changing run semantics.
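
A strict boolean check of this kind might look like the sketch below. The function name `validate_continue_on_error` and the error message wording are hypothetical; the real check lives inside `_validate_steps`, whose code is not shown here.

```python
def validate_continue_on_error(step: dict) -> list[str]:
    """Return validation errors for the optional continue_on_error field."""
    errors: list[str] = []
    if "continue_on_error" in step:
        value = step["continue_on_error"]
        # isinstance(..., bool) rejects "true", 1, and None: no coercion.
        if not isinstance(value, bool):
            errors.append(
                f"Step {step.get('id', '?')!r}: 'continue_on_error' must be "
                f"a boolean, got {type(value).__name__}."
            )
    return errors
```

Checking `isinstance(value, bool)` rather than truthiness is what keeps the string `"true"` (or the integer `1`) from silently enabling the flag.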

Default behaviour preserved

When continue_on_error is omitted, every code path is
byte-equivalent to before this change. Existing workflows see no
difference.

Verdict coverage (from the issue discussion)

| Scenario | How |
| --- | --- |
| Skip | `continue_on_error: true` + `if` branches around the failure |
| Abort | Omit the flag; today's default halts the run |
| Retry | `continue_on_error: true` + `gate` → operator approves → `resume` re-runs from gate |

Fully unattended retry-on-transient (e.g. retry a 429 at 3 AM
without operator attendance) is intentionally out of scope here.
The skip and abort verdicts work without a human; the
retry verdict still pauses for one at the gate. A future
loop/retry-count primitive or an auto-approving gate type could
close that gap on top of this mechanism without further engine
changes — happy to follow up on that in a separate issue if
useful.

Testing

- Tested locally with uv run specify --help
- Ran existing tests with uv sync && uv run pytest
  → 2967 passed, 35 skipped (was 2960 before; +7 new
  tests added in this PR).
- Tested with a sample project: ran a 3-step workflow where
  the middle step exits non-zero. Without
  continue_on_error, the run halts at the failing step (as
  before). With continue_on_error: true, the failing step
  records exit_code and the third step executes. A
  downstream if branching on
  {{ steps.flaky.output.exit_code != 0 }} routes into a
  recovery gate cleanly.

New test coverage

TestContinueOnError in tests/test_workflows.py:

| Test | What it locks |
| --- | --- |
| `test_undeclared_failure_halts_run` | Default behaviour byte-equivalent: no flag → run halts on first non-zero exit. |
| `test_declared_and_fired_continues_run` | Flag set + step fails → run continues, exit_code recorded. |
| `test_declared_but_step_succeeded_is_noop` | Flag set + step succeeds → no behaviour change. |
| `test_if_branch_routes_around_failure` | End-to-end recovery pattern from the issue discussion. |
| `test_gate_abort_still_halts_with_continue_on_error` | Operator-driven gate abort always halts, even with the flag set. |
| `test_validation_rejects_non_bool_continue_on_error` | `"true"` (string) fails validation. |
| `test_validation_accepts_bool_continue_on_error` | `true` and `false` pass validation cleanly. |

AI Disclosure

- [ ] I did not use AI assistance for this contribution
- [x] I did use AI assistance (described below)

Used Claude Opus to draft the engine change, the test suite, the
docs section, and this PR body. The shape (continue_on_error +
exit-code-as-API + branch on it via existing primitives) was
proposed by @mnriem on the issue thread; this PR implements that
proposal. Code, tests, and design decisions were human-reviewed
before submission.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 3

+By default, a non-zero exit code from any step halts the entire run.
+Set `continue_on_error: true` on a step to record its result and
+continue to the next sibling step instead. The exit code remains
+available on `steps.<id>.output.exit_code` so downstream `if`,
+`switch`, or `gate` steps can branch on it:


+            assert not any(
+                "continue_on_error" in e for e in errors
+            ), errors


+        monkeypatch.setattr(gate_module.sys.stdin, "isatty", lambda: True)
+        monkeypatch.setattr(


doquanghuy · 2026-05-27T12:47:08Z

@mnriem — addressed all three Copilot findings in e9a4871, and rebased the branch onto current main (clean rebase, no conflicts).

1. README wording. Rewrote the "Error Handling" intro in terms of StepStatus.FAILED halting by default, with non-zero shell / command exit as one common cause. Avoids implying only exit codes can halt a run (gate aborts and validation failures also do, just via different mechanisms).

2. Validation test contract tightened. test_validation_accepts_bool_continue_on_error now asserts errors == [] instead of "no error mentions continue_on_error". Unrelated validation regressions on the same minimal YAML can no longer slip past this test.

3. Gate stdin patching made runner-robust. In test_gate_abort_still_halts_with_continue_on_error, swapped sys.stdin itself for a stub _TTYStdin object instead of patching sys.stdin.isatty. Method-on-instance assignment is unreliable on real io.TextIOWrapper objects (e.g. under pytest with capture disabled), so replacing the whole stdin object is more robust across runners.

Full suite still passes (continue_on_error: 7/7), no regressions. Branch is MERGEABLE and ready for another look whenever you have a moment.

AI disclosure: drafted with Claude Opus, human-reviewed.

Closes github#2591.

Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (exit_code, stderr, status) into `steps.<id>.output` and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path.

This composes with primitives that already exist (the exit code is already captured, the expression engine already resolves it, and `if`/`switch`/`gate` are already available); the only gap was that a non-zero exit hard-stopped the pipeline before any downstream step could evaluate it.

### Engine

`WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`:

- Gate aborts (`output.aborted`) always halt the run; operator decisions take precedence over the flag.
- Otherwise, if `continue_on_error: true`, log a `step_continue_on_error` event and proceed to the next sibling.
- Otherwise, behave as before: set `RunStatus.FAILED` and return.

### Validation

`_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics.

### Default behaviour preserved

When `continue_on_error` is omitted, every code path is byte-equivalent to before this change. Existing workflows see no difference.

### Tests

New `TestContinueOnError` class in `tests/test_workflows.py` covers all four scenarios from the issue's acceptance criteria plus two extras:

- undeclared (default) failure halts the run.
- declared-and-fired continues past the failure.
- declared-but-step-succeeded is a no-op (flag only matters on FAILED).
- if-branch end-to-end exercising the canonical recovery pattern from the issue discussion.
- gate abort still halts even with `continue_on_error: true` set.
- validation rejects non-bool values; accepts both `true` and `false` cleanly.

### Docs

Adds an "Error Handling" section to `workflows/README.md` documenting the field, the gate-abort precedence rule, and the canonical recovery pattern.

### Follow-on

Auto-retry-on-transient (e.g. retry a 429 at 3 AM without operator attendance) is intentionally out of scope. The current proposal covers the **skip** and **abort** verdicts from the original discussion; the **retry** verdict still pauses for an operator at the gate step. A future loop/retry-count primitive or an auto-approving gate could close that gap on top of this mechanism without further engine changes.

- Reword README "Error Handling" intro in terms of `StepStatus.FAILED` halting by default, with non-zero shell/command exit as one common cause. Avoids implying only exit codes can halt a run (gate aborts and validation failures also do, just via different mechanisms).
- Tighten `test_validation_accepts_bool_continue_on_error` to assert `errors == []` instead of "no error mentions continue_on_error", so unrelated validation regressions on the same minimal YAML can no longer slip past this test.
- In `test_gate_abort_still_halts_with_continue_on_error`, swap `sys.stdin` itself for a stub `_TTYStdin` instead of patching `sys.stdin.isatty`. Method-on-instance assignment is unreliable on real `io.TextIOWrapper` objects (e.g. under pytest with capture disabled), so replacing the whole stdin object is more robust across runners.

All 2967 tests still pass.

doquanghuy requested a review from mnriem as a code owner May 21, 2026 14:34

doquanghuy force-pushed the feat/continue-on-error branch 2 times, most recently from f34ab4c to da8ed4d Compare May 21, 2026 14:48

mnriem requested a review from Copilot May 26, 2026 12:12

Copilot started reviewing on behalf of mnriem May 26, 2026 12:20

Copilot AI reviewed May 26, 2026

mnriem requested a review from Copilot May 27, 2026 12:05

Copilot started reviewing on behalf of mnriem May 27, 2026 12:06

Copilot AI reviewed May 27, 2026

doquanghuy added 2 commits May 27, 2026 19:56

doquanghuy force-pushed the feat/continue-on-error branch from e9a4871 to a0e78ee Compare May 27, 2026 12:56

doquanghuy wants to merge 2 commits into github:main from doquanghuy:feat/continue-on-error


3 participants
