Commit f34ab4c

feat(workflows): add continue_on_error step field
Closes #2591. Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (exit_code, stderr, status) into `steps.<id>.output` and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path.

This composes with primitives that already exist (the exit code is already captured, the expression engine already resolves it, and `if`/`switch`/`gate` are already available). The only gap was that a non-zero exit hard-stopped the pipeline before any downstream step could evaluate it.

### Engine

`WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`:

- Gate aborts (`output.aborted`) always halt the run; operator decisions take precedence over the flag.
- Otherwise, if `continue_on_error: true`, log a `step_continue_on_error` event and proceed to the next sibling.
- Otherwise, behave as before: set `RunStatus.FAILED` and return.

### Validation

`_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics.

### Default behaviour preserved

When `continue_on_error` is omitted, every code path is byte-equivalent to before this change. Existing workflows see no difference.

### Tests

New `TestContinueOnError` class in `tests/test_workflows.py` covers all four scenarios from the issue's acceptance criteria plus two extras:

- undeclared (default) failure halts the run.
- declared-and-fired continues past the failure.
- declared-but-step-succeeded is a no-op (flag only matters on FAILED).
- if-branch end-to-end exercising the canonical recovery pattern from the issue discussion.
- gate abort still halts even with `continue_on_error: true` set.
- validation rejects non-bool values; accepts both `true` and `false` cleanly.

### Docs

Adds an "Error Handling" section to `workflows/README.md` documenting the field, the gate-abort precedence rule, and the canonical recovery pattern.

### Follow-on

Auto-retry-on-transient (e.g. retry a 429 at 3 AM without operator attendance) is intentionally out of scope. The current proposal covers the **skip** and **abort** verdicts from the original discussion; the **retry** verdict still pauses for an operator at the gate step. A future loop/retry-count primitive or an auto-approving gate could close that gap on top of this mechanism without further engine changes.
1 parent 1bf4a6e commit f34ab4c

3 files changed

Lines changed: 329 additions & 4 deletions

src/specify_cli/workflows/engine.py

Lines changed: 40 additions & 4 deletions
@@ -231,6 +231,20 @@ def _validate_steps(
             step_errors = step_impl.validate(step_config)
             errors.extend(step_errors)
 
+        # Validate optional `continue_on_error` field. The engine honours
+        # this on any step that returns FAILED so the pipeline can route
+        # around the failure via downstream `if`/`switch`/`gate`. The
+        # field must be a literal boolean — coercion from truthy strings
+        # is deliberately not supported so authoring mistakes surface
+        # at validation time rather than silently changing run semantics.
+        if "continue_on_error" in step_config:
+            coe = step_config["continue_on_error"]
+            if not isinstance(coe, bool):
+                errors.append(
+                    f"Step {step_id!r}: 'continue_on_error' must be a "
+                    f"boolean, got {type(coe).__name__}."
+                )
+
         # Recursively validate nested steps
         for nested_key in ("then", "else", "steps"):
             nested = step_config.get(nested_key)
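As a standalone illustration of the check added in this hunk, the sketch below reproduces the `isinstance(coe, bool)` test outside the engine. The function name `check_continue_on_error` and the sample configs are hypothetical; only the condition and the error message format are taken from the diff. It shows why quoted strings and integers are both rejected: in Python `bool` is a subclass of `int`, but `1` is not an instance of `bool`.

```python
from typing import Any


def check_continue_on_error(step_id: str, step_config: dict[str, Any]) -> list[str]:
    """Return validation errors for the optional `continue_on_error` field.

    Hypothetical helper mirroring the condition shown in the diff; it is
    not the engine's actual `_validate_steps` function.
    """
    errors: list[str] = []
    if "continue_on_error" in step_config:
        coe = step_config["continue_on_error"]
        # Only literal True/False pass. Strings such as "true" and
        # integers such as 1 are rejected rather than coerced.
        if not isinstance(coe, bool):
            errors.append(
                f"Step {step_id!r}: 'continue_on_error' must be a "
                f"boolean, got {type(coe).__name__}."
            )
    return errors


# Sample configs as they might look after YAML parsing (assumed shapes).
print(check_continue_on_error("a", {"continue_on_error": True}))    # []
print(check_continue_on_error("b", {"continue_on_error": "true"}))  # one error, "got str"
print(check_continue_on_error("c", {"continue_on_error": 1}))       # one error, "got int"
print(check_continue_on_error("d", {}))                             # [] (field omitted)
```

Omitting the field produces no error, which matches the commit's claim that existing workflows are unaffected.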
@@ -622,7 +636,10 @@ def _execute_steps(
 
             # Handle failures
             if result.status == StepStatus.FAILED:
-                # Gate abort (output.aborted) maps to ABORTED status
+                # Gate abort (output.aborted) maps to ABORTED status.
+                # Aborts are deliberate operator decisions, so
+                # `continue_on_error` does NOT override them — that flag
+                # is for transient/expected step failures only.
                 if result.output.get("aborted"):
                     state.status = RunStatus.ABORTED
                     state.append_log(
@@ -631,15 +648,34 @@ def _execute_steps(
                             "step_id": step_id,
                         }
                     )
-                else:
-                    state.status = RunStatus.FAILED
+                    state.save()
+                    return
+
+                state.append_log(
+                    {
+                        "event": "step_failed",
+                        "step_id": step_id,
+                        "error": result.error,
+                    }
+                )
+
+                # `continue_on_error: true` lets the pipeline route
+                # around the failure instead of halting. The step
+                # result (including exit_code, stderr, status) is
+                # still recorded so downstream `if`/`switch`/`gate`
+                # steps can branch on it.
+                if step_config.get("continue_on_error"):
                     state.append_log(
                         {
-                            "event": "step_failed",
+                            "event": "step_continue_on_error",
                             "step_id": step_id,
                             "error": result.error,
                         }
                     )
+                    state.save()
+                    continue
+
+                state.status = RunStatus.FAILED
                 state.save()
                 return
 
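The control flow in the two hunks above can be modelled with a small toy loop. This is a sketch under stated assumptions, not the engine's code: `run_steps`, the `(config, runner)` step shape and the plain-string statuses are hypothetical stand-ins for `WorkflowEngine._execute_steps`, `StepStatus` and `RunStatus`. The event names `step_failed` and `step_continue_on_error` come from the diff; the abort event name is not visible in the diff, so the sketch logs nothing for aborts.

```python
from typing import Any, Callable

# A hypothetical step: its config dict plus a callable returning (status, output).
Step = tuple[dict[str, Any], Callable[[], tuple[str, dict[str, Any]]]]


def run_steps(steps: list[Step]) -> tuple[str, dict[str, Any], list[str]]:
    """Run sibling steps in order; return (run_status, results, event_names)."""
    results: dict[str, Any] = {}
    events: list[str] = []
    for config, runner in steps:
        step_id = config["id"]
        status, output = runner()
        # The result is recorded whether or not the step failed.
        results[step_id] = {"status": status, "output": output}
        if status != "failed":
            continue
        if output.get("aborted"):
            # An operator abort halts the run even if the flag is set.
            return "aborted", results, events
        events.append("step_failed")
        if config.get("continue_on_error"):
            # Both events are logged, then the next sibling runs.
            events.append("step_continue_on_error")
            continue
        return "failed", results, events
    return "completed", results, events


status, results, events = run_steps([
    ({"id": "flaky-step", "continue_on_error": True},
     lambda: ("failed", {"exit_code": 42})),
    ({"id": "after"}, lambda: ("completed", {"exit_code": 0})),
])
print(status)   # completed
print(events)   # ['step_failed', 'step_continue_on_error']
print(results["flaky-step"]["output"]["exit_code"])  # 42
```

In this model, as in the diff, a continued failure logs `step_failed` first and `step_continue_on_error` second, so log consumers that already watch for `step_failed` still see the failure.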
tests/test_workflows.py

Lines changed: 250 additions & 0 deletions
@@ -1890,6 +1890,256 @@ def test_validate_workflow_rejects_non_string_default_for_string_type(self):
         assert any("invalid default" in e for e in errors), errors
 
 
+# ===== continue_on_error Tests =====
+#
+# Locks the contract documented in workflows/README.md "Error Handling"
+# section: when an executable step fails and `continue_on_error: true`
+# is declared, the engine records the result (exit_code, stderr, status)
+# and continues to the next sibling step instead of halting the run.
+# Gate aborts (`output.aborted`) still halt regardless of the flag.
+
+
+class TestContinueOnError:
+    """Test the `continue_on_error` step-level field."""
+
+    def test_undeclared_failure_halts_run(self, project_dir):
+        """Default behaviour (no `continue_on_error`): a failing step
+        halts the workflow run with `status == FAILED`.
+
+        Locks the byte-equivalent default — workflows that do not
+        declare the flag must behave exactly as before this feature.
+        """
+        from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
+        from specify_cli.workflows.base import RunStatus
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "halt-on-fail"
+  name: "Halt On Fail"
+  version: "1.0.0"
+steps:
+  - id: fail-step
+    type: shell
+    run: "exit 7"
+  - id: after
+    type: shell
+    run: "echo should-not-run"
+""")
+        engine = WorkflowEngine(project_dir)
+        state = engine.execute(definition)
+
+        assert state.status == RunStatus.FAILED
+        assert "fail-step" in state.step_results
+        assert state.step_results["fail-step"]["output"]["exit_code"] == 7
+        # Subsequent step never executes when the flag is absent.
+        assert "after" not in state.step_results
+
+    def test_declared_and_fired_continues_run(self, project_dir):
+        """`continue_on_error: true` + failing step: the run keeps
+        going, the failed step's result is recorded, and the
+        downstream step runs.
+        """
+        from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
+        from specify_cli.workflows.base import RunStatus
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "continue-past-fail"
+  name: "Continue Past Fail"
+  version: "1.0.0"
+steps:
+  - id: flaky-step
+    type: shell
+    run: "exit 42"
+    continue_on_error: true
+  - id: after
+    type: shell
+    run: "echo did-run"
+""")
+        engine = WorkflowEngine(project_dir)
+        state = engine.execute(definition)
+
+        assert state.status == RunStatus.COMPLETED
+        # Failed step's exit_code is preserved so downstream branching
+        # can inspect it.
+        assert state.step_results["flaky-step"]["output"]["exit_code"] == 42
+        assert state.step_results["flaky-step"]["status"] == "failed"
+        # Downstream step ran successfully.
+        assert state.step_results["after"]["output"]["exit_code"] == 0
+
+    def test_declared_but_step_succeeded_is_noop(self, project_dir):
+        """`continue_on_error: true` on a step that succeeds is a
+        no-op — the flag only changes behaviour on FAILED status.
+        """
+        from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
+        from specify_cli.workflows.base import RunStatus
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "flag-but-success"
+  name: "Flag But Success"
+  version: "1.0.0"
+steps:
+  - id: ok-step
+    type: shell
+    run: "echo ok"
+    continue_on_error: true
+  - id: after
+    type: shell
+    run: "echo done"
+""")
+        engine = WorkflowEngine(project_dir)
+        state = engine.execute(definition)
+
+        assert state.status == RunStatus.COMPLETED
+        assert state.step_results["ok-step"]["status"] == "completed"
+        assert state.step_results["ok-step"]["output"]["exit_code"] == 0
+        assert state.step_results["after"]["output"]["exit_code"] == 0
+
+    def test_if_branch_routes_around_failure(self, project_dir):
+        """End-to-end: `continue_on_error` + `if` cleanly routes around
+        a failure. The recovery branch runs; the success branch does
+        not.
+
+        Mirrors the canonical usage pattern from the original feature
+        discussion in issue #2591.
+        """
+        from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
+        from specify_cli.workflows.base import RunStatus
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "route-around"
+  name: "Route Around Failure"
+  version: "1.0.0"
+steps:
+  - id: heavy-thing
+    type: shell
+    run: "exit 1"
+    continue_on_error: true
+  - id: check-result
+    type: if
+    condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
+    then:
+      - id: recovery
+        type: shell
+        run: "echo recovery-ran"
+    else:
+      - id: happy-path
+        type: shell
+        run: "echo happy-path-ran"
+""")
+        engine = WorkflowEngine(project_dir)
+        state = engine.execute(definition)
+
+        assert state.status == RunStatus.COMPLETED
+        assert "recovery" in state.step_results
+        assert "happy-path" not in state.step_results
+
+    def test_gate_abort_still_halts_with_continue_on_error(
+        self, project_dir, monkeypatch
+    ):
+        """`continue_on_error` does NOT override a deliberate gate
+        abort. `output.aborted` always halts the run with
+        `status == ABORTED`.
+
+        Aborts are explicit operator decisions; continue_on_error
+        is for transient/expected step failures only.
+        """
+        from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
+        from specify_cli.workflows.base import RunStatus
+        from specify_cli.workflows.steps.gate import GateStep
+        from specify_cli.workflows.steps import gate as gate_module
+
+        # Force the gate step into interactive mode and feed a "reject"
+        # choice so the abort path actually runs in the test env
+        # (default behaviour returns PAUSED when stdin is not a TTY).
+        monkeypatch.setattr(gate_module.sys.stdin, "isatty", lambda: True)
+        monkeypatch.setattr(
+            GateStep, "_prompt", staticmethod(lambda _msg, _opts: "reject")
+        )
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "gate-abort-halts"
+  name: "Gate Abort Halts"
+  version: "1.0.0"
+steps:
+  - id: gate-step
+    type: gate
+    message: "Approve?"
+    options: [approve, reject]
+    on_reject: abort
+    continue_on_error: true
+  - id: should-not-run
+    type: shell
+    run: "echo nope"
+""")
+        engine = WorkflowEngine(project_dir)
+        state = engine.execute(definition)
+
+        assert state.status == RunStatus.ABORTED
+        assert "should-not-run" not in state.step_results
+
+    def test_validation_rejects_non_bool_continue_on_error(self):
+        """`continue_on_error` must be a literal boolean; coerced
+        strings like `"true"` are rejected at validation time so
+        authoring mistakes surface before execution.
+        """
+        from specify_cli.workflows.engine import (
+            WorkflowDefinition,
+            validate_workflow,
+        )
+
+        definition = WorkflowDefinition.from_string("""
+schema_version: "1.0"
+workflow:
+  id: "bad-coe"
+  name: "Bad COE"
+  version: "1.0.0"
+steps:
+  - id: step-one
+    type: shell
+    run: "true"
+    continue_on_error: "true"
+""")
+        errors = validate_workflow(definition)
+        assert any(
+            "continue_on_error" in e and "boolean" in e for e in errors
+        ), errors
+
+    def test_validation_accepts_bool_continue_on_error(self):
+        """Boolean values pass validation cleanly."""
+        from specify_cli.workflows.engine import (
+            WorkflowDefinition,
+            validate_workflow,
+        )
+
+        for value in (True, False):
+            yaml_value = "true" if value else "false"
+            definition = WorkflowDefinition.from_string(f"""
+schema_version: "1.0"
+workflow:
+  id: "good-coe"
+  name: "Good COE"
+  version: "1.0.0"
+steps:
+  - id: step-one
+    type: shell
+    run: "true"
+    continue_on_error: {yaml_value}
+""")
+            errors = validate_workflow(definition)
+            assert not any(
+                "continue_on_error" in e for e in errors
+            ), errors
+
+
 # ===== State Persistence Tests =====
 
 class TestRunState:

workflows/README.md

Lines changed: 39 additions & 0 deletions
@@ -219,6 +219,45 @@ Aggregate results from fan-out steps:
   output: {}
 ```
 
+## Error Handling
+
+By default, a non-zero exit code from any step halts the entire run.
+Set `continue_on_error: true` on a step to record its result and
+continue to the next sibling step instead. The exit code remains
+available on `steps.<id>.output.exit_code` so downstream `if`,
+`switch`, or `gate` steps can branch on it:
+
+```yaml
+- id: heavy-thing
+  type: command
+  integration: claude
+  command: speckit.heavy-thing
+  continue_on_error: true
+
+- id: check-result
+  type: if
+  condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
+  then:
+    - id: review
+      type: gate
+      message: "Step failed (exit {{ steps.heavy-thing.output.exit_code }}). Retry or skip?"
+      on_reject: skip
+  else:
+    - id: next-thing
+      command: speckit.next-thing
+```
+
+**Notes:**
+
+- The field must be a literal boolean (`true` / `false`); coerced
+  strings like `"true"` are rejected at validation time.
+- Gate aborts (`on_reject: abort` chosen by the operator) always halt
+  the run — `continue_on_error` does not override them. The flag is
+  for transient/expected step failures, not for overriding deliberate
+  operator decisions.
+- When the flag is omitted, behaviour is byte-equivalent to before
+  this feature.
+
 ## Expressions
 
 Workflow definitions use `{{ expression }}` syntax for dynamic values:
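The recovery pattern in the README's new Error Handling section depends on the failed step's exit code staying readable under `steps.<id>.output.exit_code`. The sketch below models that lookup and the `then`/`else` choice with plain dictionaries. It is a stand-in for illustration only: `lookup` and `choose_branch` are hypothetical helpers, not the project's expression engine, and the shape of `results` is assumed from the test assertions on `state.step_results`.

```python
from typing import Any


def lookup(results: dict[str, Any], dotted_path: str) -> Any:
    """Resolve a path such as 'steps.heavy-thing.output.exit_code'.

    Hypothetical helper: it walks nested dicts key by key and does not
    parse `{{ ... }}` expressions.
    """
    node: Any = {"steps": results}
    for key in dotted_path.split("."):
        node = node[key]
    return node


def choose_branch(results: dict[str, Any], step_id: str) -> str:
    """Return 'then' when the step's exit code is non-zero, else 'else'."""
    exit_code = lookup(results, f"steps.{step_id}.output.exit_code")
    return "then" if exit_code != 0 else "else"


# Recorded results after `heavy-thing` failed with continue_on_error set.
failed_run = {"heavy-thing": {"status": "failed", "output": {"exit_code": 1}}}
print(choose_branch(failed_run, "heavy-thing"))  # then

# The same lookup when the step succeeded.
ok_run = {"heavy-thing": {"status": "completed", "output": {"exit_code": 0}}}
print(choose_branch(ok_run, "heavy-thing"))  # else
```

Without `continue_on_error`, the run would halt at `heavy-thing` and the `if` step would never get to evaluate its condition, which is the gap the commit message describes.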
