Skip to content

[spark-compete] fix(security): keep revoke-all best-effort when a module manifest is broken#751

Open
banse wants to merge 1 commit into
vibeforge1111:masterfrom
banse:fix/revoke-all-best-effort-survives-module-systemexit
Open

[spark-compete] fix(security): keep revoke-all best-effort when a module manifest is broken#751
banse wants to merge 1 commit into
vibeforge1111:masterfrom
banse:fix/revoke-all-best-effort-survives-module-systemexit

Conversation

@banse
Copy link
Copy Markdown
Contributor

@banse banse commented Jun 2, 2026

packet

Schema: spark-compete-hotfix-v1 · Event: spark-compete-first-event · Submission: public_repo_pr

team

The Dudes — His Dudeness, El Duderino, Duder. LLM device holder: His Dudeness (github banse). GitHub accounts: banse.

pr_author

banse

repo

vibeforge1111/spark-cli

actual_behavior

resolve_installed_modules_best_effort() exists to never fail because of a broken installed module — it wraps resolve_installed_modules() in try/except Exception: return {}. A recent R24 commit (d2c6bc2, "Adopt R24 CLI provider and module reliability fixes") changed load_module to raise SystemExit(...) on a missing or corrupt module manifest. SystemExit is a BaseException, not an Exception, so it slips straight through the except Exception guard. As a result, a single installed module with a missing/corrupt spark.toml aborts callers that are supposed to degrade gracefully. The worst-affected caller is spark security revoke-all: it calls rotate_revoke_all_env_keys() and disable_revoke_all_custom_mcp() directly, both of which call resolve_installed_modules_best_effort(). So the emergency credential-revocation command crashes with a raw SystemExit after deleting stored secrets but before disabling custom MCP, pausing missions, or writing the support bundle — defeating the command's deliberate per-step fault tolerance (every other step returns {ok: false, error} instead of crashing).

expected_behavior

resolve_installed_modules_best_effort() must honor its contract and return {} on any module-resolution failure, including the SystemExit that load_module now raises, so a single broken module manifest cannot abort spark security revoke-all (or any other best-effort caller).

repro_steps

  1. Register one installed module whose directory has no spark.toml (e.g. installed.json = {"ghost": {"path": "/path/with/no/spark.toml"}}).
  2. Call resolve_installed_modules_best_effort() (directly, or via spark security revoke-all).
  3. On current master it raises SystemExit('Module manifest not found: …/spark.toml') instead of returning {}; in revoke-all this aborts the command mid-flow.
  4. Deterministic: python -m pytest tests/test_cli.py -k resolve_installed_modules_best_effort_survives_broken_manifest — fails before the fix (helper leaks SystemExit), passes after.

before_after_proof

BEFORE (clean master eda756e): the new test fails — the assertRaises(SystemExit) on the raw resolve_installed_modules() passes, then resolve_installed_modules_best_effort() also raises SystemExit('Module manifest not found: …/ghost-module/spark.toml') instead of returning {} (issubclass(SystemExit, Exception) == False). AFTER (this PR): the helper returns {}; the test passes. Full-suite baseline comparison on the same machine: clean master = 33 failed / 623 passed; with-fix = 33 failed / 624 passed (identical pre-existing env failures — macOS /tmp symlink + TCC sandbox — and exactly +1 new passing test, 0 new failures).

tests_or_smoke

python -m pytest tests/test_cli.py -k resolve_installed_modules_best_effort_survives_broken_manifest (fails pre-fix, passes post-fix). Full tests/test_cli.py: +1 passing test, no new failures vs clean master (33 == 33 pre-existing env failures).

duplicate_notes

No open PR/issue addresses the best-effort/BaseException interaction. Re-checked open spark-cli PRs by ROOT CAUSE: #277/#592/#658 make load_module itself report missing/malformed manifests — #277 in fact proposed the SystemExit that d2c6bc2 adopted; my finding is the inverse, that the adopted SystemExit defeats the pre-existing except Exception best-effort guard and aborts revoke-all. #745 only adds logging to silent handlers (different root cause). Root cause was introduced by d2c6bc2 (confirmed ancestor of canonical master eda756e), with no follow-up fix.

risk_notes

Low risk, security-positive. One-line change: broadens the except in the resolve_installed_modules_best_effort helper from Exception to (Exception, SystemExit), restoring the helper's documented "degrade to {}" contract. It strictly adds fault tolerance — it cannot newly suppress a build or swallow a result the helper previously returned (the success path is unchanged). load_module's own SystemExit behavior on direct callers (e.g. spark start) is untouched. No installer/CI/dependency/secret surface modified.

review_claim

Impact: medium (breaks fault tolerance of a security-emergency command). Evidence: failing_test, passing_test, redacted_terminal_excerpt. Requested review state: pr_review. Scope: the resolve_installed_modules_best_effort helper only; load_module's raising behavior and other callers are intentionally unchanged.

…ule manifest is broken

resolve_installed_modules_best_effort() only caught `except Exception`, but
load_module now raises SystemExit (a BaseException) on a missing/corrupt module
manifest. SystemExit escaped the guard, so a single broken installed module
aborted `spark security revoke-all` mid-flow -- after deleting secrets, before
disabling custom MCP / pausing missions / writing the support bundle -- defeating
the command's deliberate per-step fault tolerance. Catch SystemExit too so the
helper degrades to {} as intended. Adds a regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant