Skip to content

audit: patch the holes you could drive a bus through#60

Merged
pi3-code merged 2 commits intomainfrom
audit/production-hardening
Feb 27, 2026
Merged

audit: patch the holes you could drive a bus through#60
pi3-code merged 2 commits intomainfrom
audit/production-hardening

Conversation

@Treelovah
Copy link
Contributor

Summary

Production hardening sweep — the kind of cleanup you do when you realize the codebase has been getting away with things.

The falsy-zero bug: overallScore || fallback() treats 0 as falsy. Every run where the LLM returned a legitimate score of zero got silently replaced with a fallback average. Two characters (||??), months of quiet data corruption fixed. Same for decisionQuality.

Type safety: Replaced every any type and unsafe as cast at trust boundaries with Zod validation schemas and proper type guards. LLM responses, tool inputs, error objects — all of it. catch (error: any) is gone.

Deduplication: shellEscape (3 copies → 1), provider normalization (2 identical alias maps → 1), run-ID validation (2 regexes → 1), provider URLs (redundant map deleted, reads from PROVIDERS registry directly).

New infrastructure: Error class hierarchy (OasisErrorConfigError/DockerError/ValidationError/AnalysisError), named constants replacing 15+ magic numbers, message array sliding window (bounded memory), Docker exec retry with backoff.

27 new tests (363 → 390): malformed LLM response parsing, shell escape injection attempts, error extraction edge cases.

Files

New Purpose
src/lib/schemas.ts Zod schemas + typed JsonReport
src/lib/errors.ts OasisError hierarchy
src/lib/shell.ts Shared shellEscape
src/lib/constants.ts Named constants
tests/unit/shell.test.ts Shell escape tests

Test plan

  • npm run build — clean
  • npm test — 390/390 passing
  • Smoke test: npx oasis --help, npx oasis challenges

cc @pi3-code

Turns out if you squint at the codebase long enough, it squints back
with `any` types and copy-pasted shell escaping. Time to fix that
before it fixes us.

The crown jewel: `overallScore || fallback()`. Looks fine until you
remember that 0 is falsy in JavaScript. So every run where the LLM
returned a perfectly legitimate score of zero got its result quietly
tossed in the garbage and replaced with a fallback average. Two
characters — `||` to `??` — and suddenly months of KSM scores make
sense again. You could drive a truck through this bug and nobody
would notice because the fallback was "close enough." decisionQuality
had the same problem. Wonderful.

The `any` types were doing that thing where they make TypeScript
feel like JavaScript with extra steps. LLM responses parsed with
JSON.parse() and immediately trusted via `as { command: string }` —
because why would an LLM ever return something unexpected, right?
Every catch block was `catch (error: any)` followed by a prayer.
Replaced it all with Zod schemas at trust boundaries and actual type
guards for error handling, like civilized people.

Three files had their own artisanal copy of shellEscape(). Two files
had their own provider normalization with identical alias maps. Two
regex patterns validated run IDs independently. A DEFAULT_PROVIDER_URLS
map existed purely to disagree with the PROVIDERS registry it was
supposed to mirror. Consolidated everything — one function, one place,
one truth.

Also: error class hierarchy (so you can tell "config broke" from
"docker broke" without reading tea leaves), 15+ magic numbers
extracted into named constants, a sliding window on message arrays
so 50-iteration runs don't eat your RAM for lunch, and Docker exec
retry with backoff because sometimes the daemon just needs a moment.

27 new tests, 390 total, all green. The shellEscape tests include
injection attempts because trust is earned.
@Treelovah Treelovah requested a review from pi3-code February 27, 2026 17:19
Copy link
Contributor

@pi3-code pi3-code left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really solid hardening pass. The falsy-zero fix alone is worth the PR — that's been silently corrupting scores for a while. The Zod schemas at trust boundaries, shellEscape dedup, error hierarchy, and constants extraction are all things this codebase needed. The 27 new tests are substantive (malformed LLM output, shell injection attempts), not filler. Nice work.

I pulled the branch locally, built it, and wrote repro tests for the issues below. Three of them will cause runtime failures, so I'm requesting changes on those. Everything else is a suggestion or a nit.


Blocking — will fail in production

1. Claude agent skips tool_result on invalid tool input (runner.ts:455-459)

When ToolInputSchema.safeParse fails, the code does continue without pushing a tool_result message. The Anthropic Messages API requires a tool_result for every tool_use block — the next API call will reject with a validation error. The OpenAI path (line ~688) already handles this correctly by sending an error message back to the model. The Claude path needs the same pattern.

Suggested fix:

if (!toolInput.success) {
  messages.push({
    role: 'user',
    content: [{
      type: 'tool_result',
      tool_use_id: block.id,
      content: `Error: invalid tool input: ${JSON.stringify(block.input)}`,
      is_error: true,
    }],
  });
  continue;
}

2. trimMessages will break Anthropic role alternation (runner.ts:32-35)

I reproduced this locally: with 43 messages in a user/assistant/user/assistant sequence, after trimming, msg[0] (user) is followed by msg[4] (also user) — adjacent same-role messages. Anthropic's API rejects this. Needs to either trim in user/assistant pairs, or merge adjacent same-role messages after trimming. Would also benefit from unit tests since it's not exported or tested currently.

3. Plain object throw loses error message (env-check.ts:321)

throw { status: 401, message: '...' } is caught by the outer catch at line 333, which does error instanceof Error ? error.message : 'Unknown error'. Since a plain object isn't an Error instance, users see "Unknown error" instead of "Incorrect API key". This is a good candidate for the new ConfigError class from this same PR.

throw new ConfigError(typeof vErr.message === 'string' ? vErr.message : 'Incorrect API key');

Worth fixing in this PR

4. buildRubricScore still uses || 0 (analyzer.ts:401/406/411)

Same falsy-zero bug class that this branch fixes for overallScore and decisionQuality. reconQuality?.score || 0, techniqueSelection?.score || 0, adaptability?.score || 0 should be ?? 0. Currently the fallback is also 0 so the behavior is the same, but it's inconsistent with the fix pattern in the rest of the PR and will bite if the fallback ever changes.

5. ToolInputSchema accepts empty strings (schemas.ts:80)

z.string() accepts "". An empty command passes validation and sends empty stdin to docker exec -i container bash, which hangs for the full 60s timeout. z.string().min(1) would catch this.

6. Docker retry doesn't distinguish transient from permanent errors (runner.ts:189-215)

executeCommand retries all failures 3 times with sleep backoff. A genuine command syntax error (exit code 1, predictable stderr) will waste ~3s before returning the same error. Might be worth checking if the error is Docker-connectivity related before retrying.

7. env-check.ts error extraction could reuse getErrorStatus() (env-check.ts:278/303/334)

Three verbose inline error-status extraction blocks duplicate what getErrorStatus() from retry.ts already does. Would cut ~15 lines of dense type-casting boilerplate.


Overall this is a strong PR. The three blocking items are straightforward fixes — happy to re-review once those are addressed.

Blocking:
- runner: push assistant + error tool_result on invalid input before
  continue, preventing Claude API role-alternation violation
- runner: export trimMessages with generic constraint; use while-loop
  to drop all consecutive same-role messages at trim boundary
- env-check: replace plain object throw with ConfigError so
  getErrorStatus() routes to the 401 branch correctly

Worth fixing:
- schemas: reject empty command strings (z.string().min(1))
- analyzer: use ?? instead of || for rubric scores so 0 isn't falsy
- env-check: reuse getErrorStatus() in all 3 catch blocks, remove
  unused fs/path imports
- runner: add isDockerTransientError() to skip retries on permanent
  failures, remove unused DockerError import

Tests: 7 new trimMessages tests (397 total, all passing)
@Treelovah
Copy link
Contributor Author

All 7 issues addressed in your commit (3534f2f). Ran 4 full benchmarks post-fix — Sonnet 4.5, Gemini 2.5 Pro, Grok 4, and o3 against proxy-auth-bypass. Runner is solid across all providers, trimMessages survives well past the old iteration 21 crash point, no API errors. 397 tests passing. Ready for your re-approval so we can merge.

@pi3-code pi3-code merged commit 237aa6d into main Feb 27, 2026
4 checks passed
@Treelovah Treelovah deleted the audit/production-hardening branch March 2, 2026 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants