audit: patch the holes you could drive a bus through by Treelovah · Pull Request #60 · KryptSec/oasis

Treelovah · 2026-02-27T14:48:10Z

Summary

Production hardening sweep — the kind of cleanup you do when you realize the codebase has been getting away with things.

The falsy-zero bug: overallScore || fallback() treats 0 as falsy. Every run where the LLM returned a legitimate score of zero got silently replaced with a fallback average. Two characters (|| → ??), months of quiet data corruption fixed. Same for decisionQuality.

Type safety: Replaced every any type and unsafe as cast at trust boundaries with Zod validation schemas and proper type guards. LLM responses, tool inputs, error objects — all of it. catch (error: any) is gone.

Deduplication: shellEscape (3 copies → 1), provider normalization (2 identical alias maps → 1), run-ID validation (2 regexes → 1), provider URLs (redundant map deleted, reads from PROVIDERS registry directly).

New infrastructure: Error class hierarchy (OasisError → ConfigError/DockerError/ValidationError/AnalysisError), named constants replacing 15+ magic numbers, message array sliding window (bounded memory), Docker exec retry with backoff.

27 new tests (363 → 390): malformed LLM response parsing, shell escape injection attempts, error extraction edge cases.

Files

New	Purpose
`src/lib/schemas.ts`	Zod schemas + typed JsonReport
`src/lib/errors.ts`	OasisError hierarchy
`src/lib/shell.ts`	Shared shellEscape
`src/lib/constants.ts`	Named constants
`tests/unit/shell.test.ts`	Shell escape tests

Test plan

npm run build — clean
npm test — 390/390 passing
Smoke test: npx oasis --help, npx oasis challenges

cc @pi3-code

Turns out if you squint at the codebase long enough, it squints back with `any` types and copy-pasted shell escaping. Time to fix that before it fixes us. The crown jewel: `overallScore || fallback()`. Looks fine until you remember that 0 is falsy in JavaScript. So every run where the LLM returned a perfectly legitimate score of zero got its result quietly tossed in the garbage and replaced with a fallback average. Two characters — `||` to `??` — and suddenly months of KSM scores make sense again. You could drive a truck through this bug and nobody would notice because the fallback was "close enough." decisionQuality had the same problem. Wonderful. The `any` types were doing that thing where they make TypeScript feel like JavaScript with extra steps. LLM responses parsed with JSON.parse() and immediately trusted via `as { command: string }` — because why would an LLM ever return something unexpected, right? Every catch block was `catch (error: any)` followed by a prayer. Replaced it all with Zod schemas at trust boundaries and actual type guards for error handling, like civilized people. Three files had their own artisanal copy of shellEscape(). Two files had their own provider normalization with identical alias maps. Two regex patterns validated run IDs independently. A DEFAULT_PROVIDER_URLS map existed purely to disagree with the PROVIDERS registry it was supposed to mirror. Consolidated everything — one function, one place, one truth. Also: error class hierarchy (so you can tell "config broke" from "docker broke" without reading tea leaves), 15+ magic numbers extracted into named constants, a sliding window on message arrays so 50-iteration runs don't eat your RAM for lunch, and Docker exec retry with backoff because sometimes the daemon just needs a moment. 27 new tests, 390 total, all green. The shellEscape tests include injection attempts because trust is earned.

pi3-code

Really solid hardening pass. The falsy-zero fix alone is worth the PR — that's been silently corrupting scores for a while. The Zod schemas at trust boundaries, shellEscape dedup, error hierarchy, and constants extraction are all things this codebase needed. The 27 new tests are substantive (malformed LLM output, shell injection attempts), not filler. Nice work.

I pulled the branch locally, built it, and wrote repro tests for the issues below. Three of them will cause runtime failures, so I'm requesting changes on those. Everything else is a suggestion or a nit.

Blocking — will fail in production

1. Claude agent skips tool_result on invalid tool input (runner.ts:455-459)

When ToolInputSchema.safeParse fails, the code does continue without pushing a tool_result message. The Anthropic Messages API requires a tool_result for every tool_use block — the next API call will reject with a validation error. The OpenAI path (line ~688) already handles this correctly by sending an error message back to the model. The Claude path needs the same pattern.

Suggested fix:

if (!toolInput.success) {
  messages.push({
    role: 'user',
    content: [{
      type: 'tool_result',
      tool_use_id: block.id,
      content: `Error: invalid tool input: ${JSON.stringify(block.input)}`,
      is_error: true,
    }],
  });
  continue;
}

2. trimMessages will break Anthropic role alternation (runner.ts:32-35)

I reproduced this locally: with 43 messages in a user/assistant/user/assistant sequence, after trimming, msg[0] (user) is followed by msg[4] (also user) — adjacent same-role messages. Anthropic's API rejects this. Needs to either trim in user/assistant pairs, or merge adjacent same-role messages after trimming. Would also benefit from unit tests since it's not exported or tested currently.

3. Plain object throw loses error message (env-check.ts:321)

throw { status: 401, message: '...' } is caught by the outer catch at line 333, which does error instanceof Error ? error.message : 'Unknown error'. Since a plain object isn't an Error instance, users see "Unknown error" instead of "Incorrect API key". This is a good candidate for the new ConfigError class from this same PR.

throw new ConfigError(typeof vErr.message === 'string' ? vErr.message : 'Incorrect API key');

Worth fixing in this PR

4. buildRubricScore still uses || 0 (analyzer.ts:401/406/411)

Same falsy-zero bug class that this branch fixes for overallScore and decisionQuality. reconQuality?.score || 0, techniqueSelection?.score || 0, adaptability?.score || 0 should be ?? 0. Currently the fallback is also 0 so the behavior is the same, but it's inconsistent with the fix pattern in the rest of the PR and will bite if the fallback ever changes.

5. ToolInputSchema accepts empty strings (schemas.ts:80)

z.string() accepts "". An empty command passes validation and sends empty stdin to docker exec -i container bash, which hangs for the full 60s timeout. z.string().min(1) would catch this.

6. Docker retry doesn't distinguish transient from permanent errors (runner.ts:189-215)

executeCommand retries all failures 3 times with sleep backoff. A genuine command syntax error (exit code 1, predictable stderr) will waste ~3s before returning the same error. Might be worth checking if the error is Docker-connectivity related before retrying.

7. env-check.ts error extraction could reuse getErrorStatus() (env-check.ts:278/303/334)

Three verbose inline error-status extraction blocks duplicate what getErrorStatus() from retry.ts already does. Would cut ~15 lines of dense type-casting boilerplate.

Overall this is a strong PR. The three blocking items are straightforward fixes — happy to re-review once those are addressed.

Blocking: - runner: push assistant + error tool_result on invalid input before continue, preventing Claude API role-alternation violation - runner: export trimMessages with generic constraint; use while-loop to drop all consecutive same-role messages at trim boundary - env-check: replace plain object throw with ConfigError so getErrorStatus() routes to the 401 branch correctly Worth fixing: - schemas: reject empty command strings (z.string().min(1)) - analyzer: use ?? instead of || for rubric scores so 0 isn't falsy - env-check: reuse getErrorStatus() in all 3 catch blocks, remove unused fs/path imports - runner: add isDockerTransientError() to skip retries on permanent failures, remove unused DockerError import Tests: 7 new trimMessages tests (397 total, all passing)

Treelovah · 2026-02-27T20:29:44Z

All 7 issues addressed in your commit (3534f2f). Ran 4 full benchmarks post-fix — Sonnet 4.5, Gemini 2.5 Pro, Grok 4, and o3 against proxy-auth-bypass. Runner is solid across all providers, trimMessages survives well past the old iteration 21 crash point, no API errors. 397 tests passing. Ready for your re-approval so we can merge.

Treelovah requested a review from pi3-code February 27, 2026 17:19

pi3-code reviewed Feb 27, 2026

View reviewed changes

pi3-code approved these changes Feb 27, 2026

View reviewed changes

pi3-code merged commit 237aa6d into main Feb 27, 2026
4 checks passed

Treelovah deleted the audit/production-hardening branch March 2, 2026 18:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

audit: patch the holes you could drive a bus through#60

audit: patch the holes you could drive a bus through#60
pi3-code merged 2 commits intomainfrom
audit/production-hardening

Treelovah commented Feb 27, 2026

Uh oh!

pi3-code left a comment

Uh oh!

Treelovah commented Feb 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Treelovah commented Feb 27, 2026

Summary

Files

Test plan

Uh oh!

pi3-code left a comment

Choose a reason for hiding this comment

Blocking — will fail in production

Worth fixing in this PR

Uh oh!

Treelovah commented Feb 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants