audit: patch the holes you could drive a bus through#60
Conversation
Turns out if you squint at the codebase long enough, it squints back
with `any` types and copy-pasted shell escaping. Time to fix that
before it fixes us.
The crown jewel: `overallScore || fallback()`. Looks fine until you
remember that 0 is falsy in JavaScript. So every run where the LLM
returned a perfectly legitimate score of zero got its result quietly
tossed in the garbage and replaced with a fallback average. Two
characters — `||` to `??` — and suddenly months of KSM scores make
sense again. You could drive a truck through this bug and nobody
would notice because the fallback was "close enough." decisionQuality
had the same problem. Wonderful.
The `any` types were doing that thing where they make TypeScript
feel like JavaScript with extra steps. LLM responses parsed with
JSON.parse() and immediately trusted via `as { command: string }` —
because why would an LLM ever return something unexpected, right?
Every catch block was `catch (error: any)` followed by a prayer.
Replaced it all with Zod schemas at trust boundaries and actual type
guards for error handling, like civilized people.
Three files had their own artisanal copy of shellEscape(). Two files
had their own provider normalization with identical alias maps. Two
regex patterns validated run IDs independently. A DEFAULT_PROVIDER_URLS
map existed purely to disagree with the PROVIDERS registry it was
supposed to mirror. Consolidated everything — one function, one place,
one truth.
Also: error class hierarchy (so you can tell "config broke" from
"docker broke" without reading tea leaves), 15+ magic numbers
extracted into named constants, a sliding window on message arrays
so 50-iteration runs don't eat your RAM for lunch, and Docker exec
retry with backoff because sometimes the daemon just needs a moment.
27 new tests, 390 total, all green. The shellEscape tests include
injection attempts because trust is earned.
pi3-code
left a comment
There was a problem hiding this comment.
Really solid hardening pass. The falsy-zero fix alone is worth the PR — that's been silently corrupting scores for a while. The Zod schemas at trust boundaries, shellEscape dedup, error hierarchy, and constants extraction are all things this codebase needed. The 27 new tests are substantive (malformed LLM output, shell injection attempts), not filler. Nice work.
I pulled the branch locally, built it, and wrote repro tests for the issues below. Three of them will cause runtime failures, so I'm requesting changes on those. Everything else is a suggestion or a nit.
Blocking — will fail in production
1. Claude agent skips tool_result on invalid tool input (runner.ts:455-459)
When ToolInputSchema.safeParse fails, the code does continue without pushing a tool_result message. The Anthropic Messages API requires a tool_result for every tool_use block — the next API call will reject with a validation error. The OpenAI path (line ~688) already handles this correctly by sending an error message back to the model. The Claude path needs the same pattern.
Suggested fix:
if (!toolInput.success) {
messages.push({
role: 'user',
content: [{
type: 'tool_result',
tool_use_id: block.id,
content: `Error: invalid tool input: ${JSON.stringify(block.input)}`,
is_error: true,
}],
});
continue;
}2. trimMessages will break Anthropic role alternation (runner.ts:32-35)
I reproduced this locally: with 43 messages in a user/assistant/user/assistant sequence, after trimming, msg[0] (user) is followed by msg[4] (also user) — adjacent same-role messages. Anthropic's API rejects this. Needs to either trim in user/assistant pairs, or merge adjacent same-role messages after trimming. Would also benefit from unit tests since it's not exported or tested currently.
3. Plain object throw loses error message (env-check.ts:321)
throw { status: 401, message: '...' } is caught by the outer catch at line 333, which does error instanceof Error ? error.message : 'Unknown error'. Since a plain object isn't an Error instance, users see "Unknown error" instead of "Incorrect API key". This is a good candidate for the new ConfigError class from this same PR.
throw new ConfigError(typeof vErr.message === 'string' ? vErr.message : 'Incorrect API key');Worth fixing in this PR
4. buildRubricScore still uses || 0 (analyzer.ts:401/406/411)
Same falsy-zero bug class that this branch fixes for overallScore and decisionQuality. reconQuality?.score || 0, techniqueSelection?.score || 0, adaptability?.score || 0 should be ?? 0. Currently the fallback is also 0 so the behavior is the same, but it's inconsistent with the fix pattern in the rest of the PR and will bite if the fallback ever changes.
5. ToolInputSchema accepts empty strings (schemas.ts:80)
z.string() accepts "". An empty command passes validation and sends empty stdin to docker exec -i container bash, which hangs for the full 60s timeout. z.string().min(1) would catch this.
6. Docker retry doesn't distinguish transient from permanent errors (runner.ts:189-215)
executeCommand retries all failures 3 times with sleep backoff. A genuine command syntax error (exit code 1, predictable stderr) will waste ~3s before returning the same error. Might be worth checking if the error is Docker-connectivity related before retrying.
7. env-check.ts error extraction could reuse getErrorStatus() (env-check.ts:278/303/334)
Three verbose inline error-status extraction blocks duplicate what getErrorStatus() from retry.ts already does. Would cut ~15 lines of dense type-casting boilerplate.
Overall this is a strong PR. The three blocking items are straightforward fixes — happy to re-review once those are addressed.
Blocking: - runner: push assistant + error tool_result on invalid input before continue, preventing Claude API role-alternation violation - runner: export trimMessages with generic constraint; use while-loop to drop all consecutive same-role messages at trim boundary - env-check: replace plain object throw with ConfigError so getErrorStatus() routes to the 401 branch correctly Worth fixing: - schemas: reject empty command strings (z.string().min(1)) - analyzer: use ?? instead of || for rubric scores so 0 isn't falsy - env-check: reuse getErrorStatus() in all 3 catch blocks, remove unused fs/path imports - runner: add isDockerTransientError() to skip retries on permanent failures, remove unused DockerError import Tests: 7 new trimMessages tests (397 total, all passing)
|
All 7 issues addressed in your commit (3534f2f). Ran 4 full benchmarks post-fix — Sonnet 4.5, Gemini 2.5 Pro, Grok 4, and o3 against proxy-auth-bypass. Runner is solid across all providers, trimMessages survives well past the old iteration 21 crash point, no API errors. 397 tests passing. Ready for your re-approval so we can merge. |
Summary
Production hardening sweep — the kind of cleanup you do when you realize the codebase has been getting away with things.
The falsy-zero bug:
overallScore || fallback()treats0as falsy. Every run where the LLM returned a legitimate score of zero got silently replaced with a fallback average. Two characters (||→??), months of quiet data corruption fixed. Same fordecisionQuality.Type safety: Replaced every
anytype and unsafeascast at trust boundaries with Zod validation schemas and proper type guards. LLM responses, tool inputs, error objects — all of it.catch (error: any)is gone.Deduplication: shellEscape (3 copies → 1), provider normalization (2 identical alias maps → 1), run-ID validation (2 regexes → 1), provider URLs (redundant map deleted, reads from PROVIDERS registry directly).
New infrastructure: Error class hierarchy (
OasisError→ConfigError/DockerError/ValidationError/AnalysisError), named constants replacing 15+ magic numbers, message array sliding window (bounded memory), Docker exec retry with backoff.27 new tests (363 → 390): malformed LLM response parsing, shell escape injection attempts, error extraction edge cases.
Files
src/lib/schemas.tssrc/lib/errors.tssrc/lib/shell.tssrc/lib/constants.tstests/unit/shell.test.tsTest plan
npm run build— cleannpm test— 390/390 passingnpx oasis --help,npx oasis challengescc @pi3-code