fix: robust analysis parsing and same-provider model resolution#61

Merged
Treelovah merged 4 commits into main from fix/analyzer-provider-resolution on Mar 2, 2026
Conversation


@pi3-code pi3-code commented Feb 27, 2026

Problem

Analysis consistently fails with `could not parse LLM response (truncated or malformed JSON)` across multiple providers. Three related issues:

  1. JSON extraction is fragile — the parser only strips ```json fences. When models (especially Gemini) wrap JSON in explanation text or append trailing content, JSON.parse fails on the raw string.

  2. MAX_COMPLETION_TOKENS too low — set to 4096, which truncates complex analyses (many steps, detailed rubric evaluations with per-milestone reasoning). Truncated JSON → parse failure → zero scores.

  3. Analyzer model resolution ignores the user's provider — parseAnalysisResponse hardcodes analyzerModel: DEFAULT_ANALYZER_MODEL (Claude Sonnet), so results always display claude-sonnet-4-5-20250929 even when analysis actually ran on Gemini or Grok. Additionally, when the benchmark and analysis use the same provider, image/embedding models (e.g. grok-imagine-image, text-embedding-3-large) are returned as the analyzer model, causing 400 errors.

Changes

src/lib/analyzer.ts

  • Add extractJson() function (exported) — brace-matching scanner that handles markdown fences, explanation text wrapping, escaped quotes, nested braces, and trailing content. Always runs the scanner (no early-return bypass).
  • Thread analyzerModel through parseAnalysisResponse — display the actual model used instead of hardcoded default.
  • Anchored blocklist for non-text models — explicit NON_TEXT_MODEL_PATTERNS array with anchored regexes (/^grok-imagine-/i, /^dall-e-/i, etc.) to avoid false positives on legitimate text models like gpt-5-image-understanding.
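The extractor described above can be sketched roughly as follows. The function name and return shape match the PR description, but the body is an illustration of the brace-matching approach, not the merged code in src/lib/analyzer.ts:

```typescript
// A minimal sketch of a brace-matching JSON extractor. The merged
// implementation in src/lib/analyzer.ts may differ in details.
export function extractJson(text: string): string | null {
  const start = text.indexOf("{");
  if (start === -1) return null; // no JSON object at all

  let depth = 0;
  let inString = false;
  let escaped = false;

  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Inside a string: track escapes so \" does not end the string,
      // and braces inside string values are ignored.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      // First complete top-level object wins; trailing content is dropped.
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // unclosed / truncated JSON
}
```

Because the scanner keys only on braces, markdown fences and explanation text before or after the object need no special handling, which is why no early-return bypass is needed.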

src/commands/run.ts

  • Fix API key fallthrough — only use the benchmark provider's key as fallback for the analyzer when the providers match. Prevents passing e.g. a Google key to the Anthropic SDK when --analyzer-provider differs.
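The fallback rule can be illustrated with a small helper. The function and parameter names here are hypothetical; the actual run.ts logic is structured differently:

```typescript
// Hypothetical helper illustrating the key-fallback guard; the real
// run.ts code and its identifiers differ.
function resolveAnalyzerKey(
  benchmarkProvider: string,
  analyzerProvider: string,
  benchmarkKey?: string,
  analyzerKey?: string,
): string | undefined {
  // An explicit analyzer key always wins.
  if (analyzerKey) return analyzerKey;
  // Fall back to the benchmark key only when the providers match;
  // otherwise a Google key could be handed to the Anthropic SDK.
  return benchmarkProvider === analyzerProvider ? benchmarkKey : undefined;
}
```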

src/lib/export.ts

  • Skip export when no analysis — instead of showing export options without scores, guide the user to run `oasis analyze` then `oasis report`.
  • Reorder export menu — "Done" becomes default choice after the first export action.

src/lib/report.ts

  • Fix dead URL — replace oasis.kryptsec.com with github.com/KryptSec/oasis in share cards and HTML reports.

tests/unit/analyzer.test.ts

  • 10 direct extractJson tests: clean passthrough, leading/trailing text, markdown fences (including mid-string), nested objects, braces inside strings, unclosed JSON, multiple objects, no-JSON input
  • Non-text model filter tests (image, embedding, tts models) + false-positive guard (gpt-5-image-understanding is NOT filtered)
  • Same-provider tests verifying benchmark model is preserved

Token cost note

MAX_COMPLETION_TOKENS stays at 4096 (original value). The truncation failures were caused by models wrapping JSON in explanation text — extractJson() solves this by scanning for braces regardless of preamble length. No token increase needed.

Reproduction

```shell
# Before fix — fails consistently
oasis run -c jwt-forgery -m models/gemini-3-pro-preview -p google --analyze
# → "could not parse LLM response (truncated or malformed JSON)"

# After fix — works
oasis run -c jwt-forgery -m models/gemini-3-pro-preview -p google --analyze
# → Analysis complete, KSM 76.6, Model: models/gemini-3-pro-preview
```

Test plan

  • npm run build — clean compilation
  • npx vitest run — 411/411 tests pass (11 new)
  • Manual: benchmark + analysis with Google Gemini — analysis completes, correct model displayed
  • Manual: oasis analyze <id> -p google --reanalyze — consistent results across re-runs
  • Verify image model blocklist: oasis analyze <id> with a run that used grok-imagine-image falls back to grok-4-0709

- Skip export prompt entirely when no analysis available (API quota/rate limit),
  guide user to run `oasis analyze` then `oasis report` instead
- Reorder export menu so "Done" is default after first action
- Replace dead oasis.kryptsec.com URL with GitHub repo in reports
- Add robust JSON extraction (brace-matching) for LLM responses wrapped
  in explanation text or trailing content
- Increase MAX_COMPLETION_TOKENS from 4096 to 8192 to prevent truncation
- Use benchmark model for analysis instead of hardcoding claude-sonnet,
  with blocklist for non-text models (image/embed/tts)
- Display actual analyzer model in results instead of hardcoded default
- Prevent wrong API key passthrough when analyzer provider differs
@pi3-code pi3-code requested a review from Treelovah February 27, 2026 22:57

@Treelovah Treelovah left a comment


The core fixes here — extractJson(), threading analyzerModel, the API key fallthrough guard, image model filter — are solid work solving real problems. Three things need addressing before merge.

1. Export extractJson and test it directly.

This is a hand-rolled brace-matching JSON scanner and the most important addition in the PR. Right now it has zero direct tests — it's only exercised indirectly through parseAnalysisResponse, which also runs Zod parsing, default filling, and score computation. Too much noise to isolate a scanner regression.

Export it. Write 6-8 targeted cases: clean JSON passthrough, leading/trailing explanation text, markdown fences, nested objects, braces inside string values, unclosed JSON, no JSON at all. It's a pure function with deterministic I/O — exactly the kind of thing that benefits from unit tests. Don't add more parseAnalysisResponse cases though — the existing ones cover the integration path fine.

2. Drop the token bump or demonstrate the need.

I checked every saved analysis result in the repo. The largest JSON payload across all runs is ~2,959 tokens. Most are under 2,000. 4096 already gives over 1,100 tokens of headroom above our worst case.

The truncation failures were caused by models wrapping JSON in explanation text, inflating the total response past 4096. extractJson() already solves that — the scanner doesn't care how much preamble the model generates, it just finds the braces. Doubling MAX_COMPLETION_TOKENS to 8192 is solving a problem the parser fix already addressed, and it doubles worst-case analysis cost on a tool where users pay their own API bills.

Drop the change. If you disagree, post a result where the extracted JSON alone exceeds 4096 tokens — as of now this looks like an edge case that doesn't exist in practice.

3. Split unrelated changes into a separate PR.

The export.ts UX rework (skip export when no analysis, reorder Done choice) and the report.ts link updates are good changes but have nothing to do with analysis parsing or model resolution. Let's land these as separate PRs so we can focus review on the core fix without having to cherry-pick later.


@Treelovah Treelovah left a comment


Review — Request Changes

The fixes here are real and well-understood — the extractJson() scanner, the model resolution threading, and the API key fallthrough are all clear improvements. But this needs a few things before it's merge-ready.

1. extractJson() needs direct unit tests

This is the most important function in the PR and it's only tested indirectly through parseAnalysisResponse. For a parser, that's not sufficient. We need edge case coverage:

  • Trailing explanation text after valid JSON
  • Nested braces inside string values ({"key": "value with { braces }"})
  • Unclosed/truncated JSON
  • Bare JSON with no markdown fences
  • Multiple JSON objects in one response (should extract the first)
  • Fences not at start of string (e.g. "Here's the result:\n```json\n{...}\n```")

2. Non-text model regex is too loose

/imagine|image|embed|tts|whisper|dall-e/i matching on substring will false-positive the moment someone benchmarks a vision model with "image" in the name (e.g. gpt-5-image-understanding). Two options:

  • Preferred: Explicit blocklist array with anchored patterns (e.g. /^grok-imagine-/, /^dall-e-/, /^tts-/, /^whisper-/, /[-.]embed(ding)?[-.]/)
  • Alternative: Check model capabilities rather than pattern-matching names

At minimum, anchor the patterns more tightly so legitimate text models aren't caught.
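The preferred option could look like the sketch below. The exact pattern set is an assumption drawn from the examples above, not the list that was eventually merged:

```typescript
// Illustrative anchored blocklist — the merged NON_TEXT_MODEL_PATTERNS
// may contain a different set of patterns.
const NON_TEXT_MODEL_PATTERNS: RegExp[] = [
  /^grok-imagine-/i,
  /^dall-e-/i,
  /^tts-/i,
  /^whisper-/i,
  /[-.]embed(ding)?[-.]/i,
];

function isNonTextModel(model: string): boolean {
  return NON_TEXT_MODEL_PATTERNS.some((pattern) => pattern.test(model));
}
```

Anchoring at the start of the name (or on delimiters, for the embedding case) means gpt-5-image-understanding no longer matches, while grok-imagine-image and text-embedding-3-large still do.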

3. Split unrelated changes into a separate PR

The export.ts UX improvements (early return + choice reordering) and the report.ts link updates (oasis.kryptsec.com → github.com/KryptSec/oasis) are unrelated to analysis parsing. One concern per PR — we enforced this on ourselves during the hardening audit, same standard applies here.

4. Acknowledge the token cost tradeoff

MAX_COMPLETION_TOKENS 4096 → 8192 doubles worst-case spend per analysis. The change itself is justified, but add a one-liner to the PR description noting the cost tradeoff so users aren't surprised.


To be clear: the engineering here is solid. These are "make it merge-ready for a public repo" asks, not fundamental objections.

…t token bump

- Export extractJson and add 10 direct unit tests covering: clean
  passthrough, leading/trailing text, markdown fences, nested objects,
  braces in strings, unclosed JSON, multiple objects, no-JSON input
- Replace loose /imagine|image|embed|.../ regex with anchored
  NON_TEXT_MODEL_PATTERNS blocklist to prevent false positives on
  vision models like gpt-5-image-understanding
- Revert MAX_COMPLETION_TOKENS to 4096 — extractJson already strips
  preamble text that caused truncation, token bump is redundant
@pi3-code pi3-code requested a review from Treelovah March 2, 2026 16:55
The original PR swapped oasis.kryptsec.com for the GitHub repo URL in
share cards and HTML reports. That domain is live. Put it back.
@Treelovah Treelovah merged commit 3241c6c into main Mar 2, 2026
3 checks passed
@Treelovah Treelovah deleted the fix/analyzer-provider-resolution branch March 2, 2026 18:56