Commit f1abc72

Add publishable benchmark goal prompt
1 parent f073bc2 commit f1abc72

1 file changed: goal.md (214 additions, 0 deletions)

# Goal Prompt: Publishable ContextBench Benchmark

You are in `C:\Users\bitaz\Repos\codebase-context`. Read `AGENTS.md` first and obey it, then load memory (`npx codebase-context memory list` or MCP memory if available). The user wants a ContextBench benchmark worth publishing, not another proof-of-life run and not a biased scoreboard.

## Current Proven Baseline

A real five-lane scoreable proof run exists:

- Workflow run: `25644249279`
- Job: `75269988582`
- Artifact: `contextbench-five-lane-score`
- Artifact ID: `6908320478`
- Digest: `sha256:249a5df885bb5b4486b6ae2de6568867b4db97b74e4b46b9ea694f5a434dfa44`
- Task: `SWE-Bench-Pro__go__maintenance__bugfix__4df06349`
- Scoreable lanes: `raw-native`, `codebase-context`, `codebase-memory-mcp`, `grepai`, `ripgrep-lexical`
- Excluded lane: `CodeGraphContext`, because it indexed but returned zero task-relevant candidates via the supported CLI. Do not count it unless a new readiness gate proves real candidates and official scoring.

This baseline proves the harness works. It is not yet publishable because it covers only one task and cost/token instrumentation is incomplete for some lanes.

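Confirming the proof-run artifact can be done by recomputing its digest locally. The sketch below is a minimal illustration, assuming the recorded digest covers the downloaded artifact file as a whole; the function names are hypothetical and not part of the existing harness.

```python
import hashlib
from pathlib import Path

# Digest recorded for the proof-run artifact in the baseline list.
EXPECTED_DIGEST = (
    "sha256:249a5df885bb5b4486b6ae2de6568867b4db97b74e4b46b9ea694f5a434dfa44"
)


def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the file's digest in the `sha256:<hex>` form used by the baseline."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return f"sha256:{hasher.hexdigest()}"


def artifact_matches(path: Path, expected: str = EXPECTED_DIGEST) -> bool:
    """True only when the local file hashes to the recorded digest."""
    return sha256_digest(path) == expected
```

A mismatch should be treated as a blocker for the "confirm the proof run" step, not silently ignored.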
## Objective

Turn the existing proof run into a publishable, reproducible, bias-resistant benchmark that compares context providers on quality, latency, token cost, setup/index/query cost, reliability, and infrastructure burden.

The benchmark must answer:

1. Which lane retrieves the most useful context according to the official evaluator?
2. How much time, setup, indexing, query work, and token budget does each lane consume?
3. How reliable is each lane across tasks and repositories?
4. Which results are statistically meaningful, and which are only anecdotal?
5. What failed, why, and what was excluded by pre-registered rules?

## Non-Negotiable Integrity Rules

- Use the official ContextBench evaluator for quality rows.
- Never count `setup_failed`, `index_failed`, `tool_error`, `empty_prediction`, or `judge_failed` as benchmark quality results.
- Report failed rows separately under reliability and setup/integration cost.
- Freeze the benchmark protocol before the main run: task set, competitors, budgets, metrics, failure policy, seeds, and analysis plan.
- Do not tune lane prompts, candidate caps, selector prompts, or scoring logic on the main benchmark set.
- Do not inspect gold patches/diffs while selecting predictions.
- Do not post-hoc remove bad results. Exclusions must follow the frozen protocol.
- If a lane cannot pass readiness, either fix the integration or exclude it with exact evidence. Do not invent a result.
- Keep setup/index/query cost separate from quality metrics.
- Do not overclaim in public text. If this remains a pilot, call it a pilot.

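The rule that failed rows never count as quality results can be enforced mechanically. The following is a minimal sketch, assuming each lane/task row is a dict with a `status` field and that successfully scored rows carry the value `"scored"`; both the field name and that value are hypothetical, while the failure statuses are the ones listed in the rules.

```python
from typing import Iterable

# Failure statuses named in the integrity rules; none may enter quality tables.
NON_SCOREABLE_STATUSES = frozenset(
    {"setup_failed", "index_failed", "tool_error", "empty_prediction", "judge_failed"}
)

# Hypothetical status for a row the official evaluator actually scored.
SCORED_STATUS = "scored"


def partition_rows(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into (scoreable, failed); reject statuses the protocol does not know."""
    scoreable: list[dict] = []
    failed: list[dict] = []
    for row in rows:
        status = row["status"]
        if status == SCORED_STATUS:
            scoreable.append(row)
        elif status in NON_SCOREABLE_STATUSES:
            failed.append(row)
        else:
            # Unknown statuses are an error, so nothing slips into quality by default.
            raise ValueError(f"unknown row status: {status!r}")
    return scoreable, failed
```

Raising on unknown statuses is a deliberate choice here: a row is only a quality result when it is positively identified as scored.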
## Bias Controls

Before running the main benchmark:

1. Pre-register the protocol in a machine-readable artifact, preferably `benchmark-protocol.json` or equivalent existing benchmark config. Avoid unnecessary markdown sprawl.
2. Select tasks using a deterministic seed before looking at lane outputs.
3. Use the same frozen task set for every lane.
4. Stratify tasks by repository/language/domain when possible. If limited to Go/SWE-Bench-Pro, say that explicitly.
5. Keep a pilot set separate from the main set. Use the pilot only to debug infrastructure and instrumentation.
6. Freeze selector prompts after the pilot.
7. For LLM-based selection, feed only the problem statement and lane candidate pack. Never include gold files, gold spans, patches, or evaluator output.
8. Use the same selector model for all lanes unless the lane itself is being evaluated as an LLM system. Current required selector model: `gpt-5.4-mini-high`.
9. Preserve every artifact: readiness reports, candidate packs, predictions, official scores, timing logs, token logs, and failure logs.
10. Run an independent bias audit at the end: check leakage, cherry-picking, missing rows, prompt tuning, task selection, and inconsistent budgets.

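Seeded task selection and protocol freezing can both be made checkable. The sketch below is one possible approach, not the harness's existing code: `select_tasks` draws a reproducible sample that does not depend on input order, and `protocol_fingerprint` hashes a canonical JSON form of the protocol so later drift is detectable. The protocol field names in the usage note are assumptions.

```python
import hashlib
import json
import random
from typing import Iterable


def select_tasks(task_ids: Iterable[str], sample_size: int, seed: int) -> list[str]:
    """Draw a reproducible task sample before any lane output is inspected."""
    ordered = sorted(set(task_ids))  # sort so input order cannot change the draw
    if sample_size > len(ordered):
        raise ValueError("sample_size exceeds the number of available tasks")
    return sorted(random.Random(seed).sample(ordered, sample_size))


def protocol_fingerprint(protocol: dict) -> str:
    """Hash a canonical JSON encoding so any later protocol edit changes the digest."""
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

For example, a protocol dict such as `{"taskSelectionSeed": 1234, "selectorModel": "gpt-5.4-mini-high", "tasks": select_tasks(all_ids, 30, 1234)}` could be fingerprinted once at freeze time and the digest stored alongside the results; the seed value and key names here are illustrative only.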
## Required Metrics

Quality metrics from official evaluator:

- File coverage and precision
- Symbol coverage and precision
- Span coverage and precision
- Line coverage and precision
- Editloc recall and precision
- Per-task score rows, not only aggregate means

Cost and efficiency metrics per lane and per task:

- Setup duration
- Install/download duration if measurable
- Index duration
- Query duration
- Selector duration
- Official evaluator duration
- Total wall-clock duration
- Candidate count
- Candidate bytes/chars
- Candidate estimated tokens
- Prediction file count
- Prediction span count
- Peak memory if feasible in CI
- Disk footprint/download size if feasible
- Infra mode: local/no-infra, local-with-service, CI, Docker, external API, etc.

Token metrics per lane and per task:

- Prompt/input tokens
- Completion/output tokens
- Cached input tokens if available
- Reasoning tokens if available
- Total billed tokens if available
- Estimated candidate tokens using a tokenizer when provider usage is unavailable
- Mark tokens as `null` when truly unavailable. Do not fabricate.

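One way to keep measured and estimated token counts apart is sketched below. It is illustrative only: the record field names are assumptions, and `count_tokens` stands for whichever real tokenizer the harness adopts. Provider fields stay `None` (serialized as `null`) when usage is unavailable, and the estimate is stored under its own key so it is never mistaken for billed usage.

```python
from typing import Callable, Optional

# Hypothetical field names for provider-reported usage.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cached_input_tokens",
    "reasoning_tokens",
    "total_tokens",
)


def token_record(
    provider_usage: Optional[dict],
    prompt_text: str,
    count_tokens: Callable[[str], int],
) -> dict:
    """Build a token row: provider numbers when present, None otherwise, never guessed."""
    record: dict = {name: None for name in USAGE_FIELDS}
    if provider_usage is None:
        record["measurementUnavailableReason"] = "provider usage not exposed"
    else:
        for name in USAGE_FIELDS:
            record[name] = provider_usage.get(name)  # stays None if the provider omits it
        record["measurementUnavailableReason"] = None
    # The deterministic estimate lives in its own field, separate from billed usage.
    record["estimated_prompt_tokens"] = count_tokens(prompt_text)
    return record


if __name__ == "__main__":
    # Whitespace splitting is only a stand-in here, not a real tokenizer.
    print(token_record(None, "find the failing handler", lambda text: len(text.split())))
```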
Reliability metrics:

- Readiness pass/fail rate
- Setup failure rate
- Index failure rate
- Tool call failure rate
- Empty prediction rate
- Judge failure rate
- Retry count
- Timeout count

Statistical reporting:

- Mean, median, standard deviation, and bootstrap 95% confidence intervals per lane
- Paired per-task comparisons against `codebase-context` and simple baselines
- Effect sizes, not just p-values
- Failure-inclusive reliability tables separate from scoreable-only quality tables

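The paired comparison with a bootstrap interval can be sketched as follows. This is a minimal percentile-bootstrap example using only the standard library, assuming both lanes have scoreable rows for the same tasks in the same order; the function name and defaults are hypothetical, and the frozen analysis plan may specify a different resampling scheme.

```python
import random
import statistics
from typing import Sequence


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean paired difference, lower bound, upper bound) for lane A minus lane B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equal-length, non-empty per-task score lists")
    # Per-task differences keep the pairing; tasks are resampled, not individual scores.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * resamples)]
    upper = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return statistics.fmean(diffs), lower, upper
```

The mean paired difference doubles as an unstandardized effect size. An interval that spans zero is a signal to report the comparison as inconclusive, not as a win.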
## Competitor Policy

Start with the five scoreable lanes from the proof run:

- `raw-native`
- `codebase-context`
- `codebase-memory-mcp`
- `grepai`
- `ripgrep-lexical`

Treat `CodeGraphContext` as an attempted competitor, not a quality row, until readiness proves:

- setup/index works,
- lane tool is callable,
- candidate files/spans are non-empty,
- selected predictions are non-empty,
- official evaluator scores the row.

If adding more competitors, pre-register them before the main run. Do not add competitors after seeing main results unless clearly labeled as exploratory.

## Execution Plan For The Agent

1. Restore context.
    - Read `AGENTS.md`.
    - Load memory.
    - Inspect existing benchmark scripts/workflows/artifacts.
    - Confirm the proof run and artifact above.

2. Audit the current harness.
    - Locate readiness scripts, score scripts, selection files, workflows, and evaluator integration.
    - Identify all current `n/a` timing/token fields.
    - Replace `n/a` with measured values where possible, or explicit `null` plus `measurementUnavailableReason`.

3. Add a unified metrics schema.
    - Every lane/task row should emit one JSON object with quality, time, token, setup, infra, reliability, and provenance fields.
    - Add schema validation so incomplete metrics fail fast unless explicitly marked unavailable.

4. Add time instrumentation.
    - Measure setup/install, index, query, selector, evaluator, and total durations separately.
    - Use wall-clock timers around actual commands, not inferred estimates.
    - Preserve raw command logs and summarized JSON.

5. Add token instrumentation.
    - For OpenAI/API calls, capture usage metadata directly: input, output, cached, reasoning, total where available.
    - For Codex/subagent flows where direct API usage is unavailable, record provider usage as `null` and add deterministic estimated tokens for candidate/prompt text using a tokenizer.
    - For non-LLM lanes, record LLM tokens as zero only if no LLM was used; otherwise record the selector usage separately.

6. Pre-register the benchmark protocol.
    - Define sample size, task-selection seed, task filters, lane list, budgets, timeouts, retry policy, and exclusion policy.
    - Recommended sequence: pilot 5 tasks, then main 30+ tasks if CI budget allows. If budget is constrained, justify the smaller sample as a pilot.
    - Commit/configure the frozen protocol before running main results.

7. Run readiness gates.
    - Every required lane must pass readiness on the frozen task set or be excluded by protocol.
    - Readiness must prove setup/index, tool callability, non-empty candidates, non-empty predictions, and official scoring on at least a readiness slice.

8. Run the pilot.
    - Use the pilot only to debug integration and metrics collection.
    - Do not use pilot outcomes to tune the main task set or overfit prompts.

9. Freeze and run the main benchmark.
    - Run all lanes on the same task set.
    - Enforce same budgets where comparable: candidate caps, token caps, timeout caps, selector model, and output format.
    - Store all artifacts with digests.

10. Analyze results.
    - Produce scoreable-only quality tables.
    - Produce all-attempt reliability/failure tables.
    - Produce setup/index/query/time/token cost tables.
    - Produce paired statistical comparisons and confidence intervals.
    - Include failure analysis and limitations.

11. Audit for bias.
    - Use a fresh review pass to check protocol drift, gold leakage, cherry-picking, inconsistent budgets, hidden failures, and post-hoc exclusions.
    - If any integrity violation is found, stop and fix before publishing.

12. Publishable output.
    - Produce a concise technical report with method, task set, lanes, metrics, results, costs, failures, limitations, and reproduction commands.
    - Produce a human-readable summary for LinkedIn/Reddit only after the technical report is evidence-backed.
    - Do not claim general superiority beyond the measured sample.

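Steps 3 and 4 of the plan (schema validation and wall-clock timing) can be sketched together. This is a hedged illustration, not the harness's actual schema: the required field names are hypothetical, and `measurementUnavailableReason` is assumed here to be a mapping from field name to reason.

```python
import subprocess
import sys
import time

# Hypothetical required fields for one lane/task metrics row.
REQUIRED_FIELDS = ("lane", "taskId", "status", "indexSeconds", "querySeconds", "inputTokens")


def timed_run(command: list[str]) -> dict:
    """Run a real command and measure its wall-clock duration directly."""
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    duration = time.perf_counter() - start
    return {
        "command": command,
        "durationSeconds": duration,
        "exitCode": completed.returncode,
        "stdout": completed.stdout,  # raw log preserved next to the summary
        "stderr": completed.stderr,
    }


def validate_row(row: dict) -> None:
    """Fail fast on missing fields, or on null values that carry no stated reason."""
    reasons = row.get("measurementUnavailableReason") or {}
    for field in REQUIRED_FIELDS:
        if field not in row:
            raise ValueError(f"missing required field: {field}")
        if row[field] is None and not reasons.get(field):
            raise ValueError(f"{field} is null without a measurementUnavailableReason")


if __name__ == "__main__":
    result = timed_run([sys.executable, "-c", "print('indexed')"])
    print(f"{result['durationSeconds']:.3f}s exit={result['exitCode']}")
```

Timing the subprocess call itself, and not parsing durations out of tool output, matches the plan's requirement to measure actual commands instead of inferring estimates.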
## Acceptance Criteria

The task is complete only when:

- A frozen benchmark protocol exists.
- A multi-task benchmark run completes or a real blocker is documented.
- Every published quality number comes from official evaluator scoreable rows.
- Time and token metrics are present per lane/task or explicitly marked unavailable with reason.
- Setup/index/query costs are separate from quality scores.
- Reliability/failure tables include all attempted rows.
- Artifacts include enough provenance to reproduce the results.
- A bias audit finds no unresolved leakage, cherry-picking, or inconsistent-budget issue.
- The final report clearly distinguishes pilot evidence from publishable claims.

## Final Instruction

Do not optimize for a pretty number. Optimize for trust. If the retrieval is bad, say it is bad. If a lane is expensive, say it is expensive. If the sample is too small, call it a pilot. The benchmark is only worth publishing if a skeptical reader can reproduce it and see exactly where every number came from.
