Add benchmark runner for populate workflow#101
Conversation
Records search/fetch/investigate call counts, token usage (input/output), and timing per populate run to a new Convex `populateRuns` table. Metrics are counted by scanning result.steps[].toolCalls after agent.generate() returns — no factory wrappers needed. The save is fire-and-forget in the finally block so it never blocks or fails the workflow. New files: - backend/src/mastra/run-metrics.ts — RunMetrics accumulator class - backend/src/mastra/save-run-metrics.ts — writes completed run to Convex - frontend/convex/populateRuns.ts — insert mutation + list/get queries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move buildPopulateAgent into the try block so any construction error is caught and the finally block (metrics save) always executes - Add status/error consistency guard to populateRuns.insert so a "success" record can't carry an error string and an "error" record can't be written without a message Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces a complete benchmarking framework for the populate workflow. It adds internal Convex mutations to create and delete datasets without authentication, a TypeScript benchmark runner that orchestrates workflow invocation with configurable concurrency, and infrastructure to collect and aggregate metrics into structured JSON reports. Four benchmark datasets are predefined, comprehensive documentation covers usage and metric interpretation, and a Make target integrates the runner into the development environment. Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Renames the Convex table and module that stores per-run metrics from populateRuns to runStats for a cleaner, more descriptive name. - frontend/convex/populateRuns.ts → runStats.ts - schema.ts: table definition populateRuns → runStats - save-run-metrics.ts: internal.populateRuns.insert → internal.runStats.insert - run-metrics.ts: update comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1d85d2c to
f95611c
Compare
Two bugs in the Mastra FullOutput step scanning:
1. Tool name is nested at tc.payload.toolName, not tc.toolName
Mastra's ToolCallChunk wraps everything in a payload object
(type: 'tool-call', payload: { toolName, toolCallId, args }).
The scan now reads tc.payload?.toolName with tc.toolName as fallback.
2. result.usage is last-step tokens only; result.totalUsage is the sum
across all steps (per Mastra FullOutput type definition). Switch
addOrchestratorResult/addInvestigateResult to prefer totalUsage so
multi-step runs aren't undercounted.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
f95611c to
54d1251
Compare
Per-step toolCalls snapshots are captured at step-finish time. Tool-call chunks that arrive after their step's step-finish event (e.g. non-streaming tool-call events that follow a streaming-end synthetic) land in the next step's buffer, causing systematic miscounts. result.toolCalls is the flat accumulated list maintained by Mastra's stream processor as chunks arrive — it is the authoritative source. Using it directly avoids the step-ordering sensitivity entirely. Also adds toolCalls length to the [run_subagent] done log line for easier verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a benchmark suite that exercises the real populateWorkflow directly (no HTTP layer) so metrics reflect actual agent performance. Results are stored in the same populateRuns Convex table as live app sessions, tagged with isBenchmark: true for filtering. - benchmarks/run.mts: runner that creates temporary Convex datasets, calls populateWorkflow, reads back metrics, and emits a JSON summary - benchmarks/prompts.json: 4 test prompts (YC companies, B2B SaaS, national parks, AI research labs) - benchmarks/README.md: instructions for running benchmarks and querying live run stats via the Convex dashboard or CLI - datasets.createInternal / deleteInternal: Convex mutations for benchmark dataset lifecycle (no Clerk auth required) - make benchmark target for convenient invocation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add isPrimaryKey: true to the natural PK column in each of the 4 benchmark prompts so the orchestrator gets explicit pk guidance and deduplication works correctly with the run_subagent tool's .refine() enforcement - Fix investigateCalls description: investigate_row → run_subagent - Add note that enumeration-step tokens are not captured in RunMetrics - Update custom-prompt example to show isPrimaryKey usage - Update cost table model name: Kimi K2 → DeepSeek V4 Pro Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follows the rename in feature/cost-tracking. Updates the Convex query call and all documentation references in run.mts and README.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
54d1251 to
a642945
Compare
|
|
✏️ Learnings added
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/README.md`:
- Around line 201-203: Add a language tag to the fenced log block containing
"[populate-agent] populate-agent start" so markdownlint MD040 stops flagging it;
change the block from ``` to ```text (i.e., replace the opening triple backticks
with ```text) to label the sample log as plain text.
In `@benchmarks/run.mts`:
- Around line 105-121: The parseArgs function currently accepts missing option
values and non-positive concurrency, which can cause hangs or silent failures;
update parseArgs to validate that options that require values (--prompt,
--prompts, --out, --concurrency) actually have a following argument (check i+1 <
argv.length before consuming argv[++i]) and throw or exit with a clear error if
missing, and after parsing ensure result.concurrency is a positive integer
(check Number.isInteger(concurrency) && concurrency > 0 and reject
NaN/0/negative values), also guard against parseInt producing NaN for
--concurrency and fail fast rather than allowing undefined/NaN to propagate.
Include references to parseArgs, promptIds, promptsFile, outDir, and concurrency
in the changes.
- Around line 237-285: In buildSummary, stop filtering out successful runs that
lack metrics: keep successCount based on status === "success" (e.g., compute
successful = results.filter(r => r.status === "success") and failed =
results.filter(r => r.status !== "success")), but compute totals only from runs
that actually have metrics (e.g., metricsAvailable = successful.filter(r =>
r.metrics) and use metricsAvailable in the reduce and for n for averages).
Update referenced symbols: buildSummary, successful, failed, totals, and
perRunAverages so promptCount === successCount + failureCount and averages are
computed from metricsAvailable (with a fallback n = metricsAvailable.length ||
1).
In `@frontend/convex/datasets.ts`:
- Around line 251-277: The createInternal internalMutation currently accepts
caller-supplied ownerId and visibility which lets internal callers mint
user-owned or public datasets; remove ownerId and visibility from the args
schema (delete ownerId and visibility from the args block) and in the handler
hard-code ownerId to "benchmark-runner" and visibility to "private" (keep other
fields and behavior unchanged) so benchmark datasets are always owned by the
fixed synthetic user and always private; reference createInternal and the
internalMutation args/handler when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b00257fa-79af-4a5d-8957-e442d8832290
📒 Files selected for processing (5)
benchmarks/README.mdbenchmarks/prompts.jsonbenchmarks/run.mtsfrontend/convex/datasets.tsmakefiles/Makefile
| ``` | ||
| [populate-agent] populate-agent start | ||
| ``` |
There was a problem hiding this comment.
Add a language tag to this fenced block.
markdownlint MD040 will keep failing here until this sample log block is labeled. text is enough.
Proposed fix
-```
+```text
[populate-agent] populate-agent start</details>
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>
[warning] 201-201: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @benchmarks/README.md around lines 201 - 203, Add a language tag to the
fenced log block containing "[populate-agent] populate-agent start" so
markdownlint MD040 stops flagging it; change the block from totext
(i.e., replace the opening triple backticks with ```text) to label the sample
log as plain text.
</details>
<!-- fingerprinting:phantom:poseidon:grasshopper -->
<!-- This is an auto-generated comment by CodeRabbit -->
| function parseArgs(argv: string[]) { | ||
| const result = { | ||
| promptIds: [] as string[], | ||
| promptsFile: join(dirname(fileURLToPath(import.meta.url)), "prompts.json"), | ||
| outDir: null as string | null, | ||
| cleanup: true, | ||
| concurrency: 1, | ||
| }; | ||
| for (let i = 0; i < argv.length; i++) { | ||
| const arg = argv[i]; | ||
| if (arg === "--prompt") result.promptIds.push(argv[++i]); | ||
| else if (arg === "--prompts") result.promptsFile = argv[++i]; | ||
| else if (arg === "--out") result.outDir = argv[++i]; | ||
| else if (arg === "--no-cleanup") result.cleanup = false; | ||
| else if (arg === "--concurrency") result.concurrency = parseInt(argv[++i], 10); | ||
| } | ||
| return result; |
There was a problem hiding this comment.
Reject missing option values and non-positive concurrency.
--concurrency 0 makes the loop at Line 347 never advance, and a missing value after --prompt, --prompts, --out, or --concurrency currently becomes undefined/NaN silently. Fail fast during parsing so the runner cannot hang or no-op on malformed CLI input.
Proposed fix
function parseArgs(argv: string[]) {
const result = {
promptIds: [] as string[],
promptsFile: join(dirname(fileURLToPath(import.meta.url)), "prompts.json"),
outDir: null as string | null,
cleanup: true,
concurrency: 1,
};
+ const requireValue = (flag: string, value: string | undefined) => {
+ if (!value || value.startsWith("--")) {
+ throw new Error(`${flag} requires a value`);
+ }
+ return value;
+ };
for (let i = 0; i < argv.length; i++) {
const arg = argv[i];
- if (arg === "--prompt") result.promptIds.push(argv[++i]);
- else if (arg === "--prompts") result.promptsFile = argv[++i];
- else if (arg === "--out") result.outDir = argv[++i];
+ if (arg === "--prompt") result.promptIds.push(requireValue(arg, argv[++i]));
+ else if (arg === "--prompts") result.promptsFile = requireValue(arg, argv[++i]);
+ else if (arg === "--out") result.outDir = requireValue(arg, argv[++i]);
else if (arg === "--no-cleanup") result.cleanup = false;
- else if (arg === "--concurrency") result.concurrency = parseInt(argv[++i], 10);
+ else if (arg === "--concurrency") {
+ const raw = requireValue(arg, argv[++i]);
+ const parsed = Number.parseInt(raw, 10);
+ if (!Number.isInteger(parsed) || parsed < 1) {
+ throw new Error(`--concurrency must be a positive integer, received "${raw}"`);
+ }
+ result.concurrency = parsed;
+ }
}
return result;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/run.mts` around lines 105 - 121, The parseArgs function currently
accepts missing option values and non-positive concurrency, which can cause
hangs or silent failures; update parseArgs to validate that options that require
values (--prompt, --prompts, --out, --concurrency) actually have a following
argument (check i+1 < argv.length before consuming argv[++i]) and throw or exit
with a clear error if missing, and after parsing ensure result.concurrency is a
positive integer (check Number.isInteger(concurrency) && concurrency > 0 and
reject NaN/0/negative values), also guard against parseInt producing NaN for
--concurrency and fail fast rather than allowing undefined/NaN to propagate.
Include references to parseArgs, promptIds, promptsFile, outDir, and concurrency
in the changes.
| function buildSummary(results: RunResult[]) { | ||
| const successful = results.filter((r) => r.status === "success" && r.metrics); | ||
| const failed = results.filter((r) => r.status !== "success"); | ||
|
|
||
| const totals = successful.reduce( | ||
| (acc, r) => { | ||
| const m = r.metrics!; | ||
| acc.rowsInserted += m.rowsInserted; | ||
| acc.searchCalls += m.searchCalls; | ||
| acc.fetchCalls += m.fetchCalls; | ||
| acc.investigateCalls += m.investigateCalls; | ||
| acc.tokensInput += m.tokensInput; | ||
| acc.tokensOutput += m.tokensOutput; | ||
| acc.orchestratorSteps += m.orchestratorSteps; | ||
| acc.investigateSteps += m.investigateSteps; | ||
| acc.investigateRuns += m.investigateRuns; | ||
| acc.durationMs += r.durationMs; | ||
| return acc; | ||
| }, | ||
| { | ||
| rowsInserted: 0, | ||
| searchCalls: 0, | ||
| fetchCalls: 0, | ||
| investigateCalls: 0, | ||
| tokensInput: 0, | ||
| tokensOutput: 0, | ||
| orchestratorSteps: 0, | ||
| investigateSteps: 0, | ||
| investigateRuns: 0, | ||
| durationMs: 0, | ||
| }, | ||
| ); | ||
|
|
||
| const n = successful.length || 1; | ||
|
|
||
| return { | ||
| completedAt: new Date().toISOString(), | ||
| promptCount: results.length, | ||
| successCount: successful.length, | ||
| failureCount: failed.length, | ||
| aggregate: totals, | ||
| perRunAverages: { | ||
| rowsInserted: +(totals.rowsInserted / n).toFixed(1), | ||
| searchCalls: +(totals.searchCalls / n).toFixed(1), | ||
| fetchCalls: +(totals.fetchCalls / n).toFixed(1), | ||
| investigateCalls: +(totals.investigateCalls / n).toFixed(1), | ||
| tokensTotal: +((totals.tokensInput + totals.tokensOutput) / n).toFixed(0), | ||
| durationSeconds: +((totals.durationMs / n / 1000).toFixed(1)), | ||
| }, |
There was a problem hiding this comment.
Count successful workflows separately from metrics availability.
A run can finish with status: "success" and still miss metrics if the async write lands after the fixed sleep in runPrompt. Right now those runs disappear from both successCount and failureCount, so promptCount can exceed the sum of both counters and the averages get underreported.
Proposed fix
function buildSummary(results: RunResult[]) {
- const successful = results.filter((r) => r.status === "success" && r.metrics);
+ const successful = results.filter((r) => r.status === "success");
+ const successfulWithMetrics = successful.filter((r) => r.metrics);
const failed = results.filter((r) => r.status !== "success");
- const totals = successful.reduce(
+ const totals = successfulWithMetrics.reduce(
(acc, r) => {
const m = r.metrics!;
acc.rowsInserted += m.rowsInserted;
@@
- const n = successful.length || 1;
+ const n = successfulWithMetrics.length || 1;
return {
completedAt: new Date().toISOString(),
promptCount: results.length,
- successCount: successful.length,
+ successCount: successful.length,
failureCount: failed.length,
+ missingMetricsCount: successful.length - successfulWithMetrics.length,
aggregate: totals,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/run.mts` around lines 237 - 285, In buildSummary, stop filtering
out successful runs that lack metrics: keep successCount based on status ===
"success" (e.g., compute successful = results.filter(r => r.status ===
"success") and failed = results.filter(r => r.status !== "success")), but
compute totals only from runs that actually have metrics (e.g., metricsAvailable
= successful.filter(r => r.metrics) and use metricsAvailable in the reduce and
for n for averages). Update referenced symbols: buildSummary, successful,
failed, totals, and perRunAverages so promptCount === successCount +
failureCount and averages are computed from metricsAvailable (with a fallback n
= metricsAvailable.length || 1).
| /** | ||
| * Create a dataset without Clerk auth. Used exclusively by the benchmark | ||
| * runner, which calls populateWorkflow directly (no HTTP layer) and | ||
| * therefore has no Clerk identity. The ownerId must be supplied by the | ||
| * caller; for benchmark runs pass "benchmark-runner". | ||
| */ | ||
| export const createInternal = internalMutation({ | ||
| args: { | ||
| name: v.string(), | ||
| description: v.string(), | ||
| cadence: v.string(), | ||
| columns: v.array(columnValidator), | ||
| ownerId: v.string(), | ||
| visibility: v.optional(v.union(v.literal("public"), v.literal("private"))), | ||
| }, | ||
| handler: async (ctx, args) => { | ||
| return await ctx.db.insert("datasets", { | ||
| name: args.name, | ||
| description: args.description, | ||
| cadence: args.cadence, | ||
| columns: args.columns, | ||
| ownerId: args.ownerId, | ||
| status: "paused", | ||
| visibility: args.visibility ?? "private", | ||
| rowCount: 0, | ||
| }); | ||
| }, |
There was a problem hiding this comment.
Lock benchmark datasets to a fixed synthetic owner.
This mutation is documented as benchmark-only, but it still accepts arbitrary ownerId and visibility. That makes it possible for any internal caller to mint user-owned or public datasets without going through the normal authz path. Since the current runner always uses benchmark-runner and private, those values should be fixed inside the mutation instead of being caller-controlled.
Proposed fix
export const createInternal = internalMutation({
args: {
name: v.string(),
description: v.string(),
cadence: v.string(),
columns: v.array(columnValidator),
- ownerId: v.string(),
- visibility: v.optional(v.union(v.literal("public"), v.literal("private"))),
},
handler: async (ctx, args) => {
return await ctx.db.insert("datasets", {
name: args.name,
description: args.description,
cadence: args.cadence,
columns: args.columns,
- ownerId: args.ownerId,
+ ownerId: "benchmark-runner",
status: "paused",
- visibility: args.visibility ?? "private",
+ visibility: "private",
rowCount: 0,
});
},
});As per coding guidelines, "ownerId field on datasets stores identity.subject (Clerk user ID). Do not pass ownerId from the client."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /** | |
| * Create a dataset without Clerk auth. Used exclusively by the benchmark | |
| * runner, which calls populateWorkflow directly (no HTTP layer) and | |
| * therefore has no Clerk identity. The ownerId must be supplied by the | |
| * caller; for benchmark runs pass "benchmark-runner". | |
| */ | |
| export const createInternal = internalMutation({ | |
| args: { | |
| name: v.string(), | |
| description: v.string(), | |
| cadence: v.string(), | |
| columns: v.array(columnValidator), | |
| ownerId: v.string(), | |
| visibility: v.optional(v.union(v.literal("public"), v.literal("private"))), | |
| }, | |
| handler: async (ctx, args) => { | |
| return await ctx.db.insert("datasets", { | |
| name: args.name, | |
| description: args.description, | |
| cadence: args.cadence, | |
| columns: args.columns, | |
| ownerId: args.ownerId, | |
| status: "paused", | |
| visibility: args.visibility ?? "private", | |
| rowCount: 0, | |
| }); | |
| }, | |
| /** | |
| * Create a dataset without Clerk auth. Used exclusively by the benchmark | |
| * runner, which calls populateWorkflow directly (no HTTP layer) and | |
| * therefore has no Clerk identity. The ownerId must be supplied by the | |
| * caller; for benchmark runs pass "benchmark-runner". | |
| */ | |
| export const createInternal = internalMutation({ | |
| args: { | |
| name: v.string(), | |
| description: v.string(), | |
| cadence: v.string(), | |
| columns: v.array(columnValidator), | |
| }, | |
| handler: async (ctx, args) => { | |
| return await ctx.db.insert("datasets", { | |
| name: args.name, | |
| description: args.description, | |
| cadence: args.cadence, | |
| columns: args.columns, | |
| ownerId: "benchmark-runner", | |
| status: "paused", | |
| visibility: "private", | |
| rowCount: 0, | |
| }); | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/convex/datasets.ts` around lines 251 - 277, The createInternal
internalMutation currently accepts caller-supplied ownerId and visibility which
lets internal callers mint user-owned or public datasets; remove ownerId and
visibility from the args schema (delete ownerId and visibility from the args
block) and in the handler hard-code ownerId to "benchmark-runner" and visibility
to "private" (keep other fields and behavior unchanged) so benchmark datasets
are always owned by the fixed synthetic user and always private; reference
createInternal and the internalMutation args/handler when making the change.
Summary
populateWorkflowdirectly (no HTTP layer), so results reflect actual agent performance and metrics land in the samepopulateRunsConvex table as live app sessionsisPrimaryKeyannotations for proper dedupmake benchmarktarget for convenient invocation; supports--prompt,--out,--no-cleanup,--concurrencyflagspopulateRunstable,createInternal/deleteInternalmutations)Changes
benchmarks/run.mts— runner: creates temp Convex datasets, callspopulateWorkflow, reads back metrics, emits JSON summarybenchmarks/prompts.json— 4 test prompts withisPrimaryKeyon each natural PK columnbenchmarks/README.md— usage guide, JSON output format, Convex CLI queries, cost estimation; includes note that enumeration-step tokens are not captured inRunMetricsfrontend/convex/datasets.ts—createInternal/deleteInternalinternal mutations for benchmark dataset lifecycle (no Clerk auth required)makefiles/Makefile—benchmarktargetTest plan
make devis running with.envfilled inmake benchmark ARGS="--prompt yc-recent-batch-companies --no-cleanup"completes and prints a JSON summarypopulateRunsConvex table withisBenchmark: truemake benchmark ARGS="--prompt us-national-parks"cleans up the temp dataset afterwardmake benchmark ARGS="--concurrency 2"runs two prompts in parallel without errors🤖 Generated with Claude Code