Skip to content

Add benchmark runner for populate workflow#101

Open
MMeteorL wants to merge 8 commits into
mainfrom
feature/benchmark-testing-2
Open

Add benchmark runner for populate workflow#101
MMeteorL wants to merge 8 commits into
mainfrom
feature/benchmark-testing-2

Conversation

@MMeteorL
Copy link
Copy Markdown
Collaborator

Summary

  • Adds a benchmark suite that exercises the real populateWorkflow directly (no HTTP layer), so results reflect actual agent performance and metrics land in the same populateRuns Convex table as live app sessions
  • 4 built-in prompts (YC companies, B2B SaaS, US national parks, AI research labs), each with isPrimaryKey annotations for proper dedup
  • make benchmark target for convenient invocation; supports --prompt, --out, --no-cleanup, --concurrency flags
  • Depends on: Add cost tracking for populate workflow runs #100 (cost tracking — populateRuns table, createInternal/deleteInternal mutations)

Changes

  • benchmarks/run.mts — runner: creates temp Convex datasets, calls populateWorkflow, reads back metrics, emits JSON summary
  • benchmarks/prompts.json — 4 test prompts with isPrimaryKey on each natural PK column
  • benchmarks/README.md — usage guide, JSON output format, Convex CLI queries, cost estimation; includes note that enumeration-step tokens are not captured in RunMetrics
  • frontend/convex/datasets.tscreateInternal / deleteInternal internal mutations for benchmark dataset lifecycle (no Clerk auth required)
  • makefiles/Makefilebenchmark target

Test plan

  • make dev is running with .env filled in
  • make benchmark ARGS="--prompt yc-recent-batch-companies --no-cleanup" completes and prints a JSON summary
  • A row appears in the populateRuns Convex table with isBenchmark: true
  • make benchmark ARGS="--prompt us-national-parks" cleans up the temp dataset afterward
  • make benchmark ARGS="--concurrency 2" runs two prompts in parallel without errors

🤖 Generated with Claude Code

MMeteorL and others added 2 commits May 27, 2026 16:47
Records search/fetch/investigate call counts, token usage (input/output),
and timing per populate run to a new Convex `populateRuns` table. Metrics
are counted by scanning result.steps[].toolCalls after agent.generate()
returns — no factory wrappers needed. The save is fire-and-forget in the
finally block so it never blocks or fails the workflow.

New files:
- backend/src/mastra/run-metrics.ts — RunMetrics accumulator class
- backend/src/mastra/save-run-metrics.ts — writes completed run to Convex
- frontend/convex/populateRuns.ts — insert mutation + list/get queries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move buildPopulateAgent into the try block so any construction error
  is caught and the finally block (metrics save) always executes
- Add status/error consistency guard to populateRuns.insert so a
  "success" record can't carry an error string and an "error" record
  can't be written without a message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a complete benchmarking framework for the populate workflow. It adds internal Convex mutations to create and delete datasets without authentication, a TypeScript benchmark runner that orchestrates workflow invocation with configurable concurrency, and infrastructure to collect and aggregate metrics into structured JSON reports. Four benchmark datasets are predefined, comprehensive documentation covers usage and metric interpretation, and a Make target integrates the runner into the development environment.

Suggested reviewers

  • simantak-dabhade
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: adding a benchmark runner for the populate workflow.
Description check ✅ Passed The description comprehensively covers the changes, objectives, test plan, and dependencies; all content is directly related to the benchmark suite being added.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/benchmark-testing-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Renames the Convex table and module that stores per-run metrics from
populateRuns to runStats for a cleaner, more descriptive name.

- frontend/convex/populateRuns.ts → runStats.ts
- schema.ts: table definition populateRuns → runStats
- save-run-metrics.ts: internal.populateRuns.insert → internal.runStats.insert
- run-metrics.ts: update comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MMeteorL MMeteorL force-pushed the feature/benchmark-testing-2 branch from 1d85d2c to f95611c Compare May 28, 2026 04:55
Two bugs in the Mastra FullOutput step scanning:

1. Tool name is nested at tc.payload.toolName, not tc.toolName
   Mastra's ToolCallChunk wraps everything in a payload object
   (type: 'tool-call', payload: { toolName, toolCallId, args }).
   The scan now reads tc.payload?.toolName with tc.toolName as fallback.

2. result.usage is last-step tokens only; result.totalUsage is the sum
   across all steps (per Mastra FullOutput type definition). Switch
   addOrchestratorResult/addInvestigateResult to prefer totalUsage so
   multi-step runs aren't undercounted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MMeteorL MMeteorL force-pushed the feature/benchmark-testing-2 branch from f95611c to 54d1251 Compare May 28, 2026 06:20
MMeteorL and others added 4 commits May 28, 2026 00:05
Per-step toolCalls snapshots are captured at step-finish time. Tool-call
chunks that arrive after their step's step-finish event (e.g. non-streaming
tool-call events that follow a streaming-end synthetic) land in the next
step's buffer, causing systematic miscounts.

result.toolCalls is the flat accumulated list maintained by Mastra's stream
processor as chunks arrive — it is the authoritative source. Using it
directly avoids the step-ordering sensitivity entirely.

Also adds toolCalls length to the [run_subagent] done log line for easier
verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a benchmark suite that exercises the real populateWorkflow directly
(no HTTP layer) so metrics reflect actual agent performance. Results are
stored in the same populateRuns Convex table as live app sessions, tagged
with isBenchmark: true for filtering.

- benchmarks/run.mts: runner that creates temporary Convex datasets, calls
  populateWorkflow, reads back metrics, and emits a JSON summary
- benchmarks/prompts.json: 4 test prompts (YC companies, B2B SaaS, national
  parks, AI research labs)
- benchmarks/README.md: instructions for running benchmarks and querying
  live run stats via the Convex dashboard or CLI
- datasets.createInternal / deleteInternal: Convex mutations for benchmark
  dataset lifecycle (no Clerk auth required)
- make benchmark target for convenient invocation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add isPrimaryKey: true to the natural PK column in each of the 4
  benchmark prompts so the orchestrator gets explicit pk guidance and
  deduplication works correctly with the run_subagent tool's .refine()
  enforcement
- Fix investigateCalls description: investigate_row → run_subagent
- Add note that enumeration-step tokens are not captured in RunMetrics
- Update custom-prompt example to show isPrimaryKey usage
- Update cost table model name: Kimi K2 → DeepSeek V4 Pro

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follows the rename in feature/cost-tracking. Updates the Convex query
call and all documentation references in run.mts and README.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MMeteorL MMeteorL force-pushed the feature/benchmark-testing-2 branch from 54d1251 to a642945 Compare May 28, 2026 07:06
@MMeteorL
Copy link
Copy Markdown
Collaborator Author

@coderabbitai
Please try to review this PR again. This branch should only be merged after PR100.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

@MMeteorL Sure! I'll kick off a fresh review of this PR now, keeping in mind that it depends on #100 and should only be merged after that PR lands.


✏️ Learnings added
Learnt from: MMeteorL
URL: https://github.com/tinyfish-io/bigset/pull/101

Timestamp: 2026-05-29T21:38:20.298Z
Learning: PR `#101` in tinyfish-io/bigset ("Add benchmark runner for populate workflow") depends on PR `#100` (cost tracking — `populateRuns` table, `createInternal`/`deleteInternal` mutations) and should only be merged after PR `#100` is merged.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/README.md`:
- Around line 201-203: Add a language tag to the fenced log block containing
"[populate-agent] populate-agent start" so markdownlint MD040 stops flagging it;
change the block from ``` to ```text (i.e., replace the opening triple backticks
with ```text) to label the sample log as plain text.

In `@benchmarks/run.mts`:
- Around line 105-121: The parseArgs function currently accepts missing option
values and non-positive concurrency, which can cause hangs or silent failures;
update parseArgs to validate that options that require values (--prompt,
--prompts, --out, --concurrency) actually have a following argument (check i+1 <
argv.length before consuming argv[++i]) and throw or exit with a clear error if
missing, and after parsing ensure result.concurrency is a positive integer
(check Number.isInteger(concurrency) && concurrency > 0 and reject
NaN/0/negative values), also guard against parseInt producing NaN for
--concurrency and fail fast rather than allowing undefined/NaN to propagate.
Include references to parseArgs, promptIds, promptsFile, outDir, and concurrency
in the changes.
- Around line 237-285: In buildSummary, stop filtering out successful runs that
lack metrics: keep successCount based on status === "success" (e.g., compute
successful = results.filter(r => r.status === "success") and failed =
results.filter(r => r.status !== "success")), but compute totals only from runs
that actually have metrics (e.g., metricsAvailable = successful.filter(r =>
r.metrics) and use metricsAvailable in the reduce and for n for averages).
Update referenced symbols: buildSummary, successful, failed, totals, and
perRunAverages so promptCount === successCount + failureCount and averages are
computed from metricsAvailable (with a fallback n = metricsAvailable.length ||
1).

In `@frontend/convex/datasets.ts`:
- Around line 251-277: The createInternal internalMutation currently accepts
caller-supplied ownerId and visibility which lets internal callers mint
user-owned or public datasets; remove ownerId and visibility from the args
schema (delete ownerId and visibility from the args block) and in the handler
hard-code ownerId to "benchmark-runner" and visibility to "private" (keep other
fields and behavior unchanged) so benchmark datasets are always owned by the
fixed synthetic user and always private; reference createInternal and the
internalMutation args/handler when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b00257fa-79af-4a5d-8957-e442d8832290

📥 Commits

Reviewing files that changed from the base of the PR and between 97904b3 and a642945.

📒 Files selected for processing (5)
  • benchmarks/README.md
  • benchmarks/prompts.json
  • benchmarks/run.mts
  • frontend/convex/datasets.ts
  • makefiles/Makefile

Comment thread benchmarks/README.md
Comment on lines +201 to +203
```
[populate-agent] populate-agent start
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to this fenced block.

markdownlint MD040 will keep failing here until this sample log block is labeled. text is enough.

Proposed fix
-```
+```text
 [populate-agent] populate-agent start
</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 201-201: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @benchmarks/README.md around lines 201 - 203, Add a language tag to the
fenced log block containing "[populate-agent] populate-agent start" so
markdownlint MD040 stops flagging it; change the block from totext
(i.e., replace the opening triple backticks with ```text) to label the sample
log as plain text.


</details>

<!-- fingerprinting:phantom:poseidon:grasshopper -->

<!-- This is an auto-generated comment by CodeRabbit -->

Comment thread benchmarks/run.mts
Comment on lines +105 to +121
function parseArgs(argv: string[]) {
const result = {
promptIds: [] as string[],
promptsFile: join(dirname(fileURLToPath(import.meta.url)), "prompts.json"),
outDir: null as string | null,
cleanup: true,
concurrency: 1,
};
for (let i = 0; i < argv.length; i++) {
const arg = argv[i];
if (arg === "--prompt") result.promptIds.push(argv[++i]);
else if (arg === "--prompts") result.promptsFile = argv[++i];
else if (arg === "--out") result.outDir = argv[++i];
else if (arg === "--no-cleanup") result.cleanup = false;
else if (arg === "--concurrency") result.concurrency = parseInt(argv[++i], 10);
}
return result;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject missing option values and non-positive concurrency.

--concurrency 0 makes the loop at Line 347 never advance, and a missing value after --prompt, --prompts, --out, or --concurrency currently becomes undefined/NaN silently. Fail fast during parsing so the runner cannot hang or no-op on malformed CLI input.

Proposed fix
 function parseArgs(argv: string[]) {
   const result = {
     promptIds: [] as string[],
     promptsFile: join(dirname(fileURLToPath(import.meta.url)), "prompts.json"),
     outDir: null as string | null,
     cleanup: true,
     concurrency: 1,
   };
+  const requireValue = (flag: string, value: string | undefined) => {
+    if (!value || value.startsWith("--")) {
+      throw new Error(`${flag} requires a value`);
+    }
+    return value;
+  };
   for (let i = 0; i < argv.length; i++) {
     const arg = argv[i];
-    if (arg === "--prompt") result.promptIds.push(argv[++i]);
-    else if (arg === "--prompts") result.promptsFile = argv[++i];
-    else if (arg === "--out") result.outDir = argv[++i];
+    if (arg === "--prompt") result.promptIds.push(requireValue(arg, argv[++i]));
+    else if (arg === "--prompts") result.promptsFile = requireValue(arg, argv[++i]);
+    else if (arg === "--out") result.outDir = requireValue(arg, argv[++i]);
     else if (arg === "--no-cleanup") result.cleanup = false;
-    else if (arg === "--concurrency") result.concurrency = parseInt(argv[++i], 10);
+    else if (arg === "--concurrency") {
+      const raw = requireValue(arg, argv[++i]);
+      const parsed = Number.parseInt(raw, 10);
+      if (!Number.isInteger(parsed) || parsed < 1) {
+        throw new Error(`--concurrency must be a positive integer, received "${raw}"`);
+      }
+      result.concurrency = parsed;
+    }
   }
   return result;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/run.mts` around lines 105 - 121, The parseArgs function currently
accepts missing option values and non-positive concurrency, which can cause
hangs or silent failures; update parseArgs to validate that options that require
values (--prompt, --prompts, --out, --concurrency) actually have a following
argument (check i+1 < argv.length before consuming argv[++i]) and throw or exit
with a clear error if missing, and after parsing ensure result.concurrency is a
positive integer (check Number.isInteger(concurrency) && concurrency > 0 and
reject NaN/0/negative values), also guard against parseInt producing NaN for
--concurrency and fail fast rather than allowing undefined/NaN to propagate.
Include references to parseArgs, promptIds, promptsFile, outDir, and concurrency
in the changes.

Comment thread benchmarks/run.mts
Comment on lines +237 to +285
function buildSummary(results: RunResult[]) {
const successful = results.filter((r) => r.status === "success" && r.metrics);
const failed = results.filter((r) => r.status !== "success");

const totals = successful.reduce(
(acc, r) => {
const m = r.metrics!;
acc.rowsInserted += m.rowsInserted;
acc.searchCalls += m.searchCalls;
acc.fetchCalls += m.fetchCalls;
acc.investigateCalls += m.investigateCalls;
acc.tokensInput += m.tokensInput;
acc.tokensOutput += m.tokensOutput;
acc.orchestratorSteps += m.orchestratorSteps;
acc.investigateSteps += m.investigateSteps;
acc.investigateRuns += m.investigateRuns;
acc.durationMs += r.durationMs;
return acc;
},
{
rowsInserted: 0,
searchCalls: 0,
fetchCalls: 0,
investigateCalls: 0,
tokensInput: 0,
tokensOutput: 0,
orchestratorSteps: 0,
investigateSteps: 0,
investigateRuns: 0,
durationMs: 0,
},
);

const n = successful.length || 1;

return {
completedAt: new Date().toISOString(),
promptCount: results.length,
successCount: successful.length,
failureCount: failed.length,
aggregate: totals,
perRunAverages: {
rowsInserted: +(totals.rowsInserted / n).toFixed(1),
searchCalls: +(totals.searchCalls / n).toFixed(1),
fetchCalls: +(totals.fetchCalls / n).toFixed(1),
investigateCalls: +(totals.investigateCalls / n).toFixed(1),
tokensTotal: +((totals.tokensInput + totals.tokensOutput) / n).toFixed(0),
durationSeconds: +((totals.durationMs / n / 1000).toFixed(1)),
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Count successful workflows separately from metrics availability.

A run can finish with status: "success" and still miss metrics if the async write lands after the fixed sleep in runPrompt. Right now those runs disappear from both successCount and failureCount, so promptCount can exceed the sum of both counters and the averages get underreported.

Proposed fix
 function buildSummary(results: RunResult[]) {
-  const successful = results.filter((r) => r.status === "success" && r.metrics);
+  const successful = results.filter((r) => r.status === "success");
+  const successfulWithMetrics = successful.filter((r) => r.metrics);
   const failed = results.filter((r) => r.status !== "success");
 
-  const totals = successful.reduce(
+  const totals = successfulWithMetrics.reduce(
     (acc, r) => {
       const m = r.metrics!;
       acc.rowsInserted += m.rowsInserted;
@@
-  const n = successful.length || 1;
+  const n = successfulWithMetrics.length || 1;
 
   return {
     completedAt: new Date().toISOString(),
     promptCount: results.length,
-    successCount: successful.length,
+    successCount: successful.length,
     failureCount: failed.length,
+    missingMetricsCount: successful.length - successfulWithMetrics.length,
     aggregate: totals,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/run.mts` around lines 237 - 285, In buildSummary, stop filtering
out successful runs that lack metrics: keep successCount based on status ===
"success" (e.g., compute successful = results.filter(r => r.status ===
"success") and failed = results.filter(r => r.status !== "success")), but
compute totals only from runs that actually have metrics (e.g., metricsAvailable
= successful.filter(r => r.metrics) and use metricsAvailable in the reduce and
for n for averages). Update referenced symbols: buildSummary, successful,
failed, totals, and perRunAverages so promptCount === successCount +
failureCount and averages are computed from metricsAvailable (with a fallback n
= metricsAvailable.length || 1).

Comment on lines +251 to +277
/**
* Create a dataset without Clerk auth. Used exclusively by the benchmark
* runner, which calls populateWorkflow directly (no HTTP layer) and
* therefore has no Clerk identity. The ownerId must be supplied by the
* caller; for benchmark runs pass "benchmark-runner".
*/
export const createInternal = internalMutation({
args: {
name: v.string(),
description: v.string(),
cadence: v.string(),
columns: v.array(columnValidator),
ownerId: v.string(),
visibility: v.optional(v.union(v.literal("public"), v.literal("private"))),
},
handler: async (ctx, args) => {
return await ctx.db.insert("datasets", {
name: args.name,
description: args.description,
cadence: args.cadence,
columns: args.columns,
ownerId: args.ownerId,
status: "paused",
visibility: args.visibility ?? "private",
rowCount: 0,
});
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lock benchmark datasets to a fixed synthetic owner.

This mutation is documented as benchmark-only, but it still accepts arbitrary ownerId and visibility. That makes it possible for any internal caller to mint user-owned or public datasets without going through the normal authz path. Since the current runner always uses benchmark-runner and private, those values should be fixed inside the mutation instead of being caller-controlled.

Proposed fix
 export const createInternal = internalMutation({
   args: {
     name: v.string(),
     description: v.string(),
     cadence: v.string(),
     columns: v.array(columnValidator),
-    ownerId: v.string(),
-    visibility: v.optional(v.union(v.literal("public"), v.literal("private"))),
   },
   handler: async (ctx, args) => {
     return await ctx.db.insert("datasets", {
       name: args.name,
       description: args.description,
       cadence: args.cadence,
       columns: args.columns,
-      ownerId: args.ownerId,
+      ownerId: "benchmark-runner",
       status: "paused",
-      visibility: args.visibility ?? "private",
+      visibility: "private",
       rowCount: 0,
     });
   },
 });

As per coding guidelines, "ownerId field on datasets stores identity.subject (Clerk user ID). Do not pass ownerId from the client."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Create a dataset without Clerk auth. Used exclusively by the benchmark
* runner, which calls populateWorkflow directly (no HTTP layer) and
* therefore has no Clerk identity. The ownerId must be supplied by the
* caller; for benchmark runs pass "benchmark-runner".
*/
export const createInternal = internalMutation({
args: {
name: v.string(),
description: v.string(),
cadence: v.string(),
columns: v.array(columnValidator),
ownerId: v.string(),
visibility: v.optional(v.union(v.literal("public"), v.literal("private"))),
},
handler: async (ctx, args) => {
return await ctx.db.insert("datasets", {
name: args.name,
description: args.description,
cadence: args.cadence,
columns: args.columns,
ownerId: args.ownerId,
status: "paused",
visibility: args.visibility ?? "private",
rowCount: 0,
});
},
/**
* Create a dataset without Clerk auth. Used exclusively by the benchmark
* runner, which calls populateWorkflow directly (no HTTP layer) and
* therefore has no Clerk identity. The ownerId must be supplied by the
* caller; for benchmark runs pass "benchmark-runner".
*/
export const createInternal = internalMutation({
args: {
name: v.string(),
description: v.string(),
cadence: v.string(),
columns: v.array(columnValidator),
},
handler: async (ctx, args) => {
return await ctx.db.insert("datasets", {
name: args.name,
description: args.description,
cadence: args.cadence,
columns: args.columns,
ownerId: "benchmark-runner",
status: "paused",
visibility: "private",
rowCount: 0,
});
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/datasets.ts` around lines 251 - 277, The createInternal
internalMutation currently accepts caller-supplied ownerId and visibility which
lets internal callers mint user-owned or public datasets; remove ownerId and
visibility from the args schema (delete ownerId and visibility from the args
block) and in the handler hard-code ownerId to "benchmark-runner" and visibility
to "private" (keep other fields and behavior unchanged) so benchmark datasets
are always owned by the fixed synthetic user and always private; reference
createInternal and the internalMutation args/handler when making the change.

Base automatically changed from feature/cost-tracking to main May 30, 2026 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant