Add a canary leakage audit for rubric and ground-truth isolation #8

@sunghunkwag

Summary

Workspace-Bench already has several useful isolation boundaries. Task metadata holds the rubrics and expected outputs, agent runs and rubric judging are separate stages, and agent_as_a_judge.py builds a restricted judge_view that exposes the original inputs plus candidate outputs while excluding GT-like directories such as output, output_cc, or gt.

I think it would be useful to add a small canary leakage audit that checks these boundaries stay intact across harnesses and future runner changes.

Why this matters

For workspace benchmarks, leakage can be subtle. An agent does not need the full answer key to overfit; it may be enough to see:

  • rubric-only wording,
  • expected output hints beyond the task prompt,
  • reference output filenames plus nearby GT artifacts,
  • judge-only metadata during the agent execution phase,
  • or stale files left in a restored workspace.

This is not a bug report. The current code already appears to care about this boundary. The proposal is to make the boundary testable.

Proposed canary audit

Add a tiny synthetic task or CI fixture with explicit canary strings placed in different visibility zones:

  1. Agent-visible input canary
    • Present in normal input files.
    • The agent is allowed to read this.
  2. Rubric-only canary
    • Present only in metadata.json rubrics.
    • Should not appear in the tested agent's prompt, trace, or workspace during the agent-run phase.
  3. Ground-truth canary
    • Present only in GT-like directories such as output, output_cc, or gt.
    • Should not be copied into the tested agent workspace.
    • Should not be copied into judge_view.
  4. Candidate-output canary
    • Present only if the tested agent generated it.
    • The judge is allowed to see this through judge_view/candidate_output.
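
A minimal sketch of how such a fixture could be generated, in case it helps make the zones concrete. The directory layout (input/, gt/), the file names, the canary strings, the exact shape of the rubrics field, and the build_canary_task helper are all hypothetical; only metadata.json rubrics and the GT-like directory names come from the existing repo as described above. The candidate-output canary is deliberately not written here, since it should exist only if the tested agent produces it.

```python
import json
from pathlib import Path

# Hypothetical canary tokens; any unique, unguessable strings would do.
CANARIES = {
    "input": "CANARY-INPUT-7f3a",
    "rubric": "CANARY-RUBRIC-91bc",
    "ground_truth": "CANARY-GT-2e0d",
    "candidate": "CANARY-CANDIDATE-c45e",  # written by the agent, not the fixture
}


def build_canary_task(root: Path) -> Path:
    """Create a synthetic task with one canary per pre-existing visibility zone."""
    task = root / "canary_task"
    (task / "input").mkdir(parents=True)
    (task / "gt").mkdir()

    # Zone 1: agent-visible input.
    (task / "input" / "notes.txt").write_text(f"Read me: {CANARIES['input']}\n")

    # Zone 2: rubric-only, stored in metadata.json (field shape is assumed).
    metadata = {"rubrics": [f"Output mentions {CANARIES['rubric']}"]}
    (task / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Zone 3: ground truth, stored in a GT-like directory.
    (task / "gt" / "expected.txt").write_text(f"Answer: {CANARIES['ground_truth']}\n")
    return task
```

Generating the fixture at test time, rather than committing it, would also let the canary strings be randomized per run so they cannot be memorized.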

Expected assertions:

agent phase:
  can see input canary
  cannot see rubric-only canary
  cannot see ground-truth canary

judge phase:
  can see input canary
  can see candidate-output canary
  can see rubrics if intentionally supplied to the judge
  cannot see ground-truth canary
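
These assertions could be checked with a plain file scan over whatever directory a phase can see. The sketch below is one possible shape, not the repo's API: find_canaries and check_phase are hypothetical helpers, and the canary names are assumed to be keys such as "input", "rubric", "ground_truth", and "candidate". It only inspects file contents; prompts and traces would need an equivalent string check on the captured text.

```python
from pathlib import Path


def find_canaries(root, canaries):
    """Map each canary name to the relative paths under root that contain it."""
    root = Path(root)
    hits = {name: [] for name in canaries}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        for name, token in canaries.items():
            if token.encode() in data:
                hits[name].append(path.relative_to(root).as_posix())
    return hits


def check_phase(hits, must_see, must_not_see):
    """Return a list of violations for one phase; an empty list means pass."""
    problems = [f"missing expected canary: {n}" for n in must_see if not hits.get(n)]
    problems += [
        f"leaked canary {n} in {p}" for n in must_not_see for p in hits.get(n, [])
    ]
    return problems
```

For the agent phase this would be called with must_see=["input"] and must_not_see=["rubric", "ground_truth"]; for the judge phase with must_see=["input", "candidate"] and must_not_see=["ground_truth"].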

Useful output

The audit could emit a small JSON report such as:

{
  "agent_visible_paths": ["..."],
  "judge_visible_paths": ["..."],
  "rubric_canary_seen_by_agent": false,
  "ground_truth_canary_seen_by_agent": false,
  "ground_truth_canary_seen_by_judge": false,
  "passed": true
}

This would make leakage failures easy to debug without exposing real benchmark answers.
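
As a rough sketch, the report could be assembled from per-phase scan results. The build_report helper and its inputs are assumptions: each hits argument is taken to be a dict mapping a canary name ("rubric", "ground_truth") to the list of paths where it was found. The field names match the JSON above. Note that "passed" in this sketch covers only the three leak flags, not the positive "can see" assertions.

```python
import json


def build_report(agent_paths, judge_paths, agent_hits, judge_hits):
    """Assemble the audit report; hits map canary name -> list of leaking paths."""
    report = {
        "agent_visible_paths": sorted(agent_paths),
        "judge_visible_paths": sorted(judge_paths),
        "rubric_canary_seen_by_agent": bool(agent_hits.get("rubric")),
        "ground_truth_canary_seen_by_agent": bool(agent_hits.get("ground_truth")),
        "ground_truth_canary_seen_by_judge": bool(judge_hits.get("ground_truth")),
    }
    # The audit passes only if no leak flag is set.
    leak_flags = [v for k, v in report.items() if "_seen_by_" in k]
    report["passed"] = not any(leak_flags)
    return report


if __name__ == "__main__":
    clean = build_report(["input/notes.txt"], ["input/notes.txt"], {}, {})
    print(json.dumps(clean, indent=2))
```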

Related reference

I have been working on a separate bounded verifier harness here:

https://github.com/sunghunkwag/rsi-metaforge-core

The relevant pattern is narrow: sealed hidden evaluations, explicit train-only rejection, and evidence that hidden expectations/scoring artifacts are not exposed to the adaptive loop. This is not an AGI claim; it is just a verifier-discipline pattern that seems relevant to Workspace-Bench's rubric and workspace isolation story.
