Skip to content

[FE Fix]: Re-enable full-page playground for evaluator workflows#4474

Open
ardaerzin wants to merge 22 commits into
mainfrom
fe-fix/app-workflow-router-unification-regression-fix
Open

[FE Fix]: Re-enable full-page playground for evaluator workflows#4474
ardaerzin wants to merge 22 commits into
mainfrom
fe-fix/app-workflow-router-unification-regression-fix

Conversation

@ardaerzin
Copy link
Copy Markdown
Contributor

@ardaerzin ardaerzin commented May 28, 2026

Summary

PR #4384 disabled EVALUATOR_FULL_PAGE_NAV_ENABLED because the app-style playground was a regression for evaluators (lost the upstream-app connection) and app-scoped observability defaulted to "invocation" instead of "annotation" for evaluator workflows. This change addresses both blockers and re-enables the flow by default.

Playground

  • added app chaining for evaluator workflows
  • minor ui fixes

Observability

  • fixed and improved filtering for evaluator workflows

Other

closes #4525

QA follow-up

  • full app pages router tests for evaluator workflows, and checking against reasons why we disabled this feature after its initial release

Demo

Checklist

  • I have included a video or screen recording for UI changes, or marked Demo as N/A
  • Relevant tests pass locally
  • Relevant linting and formatting pass locally
  • I have signed the CLA, or I will sign it when the bot prompts me

Contributor Resources

@vercel
Copy link
Copy Markdown

vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jun 5, 2026 9:57pm

Request Review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR enables Phase 5 evaluator full-page playground navigation, adds app connect/disconnect flows derived from the playground node graph, and refactors observability trace_type defaults, persisted filter composition, and UI reconciliation to be workflow-aware.

Changes

Evaluator Full-Page Navigation & Observability Integration

Layer / File(s) Summary
Trace type defaults helper and workflow schema exports
web/packages/agenta-entities/src/workflow/core/traceTypeDefault.ts, web/packages/agenta-entities/src/workflow/core/index.ts, web/packages/agenta-entities/src/workflow/index.ts, web/packages/agenta-entities/tests/unit/traceTypeDefault.test.ts, web/packages/agenta-entities/src/workflow/core/schema.ts
Adds defaultTraceTypeForWorkflow plus types and unit tests; documents workflow_slug/workflow_variant_slug usage for evaluator reference chains.
Feature flag and router navigation
web/oss/src/state/workflow/flags.ts, web/oss/src/components/PlaygroundRouter/index.tsx
Enables EVALUATOR_FULL_PAGE_NAV_ENABLED and routes non-feedback evaluator workflows to ConfigureEvaluatorPage.
App connection state management
web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts
Derives selected app label from playground node graph; persists selection only after graph mutations succeed; adds disconnectAppFromEvaluatorAtom.
Evaluator header UI with app controls
web/oss/src/components/Evaluators/components/ConfigureEvaluator/EvaluatorPlaygroundHeader.tsx
Adds disconnect button (picker footer and standalone icon), handleDisconnect, computes hasAppSelected, and renders TestsetDropdown unconditionally with updated comment.
ConfigureEvaluatorPage for full-page mode
web/oss/src/components/Evaluators/components/ConfigureEvaluator/index.tsx
Removes run-disabled gating and inline picker prompt; wires handleAppSelect; sets parentLabel: "Application" in app workflow adapter.
Evaluators registry row-click simplified
web/oss/src/components/Evaluators/index.tsx
Simplifies row-click routing to use EVALUATOR_FULL_PAGE_NAV_ENABLED && workflowId.
Drawer navigation and post-create flow
web/oss/src/components/WorkflowRevisionDrawerWrapper/index.tsx, web/oss/src/components/pages/evaluations/NewEvaluation/Components/CreateEvaluatorDrawer/index.tsx
Restores persisted app via connectApp (no direct label writes), removes explicit label clears, adds parentLabel: "Application", and simplifies post-create navigation gate.
Sidebar evaluator switcher gating
web/oss/src/components/Sidebar/components/WorkflowEntityCard.tsx
Builds switcher from nonArchivedEvaluatorsAtom gated by the feature flag and adds Evaluators group conditionally.
Trace type state persistence and derivation
web/oss/src/state/newObservability/atoms/controls.ts
Adds TraceTypeChoice, persists per-app/per-tab filters, and exposes effectiveTraceTypeAtomFamily derived using workflow defaults.
Filter regeneration and scope composition
web/oss/src/state/newObservability/atoms/controls.ts, web/oss/src/state/newObservability/atoms/queries.ts
filtersAtomFamily composes permanent scope filter + derived trace_type row + persisted user filters; tracesQueryAtom uses effectiveAppId and waits for workflow context resolution.
Filter UI reconciliation for evaluators
web/oss/src/components/Filters/Filters.tsx, web/oss/src/components/Filters/types.d.ts, web/oss/src/components/pages/observability/assets/filters/fieldAdapter.ts, web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
Adds reconcileFilterRows display-only projection; fieldAdapter adds referenceCategory and de-duplicates UI values; ObservabilityHeader supplies reconciler to remap reference categories based on derived trace_type.
Workflow adapter labeling
web/packages/agenta-entity-ui/src/selection/adapters/workflowRevisionRelationAdapter.ts, web/oss/src/components/Playground/Components/PlaygroundVariantConfig/assets/PlaygroundVariantConfigHeader.tsx
Adds parentLabel option (defaults to "Evaluator"); applied as "Application" in evaluator contexts to adjust UI copy and messages.
Evaluator trace reference construction
web/packages/agenta-playground/src/state/execution/executionRunner.ts
Adds buildEvaluatorSelfReferences and merges evaluator self-references into non-root stage references.
Playwright test coverage
web/oss/tests/playwright/acceptance/evaluators/tests.ts, web/oss/tests/playwright/acceptance/evaluators/index.ts
Exports new test constants and template name; rewrites and expands acceptance tests to exercise full-page playground flows and navigation gating.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Agenta-AI/agenta#4384: Related predecessor implementing earlier gating and follow-ups for evaluator full-page playground enablement.
  • Agenta-AI/agenta#4229: Related changes touching EvaluatorsRegistry row interactions and navigation behavior.
  • Agenta-AI/agenta#4265: Related observability/defaults updates used by this PR.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 60.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly identifies the main change: re-enabling the full-page playground feature for evaluator workflows, which is the primary objective addressed across all modified files.
Description check ✅ Passed The description clearly addresses the PR objectives by explaining the regressions being fixed and the changes made to re-enable the full-page playground for evaluators.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fe-fix/app-workflow-router-unification-regression-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ardaerzin ardaerzin marked this pull request as ready for review May 28, 2026 11:20
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. Frontend labels May 28, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts (1)

165-178: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Persisted app selection can get stale on failed connect/disconnect edge paths.

persistedAppSelectionAtom is written before the primary-node swap succeeds, and disconnect exits early without clearing persisted state when no downstream node is found. That can rehydrate an app selection that is no longer actually connected.

Proposed fix
 export const connectAppToEvaluatorAtom = atom(
@@
-        // Persist across sessions. The picker display label is derived from
-        // the depth-0 node's `label` via `selectedAppLabelAtom`, so no extra
-        // write needed here.
-        set(persistedAppSelectionAtom, {appRevisionId, appLabel})
-
         // Replace primary node with app
         const nodeId = set(playgroundController.actions.changePrimaryNode, {
             type: "workflow",
             id: appRevisionId,
             label: appLabel,
         })
 
         if (!nodeId) return
+        // Persist only after graph mutation succeeds.
+        set(persistedAppSelectionAtom, {appRevisionId, appLabel})
@@
 export const disconnectAppFromEvaluatorAtom = atom(null, (get, set) => {
     const nodes = get(playgroundController.selectors.nodes())
     const downstreamEvaluator = nodes.find((n) => n.depth > 0)
-    if (!downstreamEvaluator) return
+    if (!downstreamEvaluator) {
+        set(persistedAppSelectionAtom, null)
+        return
+    }

Also applies to: 208-225


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ce60569f-f33c-480b-a472-4ceb822d0b1e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9012d and 048d662.

📒 Files selected for processing (25)
  • web/oss/src/components/Evaluators/components/ConfigureEvaluator/EvaluatorPlaygroundHeader.tsx
  • web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts
  • web/oss/src/components/Evaluators/components/ConfigureEvaluator/index.tsx
  • web/oss/src/components/Evaluators/index.tsx
  • web/oss/src/components/Filters/Filters.tsx
  • web/oss/src/components/Filters/types.d.ts
  • web/oss/src/components/Playground/Components/PlaygroundVariantConfig/assets/PlaygroundVariantConfigHeader.tsx
  • web/oss/src/components/PlaygroundRouter/index.tsx
  • web/oss/src/components/Sidebar/components/WorkflowEntityCard.tsx
  • web/oss/src/components/WorkflowRevisionDrawerWrapper/index.tsx
  • web/oss/src/components/pages/evaluations/NewEvaluation/Components/CreateEvaluatorDrawer/index.tsx
  • web/oss/src/components/pages/observability/assets/filters/fieldAdapter.ts
  • web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
  • web/oss/src/state/newObservability/atoms/controls.ts
  • web/oss/src/state/newObservability/atoms/queries.ts
  • web/oss/src/state/workflow/flags.ts
  • web/oss/tests/playwright/acceptance/evaluators/index.ts
  • web/oss/tests/playwright/acceptance/evaluators/tests.ts
  • web/packages/agenta-entities/src/workflow/core/index.ts
  • web/packages/agenta-entities/src/workflow/core/schema.ts
  • web/packages/agenta-entities/src/workflow/core/traceTypeDefault.ts
  • web/packages/agenta-entities/src/workflow/index.ts
  • web/packages/agenta-entities/tests/unit/traceTypeDefault.test.ts
  • web/packages/agenta-entity-ui/src/selection/adapters/workflowRevisionRelationAdapter.ts
  • web/packages/agenta-playground/src/state/execution/executionRunner.ts

Comment thread web/oss/src/components/PlaygroundRouter/index.tsx Outdated
Comment thread web/oss/src/components/Sidebar/components/WorkflowEntityCard.tsx Outdated
Comment thread web/oss/src/state/newObservability/atoms/controls.ts
Comment thread web/oss/tests/playwright/acceptance/evaluators/index.ts Outdated
ardaerzin added a commit that referenced this pull request May 28, 2026
CodeRabbit flagged 5 issues on the evaluator-full-page rollout PR.
This commit addresses each:

1. PlaygroundRouter — `is_feedback` evaluators skip the full-page swap.
   `workflowKind === "evaluator"` was too broad. Human/feedback
   evaluators are drawer-only in /evaluators (they capture human input,
   they don't run), so routing them to ConfigureEvaluatorPage produced
   a run-controls UI for a workflow with nothing to run. Added a
   `flags.is_feedback` exclusion next to the workflowKind check.

2. Sidebar — switcher filters out `is_feedback` evaluators.
   `nonArchivedEvaluatorsAtom` only filters by `deleted_at` and
   includes human evaluators; the switcher was exposing entries that,
   when clicked, would land on the (now-correctly-gated) generic
   <Playground /> for a feedback workflow. Filtered the list at the
   switcher boundary.

3. controls.ts — handle array-valued `trace_type` for in/not_in.
   The dialog dispatches `{operator: "in", value: ["annotation"]}` for
   the IN operator family, but the intent setter only normalized
   scalars — so the user's choice was silently dropped to
   `{kind: "cleared"}`. Normalize to an array, filter to enum values,
   and collapse single-value arrays back to a scalar. Multi-value
   selections (which mean "no filter" for a 2-value enum) still map
   to `cleared`.

4. Playwright — drop stale `[data-row-key]` poll in select-app-and-run.
   The test asserted post-create navigation to /apps/<id>/playground
   AFTER polling for the new row in the evaluators table — but the
   redirect wins first, the table disappears, and the poll became a
   timing-dependent failure. Removed the registry-side wait;
   evaluator-in-registry assertion is covered by the
   post-create-row-click test alongside.

5. ConfigureEvaluator/atoms.ts — fix persistedAppSelectionAtom race.
   `connectAppToEvaluatorAtom` persisted the app selection BEFORE
   `changePrimaryNode` ran, so a failed swap (returns `null` with no
   primary to swap from) left a stale localStorage record that the
   next mount re-hydrated into a phantom "connected" state. Moved the
   persist call to after both graph mutations succeed.
   `disconnectAppFromEvaluatorAtom` early-returned on no-downstream
   without clearing the persisted state, allowing the same phantom
   record to survive a disconnect attempt. Clear it on that branch
   too.

No behavior change for the happy-path full-page flow — these all
narrow edge cases the reviewer flagged.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

Railway Preview Environment

Preview URL https://gateway-production-b67c.up.railway.app/w
Project agenta-oss-pr-4474
Image tag pr-4474-98073c5
Status Deployed
Railway logs Open logs
Workflow logs View workflow run
Updated at 2026-06-05T22:06:37.149Z

@junaway junaway changed the base branch from main to release/v0.100.7 May 29, 2026 12:46
@ardaerzin ardaerzin self-assigned this May 31, 2026
@mmabrouk mmabrouk marked this pull request as draft June 1, 2026 07:08
@mmabrouk mmabrouk marked this pull request as ready for review June 1, 2026 07:09
@ardaerzin ardaerzin changed the title fix(frontend): re-enable full-page playground for evaluator workflows [FE Fix]: Re-enable full-page playground for evaluator workflows Jun 1, 2026
Copy link
Copy Markdown
Member

@mmabrouk mmabrouk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ardaerzin
Copy link
Copy Markdown
Contributor Author

Failed to build -> https://github.com/Agenta-AI/agenta_cloud/actions/runs/26751245314/job/78839925341

fail is not related to a frontend change tho

@mmabrouk mmabrouk changed the base branch from release/v0.100.7 to main June 2, 2026 13:22
ardaerzin added a commit that referenced this pull request Jun 2, 2026
…ssion-fix

Resolved conflicts:
- web/oss/src/components/Filters/Filters.tsx — kept this branch's
  `displayedFilter` (reconciles filter rows for evaluator workflows) with
  main's `filterContainerClass` plain-class styling.
- web/packages/agenta-playground/src/state/execution/executionRunner.ts —
  kept this branch's `stageReferences` builder which merges upstream app
  references with the evaluator's own self-references (via
  `buildEvaluatorSelfReferences`). Main's variant dropped references for
  evaluator stages, which was the regression PR #4474 is fixing — evaluator
  traces need `references.evaluator.slug` attached so they are searchable
  on the evaluator's /apps/<evalId>/traces page.
ardaerzin and others added 14 commits June 4, 2026 18:45
…ssion-fix

Resolves a single conflict in
`web/packages/agenta-entities/src/workflow/core/schema.ts` —
release v0.100.4 added `artifact_slug` / `variant_slug` to the
revision schema alongside the `workflow_slug` /
`workflow_variant_slug` fields this branch had introduced for
emitting evaluator references on playground chain runs.

Both sides added `workflow_slug` and `workflow_variant_slug`
with overlapping intent; resolution keeps all four fields
and merges the two doc comments into one that covers both
purposes (parent-workflow identification for ID-less callers
+ evaluator chain-trace emission).

No source behavior change — schema is additive on both sides.
…ssion-fix

Resolved conflicts:
- web/oss/src/components/Filters/Filters.tsx — kept this branch's
  `displayedFilter` (reconciles filter rows for evaluator workflows) with
  main's `filterContainerClass` plain-class styling.
- web/packages/agenta-playground/src/state/execution/executionRunner.ts —
  kept this branch's `stageReferences` builder which merges upstream app
  references with the evaluator's own self-references (via
  `buildEvaluatorSelfReferences`). Main's variant dropped references for
  evaluator stages, which was the regression PR #4474 is fixing — evaluator
  traces need `references.evaluator.slug` attached so they are searchable
  on the evaluator's /apps/<evalId>/traces page.
Issue: In the LLM-as-a-judge playground, switching the chained app from a
chat application to a completion application kept sending `context` and
`messages` from the previous app in the new request body.

Root cause: At `executionRunner.ts` for depth=0 (the root entity), the
runner spreads the entire row's data into `nodeInputs` (`{...data}`) and
hands it to the stage handle as `inputValues`. The downstream filter in
`resolveVariableValues` / `buildCompletionInputRow` correctly drops keys
that aren't in the entity's input variables — when `variables` is non-
empty. But when the entity's input ports haven't resolved yet (entity
mid-hydration) or genuinely declares no input variables, that filter
falls back to "spread every key from the row", which is exactly the
window in which stale chat-shape keys (`messages`, `context`) leak into
a completion request.

Fix: Filter `data` at the runner against the entity's declared
`inputSchema.properties` BEFORE building `nodeInputs`. This applies to
both the first execution (line ~417) and the repetition retries (line
~689). When the entity has no resolvable input schema, the helper falls
back to `{...data}` so workflows that genuinely depend on free-form
input (e.g. `__rawBody` app workflows whose variables live in
`__meta.variables`) keep working.

The fix is safe for chat mode: chat strips `messages` separately at line
587 of `executionItems.ts` and rebuilds the conversation from
`chatHistory` via `messageIdsAtomFamily(loadableId)` — independent of
`inputValues`.

Defense-in-depth: this complements the existing
`resolveVariableValues` filter rather than replacing it.
The evaluator info notice in SingleLayout rendered with hardcoded
light-mode colors (bg-blue-50, text-gray-700) and was unreadable
against the dark UI. Add dark: variants to background, border,
icon, body text, and dismiss button to match the existing
dark:bg-blue-900/* pattern used elsewhere in the app.
Previous attempt used dark:text-gray-200 which conflicted with the
themeAwareColors CSS-variable layer — the gray scale is role-inverted
in dark mode, so dark:text-gray-200 resolved to a dark shade against
the dark callout background.

Switch overrides to the blue scale (not theme-flipped): dark:text-blue-50
for body text, dark:text-blue-300 for the icon, and dark:text-blue-200
for the dismiss button. All readable against dark:bg-blue-900/20.
The first #4525 fix only covered the depth=0 (root entity) path. In
the LLM-as-a-judge evaluator playground the chained app sits at
depth>0, where input construction goes through resolveChainInputs
(spreads testcaseData on the no-mapping branch) or
buildEvaluatorExecutionInputs (spreads testcaseData when the schema
allows additionalProperties). Both paths re-leak the stale
`messages` field from a previous chat app into the current target
entity's request body.

Add stripChatTransportForEntity — a targeted strip of known chat-
transport keys (currently just `messages`) that runs unless the
target entity's input schema explicitly declares them. Applied:

- depth=0 path: as a defense-in-depth pass after the strict
  filterDataToEntityInputSchema, so the spread fallback (taken
  while the new app's schema is mid-hydration) can't leak the
  stale field either.
- depth=0 repetition path: same.
- depth>0 path: pre-filters `data` before chain / evaluator input
  construction. Uses a targeted strip (rather than the strict
  schema filter) so evaluators that legitimately depend on
  additionalProperties: true spread of testcase columns keep
  receiving them.

The helper short-circuits to the input reference when no chat
transport keys are present, so there's no allocation in the
common path.
…chat keys

Diagnostic telemetry for #4525 / AGE-3793 — three console.warn signals
in executionRunner so we can tell which layer is actually rescuing the
request body during a chat→completion swap:

1. filterDataToEntityInputSchema schema-not-resolved fallback — the
   strict allow-list can't run because workflowMolecule.selectors
   .ioSchemas returned no inputSchema.properties. Logs the entityId,
   the reason (no-properties vs properties-not-object), the data keys
   present, and whether `messages` is among them.
2. filterDataToEntityInputSchema empty-properties fallback — schema
   resolved but Object.keys(properties).length === 0. Same payload.
3. stripChatTransportForEntity strip — emits only when a chat-transport
   key was actually dropped, with which keys and whether the entity
   schema was resolved at the time of the strip.

All three are warn-level so they're visible in production console
without code changes, and gated to the unusual paths so the happy
path stays quiet.
…4525)

Move the stale-key fix from execution-time stripping to the layer
where it belongs: the testcase row store, on swap of the primary
entity. The testcaseMolecule is shared across loadables, so when the
user swaps the chained app in the LLM-as-a-judge playground (anchor
positional swap in setEntityIdsAtom), the same rows now carry every
key the previous primary populated — `messages` from a prior chat
app, completion variables from a prior completion app, etc.

Reconciliation strategy (decided with the user):
- Closed schema (additionalProperties: false): drop any row key not
  declared by the new entity's inputSchema.properties. Drops silently
  — no toast, no confirm modal. Matches what the user typed for the
  new app and nothing more.
- Open schema (additionalProperties not set or true): only strip the
  CHAT_TRANSPORT_KEYS set (currently `messages`). Evaluator
  workflows that legitimately depend on additionalProperties spread
  keep receiving their extra testcase columns.
- Schema not resolved: skip. The execution-time strip in
  executionRunner.ts is the fallback during this hydration window —
  it will be removed in a follow-up commit once the row-layer fix is
  verified end-to-end and a reactive deferred reconciliation handles
  the hydration race.

Mutation goes through testcaseMolecule.actions.batchUpdate with
stale keys set to `undefined` (the store's update reducer
interprets that as a delete). Drafts are created per affected row.

A console.warn is emitted in two cases:
- schema-not-resolved on swap (so we can verify the hydration race
  surface area in practice).
- one summary per swap that lists which keys were dropped per row
  and the schema mode (closed vs open).
…chema (#4525)

Root cause of the `context` leak that survived the prior fixes: both
the row prune and the runtime filter read the allow-list from
`workflowMolecule.selectors.ioSchemas(entityId).inputSchema.properties`,
which is EMPTY for completion apps. Completion apps express their
variables as prompt template placeholders surfaced through
`inputPorts`, not through the static input schema. So the filter
degraded to its empty-properties fallback (keep everything) and only
the hardcoded chat-transport strip removed `messages` — `context`
(a real chat template var, stale on the row) sailed through.

Diagnostic confirmation from the repro console:
  [executionRunner.filter] empty-properties fallback
    {entityId, dataKeys: ['messages','context','country'], hasMessagesKey: true}

Fix: new shared helper `state/helpers/entityInputContract.ts` that
resolves the allow-list the SAME way executionItems builds request
`variables`:
  variablesFromInputPorts = inputPorts[].key
  variablesFromPayload    = requestPayload.__meta.variables
                            ?? requestPayload.variables ?? []
  variables = inputPorts.length > 0 ? inputPorts : payload
  (+ `messages` when executionMode === 'chat')

`reconcileRowDataForEntity` applies the policy:
  - app with resolved contract → strict allow-list (drops context+messages)
  - evaluator → chat-transport-only strip (preserves additionalProperties
    spread of extra testcase columns)
  - unresolved contract → chat-transport-only safety strip

Both consumers now delegate to it:
  - playgroundController.pruneTestcaseRowsForEntity (swap-time, primary fix)
  - executionRunner.reconcileEntityInputData (exec-time hydration safety net)

This collapses the three ad-hoc helpers (filterDataToEntityInputSchema,
stripChatTransportForEntity, getEntityInputSchema) into one correct
source-of-truth resolution and removes the now-misleading
empty-properties / schema-not-resolved diagnostics.
…nputs (#4525)

The prior fix stripped stale keys from the APP's request inputs, but the
trace still showed `context` because it surfaces in the downstream
EVALUATOR's {inputs, outputs} envelope. The evaluator reads the SAME
shared testcase row, and the evaluator policy (chat-transport-only)
intentionally preserves non-`messages` keys — so `context` survived
there. The UI row also still showed it.

Also: the swap-time prune in setEntityIdsAtom never fired for this flow —
the evaluator playground selects the app via add/remove node actions in
ConfigureEvaluator, not a setEntityIds positional swap (no
[playgroundController.prune] log appeared in the repro).

Fix: reconcile the shared testcase row against the ROOT entity's input
contract in webWorkerIntegration, right before execution — path-agnostic,
fires on every run regardless of how the app was selected. The cleaned
row is:
  - passed to the runner (so app request AND evaluator envelope are clean),
  - written back via loadableController.actions.updateRow (so the UI and
    future runs reflect it; undefined values delete the keys).

Evaluator-referenced columns are protected: collectDownstreamReferencedColumns
gathers testcase columns named by downstream evaluator `<input>_key`
settings (e.g. correct_answer_key → ground_truth) and passes them as
protectedKeys, so a strict clean against the app contract never drops
intentional evaluation inputs.

reconcileRowDataForEntity gains an optional protectedKeys set; a key
survives strict filtering when it's in the app allow-list OR protected.

Emits [webWorker.reconcile] when keys are dropped, listing the strategy,
dropped keys, and protected columns.
…4525)

Two follow-ups now that the row-reconciliation fix is verified working:

1. Clean-on-swap: the evaluator playground selects an app via
   changePrimaryNode + connectDownstreamNode (ConfigureEvaluator), not a
   setEntityIds positional swap — so the previously-wired swap-time prune
   never fired there. Add playgroundController.actions.reconcileRowsToPrimary
   and call it from connectAppToEvaluatorAtom AFTER connectDownstreamNode, so
   the shared testcase row is cleaned the instant the app changes (not only at
   run time). Running after the downstream connect means the evaluator's
   referenced columns (correct_answer_key → ground_truth, etc.) are protected
   from the strict app-contract clean.

   pruneTestcaseRowsForEntity now:
   - collects downstream-evaluator protected columns,
   - returns a status ('acted' | 'noop' | 'unresolved').

   reconcileRowsToPrimary handles the hydration race: if the new primary's
   inputPorts aren't resolved yet AND the entity isn't loaded, it subscribes
   to inputPorts and retries once, then unsubscribes. If the entity is loaded
   but genuinely has no variables, it doesn't subscribe (no dangling sub). The
   run-time reconciliation in webWorkerIntegration remains the backstop.

2. Remove diagnostic logs added while tracing the bug:
   [executionRunner.filter], [webWorker.reconcile],
   [playgroundController.prune]. The reconcile + writeback logic stays; only
   the console.warn telemetry is dropped.
Adds a 'Run on' control to the evaluator (LLM-as-a-judge) playground header
so the first/empty state explains itself instead of leaving the user with two
disconnected loaders. Three modes, each drawing its own data-flow:

- Run directly on a test case  (Data -> Evaluator -> Score)
- Run on an app output         (Data -> App -> Output -> Evaluator -> Score) - default
- Run on a trace               (Trace -> Evaluator -> Score) - disabled for now

The mode is persisted per project; a connected app forces effective 'app' mode.
In app mode with no app connected, the run panel hides the testcases and shows a
centered 'Select an app' empty state (shared with the evaluator-creation drawer).
All colors come from the antd theme token so it follows light/dark mode.

Prompt playground is intentionally untouched.
@mmabrouk mmabrouk marked this pull request as ready for review June 5, 2026 09:28
@dosubot dosubot Bot added the UI label Jun 5, 2026
feat(frontend): Run-on mode selector for the evaluator playground
… evals from switcher

QA round 2 (2026-06-05):

[A] LLM-as-judge chain runs could ship a `local-…` draft id as
references.evaluator_revision.id, which the backend rejects as a non-UUID
(422). buildEvaluatorSelfReferences now drops any id that isLocalDraftId()
(workflow_id, variant_id, revision id); slugs/version are kept. NOTE: the
critical drawer direct-invoke path (depth-0) is a separate reference
builder owned by Mahmoud — same isLocalDraftId guard applies there.

[C] The sidebar workflow switcher showed human (feedback) evaluators. The
old filter `!w.flags?.is_feedback` ran on nonArchivedEvaluatorsAtom LIST
records, which carry no `data.uri` and no `is_feedback`/`is_llm`/`is_code`
flags — those live on the revision, not the parent artifact — so the filter
never excluded anything. Switch to fullPagePlaygroundEvaluatorsAtom, the
existing drop-in that resolves flags from each evaluator's latest revision
and excludes human AND declarative-classifier evaluators (both route to an
/apps/<id> destination the guard bounces back to /evaluators).
Temporary console.debug instrumentation to pin the QA bug where
re-selecting the same app after disconnect connects nothing. Logs the
connect-flow decision points:
  - handleAppSelect fired (+ whether it bails on null evaluatorNode)
  - connectApp entry (nodesBefore)
  - changePrimaryNode result (nodeId, nodesAfter)
  - connectDownstreamNode result (downstreamResult, nodesAfter)
  - connectApp done (finalNodes)

To be reverted once root-caused. Filter the browser console by
`[B-repro]` during repro.
QA round 2: in the evaluator playground, selecting an app → disconnecting
→ re-selecting the SAME app connected nothing in the UI (workflow selector
+ generation panel stayed on the 'Select an app' empty state).

Root cause (pinned via runtime instrumentation): the node graph IS correct
after reconnect — connectAppToEvaluatorAtom writes playgroundNodesAtom to
[app, evaluator] and a follow-up read confirms 2 nodes in the single store.
But on a disconnect→reconnect cycle jotai applies the two sequential
playgroundNodesAtom writes (changePrimaryNode → connectDownstreamNode)
WITHOUT notifying the mounted dependents, so selectedAppLabelAtom /
hasAppConnectedAtom (and the package's generation-panel atoms) never
recompute and the UI shows stale 'disconnected' state. First-connect and
disconnect notify fine; only the reconnect drops the notification.

Fix: after the graph mutations in connectAppToEvaluatorAtom, read the
node-derived display atoms (selectedAppLabelAtom, hasAppConnectedAtom) to
re-establish the dependency and flush the pending notification to their
subscribers. Verified locally: reconnect now updates both the selector and
the generation panel.

Also removes the temporary [B-repro] diagnostics added while root-causing.
…QA critical)

The critical QA bug: invoking an LLM-as-a-judge evaluator opened from the
drawer 422'd because the request shipped
references.evaluator_revision.id = "local-…" (an unsaved local-draft id),
which the backend /invoke validator rejects as a non-UUID.

buildEvaluatorSelfReferences (chain stage refs) was already guarded, but
references can also arrive from the requestPayload builder and from
trace-span extraction. Rather than chase each builder, add a single final
sanitization at the one chokepoint where the request body's references are
assembled (buildExecutionItem, after all sources are merged): drop any
reference id that is a local-draft or placeholder id, keep slug/version,
and drop a slot that ends up empty.

Path-agnostic — covers the drawer direct-invoke, the chained evaluator
playground, and any future reference source.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Frontend Improvement size:XXL This PR changes 1000+ lines, ignoring generated files. tests UI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong requests after switching app in evaluator playground

4 participants