[FE Fix]: Re-enable full-page playground for evaluator workflows by ardaerzin · Pull Request #4474 · Agenta-AI/agenta

ardaerzin · 2026-05-28T11:12:40Z

Summary

PR #4384 disabled EVALUATOR_FULL_PAGE_NAV_ENABLED because the app-style playground was a regression for evaluators (lost the upstream-app connection) and app-scoped observability defaulted to "invocation" instead of "annotation" for evaluator workflows. This change addresses both blockers and re-enables the flow by default.

Playground

added app chaining for evaluator workflows
minor ui fixes

Observability

fixed and improved filtering for evaluator workflows

Other

closes #4525

QA follow-up

full app pages router tests for evaluator workflows, and checking against reasons why we disabled this feature after its initial release

Demo

Checklist

I have included a video or screen recording for UI changes, or marked Demo as N/A
Relevant tests pass locally
Relevant linting and formatting pass locally
I have signed the CLA, or I will sign it when the bot prompts me

Contributor Resources

vercel · 2026-05-28T11:12:46Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jun 5, 2026 9:57pm

coderabbitai · 2026-05-28T11:12:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR enables Phase 5 evaluator full-page playground navigation, adds app connect/disconnect flows derived from the playground node graph, and refactors observability trace_type defaults, persisted filter composition, and UI reconciliation to be workflow-aware.

Changes

Evaluator Full-Page Navigation & Observability Integration

Layer / File(s)	Summary
Trace type defaults helper and workflow schema exports `web/packages/agenta-entities/src/workflow/core/traceTypeDefault.ts`, `web/packages/agenta-entities/src/workflow/core/index.ts`, `web/packages/agenta-entities/src/workflow/index.ts`, `web/packages/agenta-entities/tests/unit/traceTypeDefault.test.ts`, `web/packages/agenta-entities/src/workflow/core/schema.ts`	Adds `defaultTraceTypeForWorkflow` plus types and unit tests; documents `workflow_slug`/`workflow_variant_slug` usage for evaluator reference chains.
Feature flag and router navigation `web/oss/src/state/workflow/flags.ts`, `web/oss/src/components/PlaygroundRouter/index.tsx`	Enables `EVALUATOR_FULL_PAGE_NAV_ENABLED` and routes non-feedback evaluator workflows to `ConfigureEvaluatorPage`.
App connection state management `web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts`	Derives selected app label from playground node graph; persists selection only after graph mutations succeed; adds `disconnectAppFromEvaluatorAtom`.
Evaluator header UI with app controls `web/oss/src/components/Evaluators/components/ConfigureEvaluator/EvaluatorPlaygroundHeader.tsx`	Adds disconnect button (picker footer and standalone icon), `handleDisconnect`, computes `hasAppSelected`, and renders `TestsetDropdown` unconditionally with updated comment.
ConfigureEvaluatorPage for full-page mode `web/oss/src/components/Evaluators/components/ConfigureEvaluator/index.tsx`	Removes run-disabled gating and inline picker prompt; wires `handleAppSelect`; sets `parentLabel: "Application"` in app workflow adapter.
Evaluators registry row-click simplified `web/oss/src/components/Evaluators/index.tsx`	Simplifies row-click routing to use `EVALUATOR_FULL_PAGE_NAV_ENABLED && workflowId`.
Drawer navigation and post-create flow `web/oss/src/components/WorkflowRevisionDrawerWrapper/index.tsx`, `web/oss/src/components/pages/evaluations/NewEvaluation/Components/CreateEvaluatorDrawer/index.tsx`	Restores persisted app via `connectApp` (no direct label writes), removes explicit label clears, adds `parentLabel: "Application"`, and simplifies post-create navigation gate.
Sidebar evaluator switcher gating `web/oss/src/components/Sidebar/components/WorkflowEntityCard.tsx`	Builds switcher from `nonArchivedEvaluatorsAtom` gated by the feature flag and adds Evaluators group conditionally.
Trace type state persistence and derivation `web/oss/src/state/newObservability/atoms/controls.ts`	Adds `TraceTypeChoice`, persists per-app/per-tab filters, and exposes `effectiveTraceTypeAtomFamily` derived using workflow defaults.
Filter regeneration and scope composition `web/oss/src/state/newObservability/atoms/controls.ts`, `web/oss/src/state/newObservability/atoms/queries.ts`	`filtersAtomFamily` composes permanent scope filter + derived `trace_type` row + persisted user filters; `tracesQueryAtom` uses `effectiveAppId` and waits for workflow context resolution.
Filter UI reconciliation for evaluators `web/oss/src/components/Filters/Filters.tsx`, `web/oss/src/components/Filters/types.d.ts`, `web/oss/src/components/pages/observability/assets/filters/fieldAdapter.ts`, `web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx`	Adds `reconcileFilterRows` display-only projection; `fieldAdapter` adds `referenceCategory` and de-duplicates UI values; `ObservabilityHeader` supplies reconciler to remap reference categories based on derived `trace_type`.
Workflow adapter labeling `web/packages/agenta-entity-ui/src/selection/adapters/workflowRevisionRelationAdapter.ts`, `web/oss/src/components/Playground/Components/PlaygroundVariantConfig/assets/PlaygroundVariantConfigHeader.tsx`	Adds `parentLabel` option (defaults to "Evaluator"); applied as `"Application"` in evaluator contexts to adjust UI copy and messages.
Evaluator trace reference construction `web/packages/agenta-playground/src/state/execution/executionRunner.ts`	Adds `buildEvaluatorSelfReferences` and merges evaluator self-references into non-root stage references.
Playwright test coverage `web/oss/tests/playwright/acceptance/evaluators/tests.ts`, `web/oss/tests/playwright/acceptance/evaluators/index.ts`	Exports new test constants and template name; rewrites and expands acceptance tests to exercise full-page playground flows and navigation gating.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Agenta-AI/agenta#4384: Related predecessor implementing earlier gating and follow-ups for evaluator full-page playground enablement.
Agenta-AI/agenta#4229: Related changes touching EvaluatorsRegistry row interactions and navigation behavior.
Agenta-AI/agenta#4265: Related observability/defaults updates used by this PR.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 60.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly identifies the main change: re-enabling the full-page playground feature for evaluator workflows, which is the primary objective addressed across all modified files.
Description check	✅ Passed	The description clearly addresses the PR objectives by explaining the regressions being fixed and the changes made to re-enable the full-page playground for evaluators.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fe-fix/app-workflow-router-unification-regression-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts (1)

165-178: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Persisted app selection can get stale on failed connect/disconnect edge paths.

persistedAppSelectionAtom is written before the primary-node swap succeeds, and disconnect exits early without clearing persisted state when no downstream node is found. That can rehydrate an app selection that is no longer actually connected.

Proposed fix

 export const connectAppToEvaluatorAtom = atom(
@@
-        // Persist across sessions. The picker display label is derived from
-        // the depth-0 node's `label` via `selectedAppLabelAtom`, so no extra
-        // write needed here.
-        set(persistedAppSelectionAtom, {appRevisionId, appLabel})
-
         // Replace primary node with app
         const nodeId = set(playgroundController.actions.changePrimaryNode, {
             type: "workflow",
             id: appRevisionId,
             label: appLabel,
         })
 
         if (!nodeId) return
+        // Persist only after graph mutation succeeds.
+        set(persistedAppSelectionAtom, {appRevisionId, appLabel})
@@
 export const disconnectAppFromEvaluatorAtom = atom(null, (get, set) => {
     const nodes = get(playgroundController.selectors.nodes())
     const downstreamEvaluator = nodes.find((n) => n.depth > 0)
-    if (!downstreamEvaluator) return
+    if (!downstreamEvaluator) {
+        set(persistedAppSelectionAtom, null)
+        return
+    }

Also applies to: 208-225

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ce60569f-f33c-480b-a472-4ceb822d0b1e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9012d and 048d662.

📒 Files selected for processing (25)

web/oss/src/components/Evaluators/components/ConfigureEvaluator/EvaluatorPlaygroundHeader.tsx
web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts
web/oss/src/components/Evaluators/components/ConfigureEvaluator/index.tsx
web/oss/src/components/Evaluators/index.tsx
web/oss/src/components/Filters/Filters.tsx
web/oss/src/components/Filters/types.d.ts
web/oss/src/components/Playground/Components/PlaygroundVariantConfig/assets/PlaygroundVariantConfigHeader.tsx
web/oss/src/components/PlaygroundRouter/index.tsx
web/oss/src/components/Sidebar/components/WorkflowEntityCard.tsx
web/oss/src/components/WorkflowRevisionDrawerWrapper/index.tsx
web/oss/src/components/pages/evaluations/NewEvaluation/Components/CreateEvaluatorDrawer/index.tsx
web/oss/src/components/pages/observability/assets/filters/fieldAdapter.ts
web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx
web/oss/src/state/newObservability/atoms/controls.ts
web/oss/src/state/newObservability/atoms/queries.ts
web/oss/src/state/workflow/flags.ts
web/oss/tests/playwright/acceptance/evaluators/index.ts
web/oss/tests/playwright/acceptance/evaluators/tests.ts
web/packages/agenta-entities/src/workflow/core/index.ts
web/packages/agenta-entities/src/workflow/core/schema.ts
web/packages/agenta-entities/src/workflow/core/traceTypeDefault.ts
web/packages/agenta-entities/src/workflow/index.ts
web/packages/agenta-entities/tests/unit/traceTypeDefault.test.ts
web/packages/agenta-entity-ui/src/selection/adapters/workflowRevisionRelationAdapter.ts
web/packages/agenta-playground/src/state/execution/executionRunner.ts

CodeRabbit flagged 5 issues on the evaluator-full-page rollout PR. This commit addresses each: 1. PlaygroundRouter — `is_feedback` evaluators skip the full-page swap. `workflowKind === "evaluator"` was too broad. Human/feedback evaluators are drawer-only in /evaluators (they capture human input, they don't run), so routing them to ConfigureEvaluatorPage produced a run-controls UI for a workflow with nothing to run. Added a `flags.is_feedback` exclusion next to the workflowKind check. 2. Sidebar — switcher filters out `is_feedback` evaluators. `nonArchivedEvaluatorsAtom` only filters by `deleted_at` and includes human evaluators; the switcher was exposing entries that, when clicked, would land on the (now-correctly-gated) generic <Playground /> for a feedback workflow. Filtered the list at the switcher boundary. 3. controls.ts — handle array-valued `trace_type` for in/not_in. The dialog dispatches `{operator: "in", value: ["annotation"]}` for the IN operator family, but the intent setter only normalized scalars — so the user's choice was silently dropped to `{kind: "cleared"}`. Normalize to an array, filter to enum values, and collapse single-value arrays back to a scalar. Multi-value selections (which mean "no filter" for a 2-value enum) still map to `cleared`. 4. Playwright — drop stale `[data-row-key]` poll in select-app-and-run. The test asserted post-create navigation to /apps/<id>/playground AFTER polling for the new row in the evaluators table — but the redirect wins first, the table disappears, and the poll became a timing-dependent failure. Removed the registry-side wait; evaluator-in-registry assertion is covered by the post-create-row-click test alongside. 5. ConfigureEvaluator/atoms.ts — fix persistedAppSelectionAtom race. `connectAppToEvaluatorAtom` persisted the app selection BEFORE `changePrimaryNode` ran, so a failed swap (returns `null` with no primary to swap from) left a stale localStorage record that the next mount re-hydrated into a phantom "connected" state. Moved the persist call to after both graph mutations succeed. `disconnectAppFromEvaluatorAtom` early-returned on no-downstream without clearing the persisted state, allowing the same phantom record to survive a disconnect attempt. Clear it on that branch too. No behavior change for the happy-path full-page flow — these all narrow edge cases the reviewer flagged.

coderabbitai · 2026-05-28T13:20:07Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T13:44:50Z

Actionable comments posted: 0

github-actions · 2026-05-28T13:50:11Z

Railway Preview Environment


Preview URL	https://gateway-production-b67c.up.railway.app/w
Project	`agenta-oss-pr-4474`
Image tag	`pr-4474-98073c5`
Status	Deployed
Railway logs	Open logs
Workflow logs	View workflow run
Updated at 2026-06-05T22:06:37.149Z

mmabrouk

Failed to build -> https://github.com/Agenta-AI/agenta_cloud/actions/runs/26751245314/job/78839925341

ardaerzin · 2026-06-01T12:11:30Z

Failed to build -> https://github.com/Agenta-AI/agenta_cloud/actions/runs/26751245314/job/78839925341

fail is not related to a frontend change tho

…ssion-fix Resolved conflicts: - web/oss/src/components/Filters/Filters.tsx — kept this branch's `displayedFilter` (reconciles filter rows for evaluator workflows) with main's `filterContainerClass` plain-class styling. - web/packages/agenta-playground/src/state/execution/executionRunner.ts — kept this branch's `stageReferences` builder which merges upstream app references with the evaluator's own self-references (via `buildEvaluatorSelfReferences`). Main's variant dropped references for evaluator stages, which was the regression PR #4474 is fixing — evaluator traces need `references.evaluator.slug` attached so they are searchable on the evaluator's /apps/<evalId>/traces page.

…ssion-fix Resolves a single conflict in `web/packages/agenta-entities/src/workflow/core/schema.ts` — release v0.100.4 added `artifact_slug` / `variant_slug` to the revision schema alongside the `workflow_slug` / `workflow_variant_slug` fields this branch had introduced for emitting evaluator references on playground chain runs. Both sides added `workflow_slug` and `workflow_variant_slug` with overlapping intent; resolution keeps all four fields and merges the two doc comments into one that covers both purposes (parent-workflow identification for ID-less callers + evaluator chain-trace emission). No source behavior change — schema is additive on both sides.

…ssion-fix

…cation-regression-fix

…ssion-fix Resolved conflicts: - web/oss/src/components/Filters/Filters.tsx — kept this branch's `displayedFilter` (reconciles filter rows for evaluator workflows) with main's `filterContainerClass` plain-class styling. - web/packages/agenta-playground/src/state/execution/executionRunner.ts — kept this branch's `stageReferences` builder which merges upstream app references with the evaluator's own self-references (via `buildEvaluatorSelfReferences`). Main's variant dropped references for evaluator stages, which was the regression PR #4474 is fixing — evaluator traces need `references.evaluator.slug` attached so they are searchable on the evaluator's /apps/<evalId>/traces page.

Issue: In the LLM-as-a-judge playground, switching the chained app from a chat application to a completion application kept sending `context` and `messages` from the previous app in the new request body. Root cause: At `executionRunner.ts` for depth=0 (the root entity), the runner spreads the entire row's data into `nodeInputs` (`{...data}`) and hands it to the stage handle as `inputValues`. The downstream filter in `resolveVariableValues` / `buildCompletionInputRow` correctly drops keys that aren't in the entity's input variables — when `variables` is non- empty. But when the entity's input ports haven't resolved yet (entity mid-hydration) or genuinely declares no input variables, that filter falls back to "spread every key from the row", which is exactly the window in which stale chat-shape keys (`messages`, `context`) leak into a completion request. Fix: Filter `data` at the runner against the entity's declared `inputSchema.properties` BEFORE building `nodeInputs`. This applies to both the first execution (line ~417) and the repetition retries (line ~689). When the entity has no resolvable input schema, the helper falls back to `{...data}` so workflows that genuinely depend on free-form input (e.g. `__rawBody` app workflows whose variables live in `__meta.variables`) keep working. The fix is safe for chat mode: chat strips `messages` separately at line 587 of `executionItems.ts` and rebuilds the conversation from `chatHistory` via `messageIdsAtomFamily(loadableId)` — independent of `inputValues`. Defense-in-depth: this complements the existing `resolveVariableValues` filter rather than replacing it.

The evaluator info notice in SingleLayout rendered with hardcoded light-mode colors (bg-blue-50, text-gray-700) and was unreadable against the dark UI. Add dark: variants to background, border, icon, body text, and dismiss button to match the existing dark:bg-blue-900/* pattern used elsewhere in the app.

Previous attempt used dark:text-gray-200 which conflicted with the themeAwareColors CSS-variable layer — the gray scale is role-inverted in dark mode, so dark:text-gray-200 resolved to a dark shade against the dark callout background. Switch overrides to the blue scale (not theme-flipped): dark:text-blue-50 for body text, dark:text-blue-300 for the icon, and dark:text-blue-200 for the dismiss button. All readable against dark:bg-blue-900/20.

The first #4525 fix only covered the depth=0 (root entity) path. In the LLM-as-a-judge evaluator playground the chained app sits at depth>0, where input construction goes through resolveChainInputs (spreads testcaseData on the no-mapping branch) or buildEvaluatorExecutionInputs (spreads testcaseData when the schema allows additionalProperties). Both paths re-leak the stale `messages` field from a previous chat app into the current target entity's request body. Add stripChatTransportForEntity — a targeted strip of known chat- transport keys (currently just `messages`) that runs unless the target entity's input schema explicitly declares them. Applied: - depth=0 path: as a defense-in-depth pass after the strict filterDataToEntityInputSchema, so the spread fallback (taken while the new app's schema is mid-hydration) can't leak the stale field either. - depth=0 repetition path: same. - depth>0 path: pre-filters `data` before chain / evaluator input construction. Uses a targeted strip (rather than the strict schema filter) so evaluators that legitimately depend on additionalProperties: true spread of testcase columns keep receiving them. The helper short-circuits to the input reference when no chat transport keys are present, so there's no allocation in the common path.

…chat keys Diagnostic telemetry for #4525 / AGE-3793 — three console.warn signals in executionRunner so we can tell which layer is actually rescuing the request body during a chat→completion swap: 1. filterDataToEntityInputSchema schema-not-resolved fallback — the strict allow-list can't run because workflowMolecule.selectors .ioSchemas returned no inputSchema.properties. Logs the entityId, the reason (no-properties vs properties-not-object), the data keys present, and whether `messages` is among them. 2. filterDataToEntityInputSchema empty-properties fallback — schema resolved but Object.keys(properties).length === 0. Same payload. 3. stripChatTransportForEntity strip — emits only when a chat-transport key was actually dropped, with which keys and whether the entity schema was resolved at the time of the strip. All three are warn-level so they're visible in production console without code changes, and gated to the unusual paths so the happy path stays quiet.

…4525) Move the stale-key fix from execution-time stripping to the layer where it belongs: the testcase row store, on swap of the primary entity. The testcaseMolecule is shared across loadables, so when the user swaps the chained app in the LLM-as-a-judge playground (anchor positional swap in setEntityIdsAtom), the same rows now carry every key the previous primary populated — `messages` from a prior chat app, completion variables from a prior completion app, etc. Reconciliation strategy (decided with the user): - Closed schema (additionalProperties: false): drop any row key not declared by the new entity's inputSchema.properties. Drops silently — no toast, no confirm modal. Matches what the user typed for the new app and nothing more. - Open schema (additionalProperties not set or true): only strip the CHAT_TRANSPORT_KEYS set (currently `messages`). Evaluator workflows that legitimately depend on additionalProperties spread keep receiving their extra testcase columns. - Schema not resolved: skip. The execution-time strip in executionRunner.ts is the fallback during this hydration window — it will be removed in a follow-up commit once the row-layer fix is verified end-to-end and a reactive deferred reconciliation handles the hydration race. Mutation goes through testcaseMolecule.actions.batchUpdate with stale keys set to `undefined` (the store's update reducer interprets that as a delete). Drafts are created per affected row. A console.warn is emitted in two cases: - schema-not-resolved on swap (so we can verify the hydration race surface area in practice). - one summary per swap that lists which keys were dropped per row and the schema mode (closed vs open).

…chema (#4525) Root cause of the `context` leak that survived the prior fixes: both the row prune and the runtime filter read the allow-list from `workflowMolecule.selectors.ioSchemas(entityId).inputSchema.properties`, which is EMPTY for completion apps. Completion apps express their variables as prompt template placeholders surfaced through `inputPorts`, not through the static input schema. So the filter degraded to its empty-properties fallback (keep everything) and only the hardcoded chat-transport strip removed `messages` — `context` (a real chat template var, stale on the row) sailed through. Diagnostic confirmation from the repro console: [executionRunner.filter] empty-properties fallback {entityId, dataKeys: ['messages','context','country'], hasMessagesKey: true} Fix: new shared helper `state/helpers/entityInputContract.ts` that resolves the allow-list the SAME way executionItems builds request `variables`: variablesFromInputPorts = inputPorts[].key variablesFromPayload = requestPayload.__meta.variables ?? requestPayload.variables ?? [] variables = inputPorts.length > 0 ? inputPorts : payload (+ `messages` when executionMode === 'chat') `reconcileRowDataForEntity` applies the policy: - app with resolved contract → strict allow-list (drops context+messages) - evaluator → chat-transport-only strip (preserves additionalProperties spread of extra testcase columns) - unresolved contract → chat-transport-only safety strip Both consumers now delegate to it: - playgroundController.pruneTestcaseRowsForEntity (swap-time, primary fix) - executionRunner.reconcileEntityInputData (exec-time hydration safety net) This collapses the three ad-hoc helpers (filterDataToEntityInputSchema, stripChatTransportForEntity, getEntityInputSchema) into one correct source-of-truth resolution and removes the now-misleading empty-properties / schema-not-resolved diagnostics.

…nputs (#4525) The prior fix stripped stale keys from the APP's request inputs, but the trace still showed `context` because it surfaces in the downstream EVALUATOR's {inputs, outputs} envelope. The evaluator reads the SAME shared testcase row, and the evaluator policy (chat-transport-only) intentionally preserves non-`messages` keys — so `context` survived there. The UI row also still showed it. Also: the swap-time prune in setEntityIdsAtom never fired for this flow — the evaluator playground selects the app via add/remove node actions in ConfigureEvaluator, not a setEntityIds positional swap (no [playgroundController.prune] log appeared in the repro). Fix: reconcile the shared testcase row against the ROOT entity's input contract in webWorkerIntegration, right before execution — path-agnostic, fires on every run regardless of how the app was selected. The cleaned row is: - passed to the runner (so app request AND evaluator envelope are clean), - written back via loadableController.actions.updateRow (so the UI and future runs reflect it; undefined values delete the keys). Evaluator-referenced columns are protected: collectDownstreamReferencedColumns gathers testcase columns named by downstream evaluator `<input>_key` settings (e.g. correct_answer_key → ground_truth) and passes them as protectedKeys, so a strict clean against the app contract never drops intentional evaluation inputs. reconcileRowDataForEntity gains an optional protectedKeys set; a key survives strict filtering when it's in the app allow-list OR protected. Emits [webWorker.reconcile] when keys are dropped, listing the strategy, dropped keys, and protected columns.

…4525) Two follow-ups now that the row-reconciliation fix is verified working: 1. Clean-on-swap: the evaluator playground selects an app via changePrimaryNode + connectDownstreamNode (ConfigureEvaluator), not a setEntityIds positional swap — so the previously-wired swap-time prune never fired there. Add playgroundController.actions.reconcileRowsToPrimary and call it from connectAppToEvaluatorAtom AFTER connectDownstreamNode, so the shared testcase row is cleaned the instant the app changes (not only at run time). Running after the downstream connect means the evaluator's referenced columns (correct_answer_key → ground_truth, etc.) are protected from the strict app-contract clean. pruneTestcaseRowsForEntity now: - collects downstream-evaluator protected columns, - returns a status ('acted' | 'noop' | 'unresolved'). reconcileRowsToPrimary handles the hydration race: if the new primary's inputPorts aren't resolved yet AND the entity isn't loaded, it subscribes to inputPorts and retries once, then unsubscribes. If the entity is loaded but genuinely has no variables, it doesn't subscribe (no dangling sub). The run-time reconciliation in webWorkerIntegration remains the backstop. 2. Remove diagnostic logs added while tracing the bug: [executionRunner.filter], [webWorker.reconcile], [playgroundController.prune]. The reconcile + writeback logic stays; only the console.warn telemetry is dropped.

Adds a 'Run on' control to the evaluator (LLM-as-a-judge) playground header so the first/empty state explains itself instead of leaving the user with two disconnected loaders. Three modes, each drawing its own data-flow: - Run directly on a test case (Data -> Evaluator -> Score) - Run on an app output (Data -> App -> Output -> Evaluator -> Score) - default - Run on a trace (Trace -> Evaluator -> Score) - disabled for now The mode is persisted per project; a connected app forces effective 'app' mode. In app mode with no app connected, the run panel hides the testcases and shows a centered 'Select an app' empty state (shared with the evaluator-creation drawer). All colors come from the antd theme token so it follows light/dark mode. Prompt playground is intentionally untouched.

feat(frontend): Run-on mode selector for the evaluator playground

… evals from switcher QA round 2 (2026-06-05): [A] LLM-as-judge chain runs could ship a `local-…` draft id as references.evaluator_revision.id, which the backend rejects as a non-UUID (422). buildEvaluatorSelfReferences now drops any id that isLocalDraftId() (workflow_id, variant_id, revision id); slugs/version are kept. NOTE: the critical drawer direct-invoke path (depth-0) is a separate reference builder owned by Mahmoud — same isLocalDraftId guard applies there. [C] The sidebar workflow switcher showed human (feedback) evaluators. The old filter `!w.flags?.is_feedback` ran on nonArchivedEvaluatorsAtom LIST records, which carry no `data.uri` and no `is_feedback`/`is_llm`/`is_code` flags — those live on the revision, not the parent artifact — so the filter never excluded anything. Switch to fullPagePlaygroundEvaluatorsAtom, the existing drop-in that resolves flags from each evaluator's latest revision and excludes human AND declarative-classifier evaluators (both route to an /apps/<id> destination the guard bounces back to /evaluators).

Temporary console.debug instrumentation to pin the QA bug where re-selecting the same app after disconnect connects nothing. Logs the connect-flow decision points: - handleAppSelect fired (+ whether it bails on null evaluatorNode) - connectApp entry (nodesBefore) - changePrimaryNode result (nodeId, nodesAfter) - connectDownstreamNode result (downstreamResult, nodesAfter) - connectApp done (finalNodes) To be reverted once root-caused. Filter the browser console by `[B-repro]` during repro.

QA round 2: in the evaluator playground, selecting an app → disconnecting → re-selecting the SAME app connected nothing in the UI (workflow selector + generation panel stayed on the 'Select an app' empty state). Root cause (pinned via runtime instrumentation): the node graph IS correct after reconnect — connectAppToEvaluatorAtom writes playgroundNodesAtom to [app, evaluator] and a follow-up read confirms 2 nodes in the single store. But on a disconnect→reconnect cycle jotai applies the two sequential playgroundNodesAtom writes (changePrimaryNode → connectDownstreamNode) WITHOUT notifying the mounted dependents, so selectedAppLabelAtom / hasAppConnectedAtom (and the package's generation-panel atoms) never recompute and the UI shows stale 'disconnected' state. First-connect and disconnect notify fine; only the reconnect drops the notification. Fix: after the graph mutations in connectAppToEvaluatorAtom, read the node-derived display atoms (selectedAppLabelAtom, hasAppConnectedAtom) to re-establish the dependency and flush the pending notification to their subscribers. Verified locally: reconnect now updates both the selector and the generation panel. Also removes the temporary [B-repro] diagnostics added while root-causing.

…QA critical) The critical QA bug: invoking an LLM-as-a-judge evaluator opened from the drawer 422'd because the request shipped references.evaluator_revision.id = "local-…" (an unsaved local-draft id), which the backend /invoke validator rejects as a non-UUID. buildEvaluatorSelfReferences (chain stage refs) was already guarded, but references can also arrive from the requestPayload builder and from trace-span extraction. Rather than chase each builder, add a single final sanitization at the one chokepoint where the request body's references are assembled (buildExecutionItem, after all sources are merged): drop any reference id that is a local-draft or placeholder id, keep slug/version, and drop a slot that ends up empty. Path-agnostic — covers the drawer direct-invoke, the chained evaluator playground, and any future reference source.

ardaerzin marked this pull request as ready for review May 28, 2026 11:20

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. Frontend labels May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

vercel Bot deployed to Preview May 28, 2026 13:16 View deployment

vercel Bot deployed to Preview May 28, 2026 13:41 View deployment

vercel Bot deployed to Preview May 29, 2026 09:02 View deployment

junaway changed the base branch from main to release/v0.100.7 May 29, 2026 12:46

vercel Bot deployed to Preview May 29, 2026 12:48 View deployment

ardaerzin self-assigned this May 31, 2026

ardaerzin requested review from ashrafchowdury and mmabrouk May 31, 2026 18:11

vercel Bot deployed to Preview May 31, 2026 18:12 View deployment

mmabrouk marked this pull request as draft June 1, 2026 07:08

mmabrouk marked this pull request as ready for review June 1, 2026 07:09

dosubot Bot added Improvement tests labels Jun 1, 2026

ardaerzin changed the title ~~fix(frontend): re-enable full-page playground for evaluator workflows~~ [FE Fix]: Re-enable full-page playground for evaluator workflows Jun 1, 2026

ashrafchowdury approved these changes Jun 1, 2026

View reviewed changes

Comment thread web/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsx

mmabrouk requested changes Jun 1, 2026

View reviewed changes

mmabrouk changed the base branch from release/v0.100.7 to main June 2, 2026 13:22

vercel Bot deployed to Preview June 2, 2026 13:36 View deployment

ardaerzin and others added 14 commits June 4, 2026 18:45

Merge branch 'main' into fe-fix/app-workflow-router-unification-regre…

d3bf2fe

…ssion-fix

Merge branch 'release/v0.100.7' into fe-fix/app-workflow-router-unifi…

1ac6771

…cation-regression-fix

Merge branch 'release/v0.100.7' into fe-fix/app-workflow-router-unifi…

12be989

…cation-regression-fix

mmabrouk force-pushed the fe-fix/app-workflow-router-unification-regression-fix branch from 317d6e6 to de548da Compare June 4, 2026 18:02

vercel Bot deployed to Preview June 4, 2026 18:06 View deployment

mmabrouk mentioned this pull request Jun 5, 2026

feat(frontend): Run-on mode selector for the evaluator playground #4553

Merged

mmabrouk marked this pull request as ready for review June 5, 2026 09:28

dosubot Bot added the UI label Jun 5, 2026

Merge pull request #4553 from Agenta-AI/fe-feat/evaluator-run-on-mode

72c020f

feat(frontend): Run-on mode selector for the evaluator playground

vercel Bot deployed to Preview June 5, 2026 11:43 View deployment

vercel Bot deployed to Preview June 5, 2026 18:08 View deployment

vercel Bot deployed to Preview June 5, 2026 18:16 View deployment

vercel Bot deployed to Preview June 5, 2026 21:20 View deployment

vercel Bot deployed to Preview June 5, 2026 21:57 View deployment

Conversation

ardaerzin commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Playground

Observability

Other

QA follow-up

Demo

Checklist

Contributor Resources

Uh oh!

vercel Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Railway Preview Environment

Uh oh!

Uh oh!

mmabrouk left a comment

Choose a reason for hiding this comment

Uh oh!

ardaerzin commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ardaerzin commented May 28, 2026 •

edited

Loading

vercel Bot commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading