[FE Fix]: Re-enable full-page playground for evaluator workflows#4474
[FE Fix]: Re-enable full-page playground for evaluator workflows#4474ardaerzin wants to merge 22 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR enables Phase 5 evaluator full-page playground navigation, adds app connect/disconnect flows derived from the playground node graph, and refactors observability trace_type defaults, persisted filter composition, and UI reconciliation to be workflow-aware. ChangesEvaluator Full-Page Navigation & Observability Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.ts (1)
165-178:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPersisted app selection can get stale on failed connect/disconnect edge paths.
persistedAppSelectionAtomis written before the primary-node swap succeeds, and disconnect exits early without clearing persisted state when no downstream node is found. That can rehydrate an app selection that is no longer actually connected.Proposed fix
export const connectAppToEvaluatorAtom = atom( @@ - // Persist across sessions. The picker display label is derived from - // the depth-0 node's `label` via `selectedAppLabelAtom`, so no extra - // write needed here. - set(persistedAppSelectionAtom, {appRevisionId, appLabel}) - // Replace primary node with app const nodeId = set(playgroundController.actions.changePrimaryNode, { type: "workflow", id: appRevisionId, label: appLabel, }) if (!nodeId) return + // Persist only after graph mutation succeeds. + set(persistedAppSelectionAtom, {appRevisionId, appLabel}) @@ export const disconnectAppFromEvaluatorAtom = atom(null, (get, set) => { const nodes = get(playgroundController.selectors.nodes()) const downstreamEvaluator = nodes.find((n) => n.depth > 0) - if (!downstreamEvaluator) return + if (!downstreamEvaluator) { + set(persistedAppSelectionAtom, null) + return + }Also applies to: 208-225
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: ce60569f-f33c-480b-a472-4ceb822d0b1e
📒 Files selected for processing (25)
web/oss/src/components/Evaluators/components/ConfigureEvaluator/EvaluatorPlaygroundHeader.tsxweb/oss/src/components/Evaluators/components/ConfigureEvaluator/atoms.tsweb/oss/src/components/Evaluators/components/ConfigureEvaluator/index.tsxweb/oss/src/components/Evaluators/index.tsxweb/oss/src/components/Filters/Filters.tsxweb/oss/src/components/Filters/types.d.tsweb/oss/src/components/Playground/Components/PlaygroundVariantConfig/assets/PlaygroundVariantConfigHeader.tsxweb/oss/src/components/PlaygroundRouter/index.tsxweb/oss/src/components/Sidebar/components/WorkflowEntityCard.tsxweb/oss/src/components/WorkflowRevisionDrawerWrapper/index.tsxweb/oss/src/components/pages/evaluations/NewEvaluation/Components/CreateEvaluatorDrawer/index.tsxweb/oss/src/components/pages/observability/assets/filters/fieldAdapter.tsweb/oss/src/components/pages/observability/components/ObservabilityHeader/index.tsxweb/oss/src/state/newObservability/atoms/controls.tsweb/oss/src/state/newObservability/atoms/queries.tsweb/oss/src/state/workflow/flags.tsweb/oss/tests/playwright/acceptance/evaluators/index.tsweb/oss/tests/playwright/acceptance/evaluators/tests.tsweb/packages/agenta-entities/src/workflow/core/index.tsweb/packages/agenta-entities/src/workflow/core/schema.tsweb/packages/agenta-entities/src/workflow/core/traceTypeDefault.tsweb/packages/agenta-entities/src/workflow/index.tsweb/packages/agenta-entities/tests/unit/traceTypeDefault.test.tsweb/packages/agenta-entity-ui/src/selection/adapters/workflowRevisionRelationAdapter.tsweb/packages/agenta-playground/src/state/execution/executionRunner.ts
CodeRabbit flagged 5 issues on the evaluator-full-page rollout PR.
This commit addresses each:
1. PlaygroundRouter — `is_feedback` evaluators skip the full-page swap.
`workflowKind === "evaluator"` was too broad. Human/feedback
evaluators are drawer-only in /evaluators (they capture human input,
they don't run), so routing them to ConfigureEvaluatorPage produced
a run-controls UI for a workflow with nothing to run. Added a
`flags.is_feedback` exclusion next to the workflowKind check.
2. Sidebar — switcher filters out `is_feedback` evaluators.
`nonArchivedEvaluatorsAtom` only filters by `deleted_at` and
includes human evaluators; the switcher was exposing entries that,
when clicked, would land on the (now-correctly-gated) generic
<Playground /> for a feedback workflow. Filtered the list at the
switcher boundary.
3. controls.ts — handle array-valued `trace_type` for in/not_in.
The dialog dispatches `{operator: "in", value: ["annotation"]}` for
the IN operator family, but the intent setter only normalized
scalars — so the user's choice was silently dropped to
`{kind: "cleared"}`. Normalize to an array, filter to enum values,
and collapse single-value arrays back to a scalar. Multi-value
selections (which mean "no filter" for a 2-value enum) still map
to `cleared`.
4. Playwright — drop stale `[data-row-key]` poll in select-app-and-run.
The test asserted post-create navigation to /apps/<id>/playground
AFTER polling for the new row in the evaluators table — but the
redirect wins first, the table disappears, and the poll became a
timing-dependent failure. Removed the registry-side wait;
evaluator-in-registry assertion is covered by the
post-create-row-click test alongside.
5. ConfigureEvaluator/atoms.ts — fix persistedAppSelectionAtom race.
`connectAppToEvaluatorAtom` persisted the app selection BEFORE
`changePrimaryNode` ran, so a failed swap (returns `null` with no
primary to swap from) left a stale localStorage record that the
next mount re-hydrated into a phantom "connected" state. Moved the
persist call to after both graph mutations succeed.
`disconnectAppFromEvaluatorAtom` early-returned on no-downstream
without clearing the persisted state, allowing the same phantom
record to survive a disconnect attempt. Clear it on that branch
too.
No behavior change for the happy-path full-page flow — these all
narrow edge cases the reviewer flagged.
|
Actionable comments posted: 0 |
|
Actionable comments posted: 0 |
Railway Preview Environment
|
fail is not related to a frontend change tho |
…ssion-fix Resolved conflicts: - web/oss/src/components/Filters/Filters.tsx — kept this branch's `displayedFilter` (reconciles filter rows for evaluator workflows) with main's `filterContainerClass` plain-class styling. - web/packages/agenta-playground/src/state/execution/executionRunner.ts — kept this branch's `stageReferences` builder which merges upstream app references with the evaluator's own self-references (via `buildEvaluatorSelfReferences`). Main's variant dropped references for evaluator stages, which was the regression PR #4474 is fixing — evaluator traces need `references.evaluator.slug` attached so they are searchable on the evaluator's /apps/<evalId>/traces page.
…ssion-fix Resolves a single conflict in `web/packages/agenta-entities/src/workflow/core/schema.ts` — release v0.100.4 added `artifact_slug` / `variant_slug` to the revision schema alongside the `workflow_slug` / `workflow_variant_slug` fields this branch had introduced for emitting evaluator references on playground chain runs. Both sides added `workflow_slug` and `workflow_variant_slug` with overlapping intent; resolution keeps all four fields and merges the two doc comments into one that covers both purposes (parent-workflow identification for ID-less callers + evaluator chain-trace emission). No source behavior change — schema is additive on both sides.
…cation-regression-fix
…cation-regression-fix
…ssion-fix Resolved conflicts: - web/oss/src/components/Filters/Filters.tsx — kept this branch's `displayedFilter` (reconciles filter rows for evaluator workflows) with main's `filterContainerClass` plain-class styling. - web/packages/agenta-playground/src/state/execution/executionRunner.ts — kept this branch's `stageReferences` builder which merges upstream app references with the evaluator's own self-references (via `buildEvaluatorSelfReferences`). Main's variant dropped references for evaluator stages, which was the regression PR #4474 is fixing — evaluator traces need `references.evaluator.slug` attached so they are searchable on the evaluator's /apps/<evalId>/traces page.
Issue: In the LLM-as-a-judge playground, switching the chained app from a
chat application to a completion application kept sending `context` and
`messages` from the previous app in the new request body.
Root cause: At `executionRunner.ts` for depth=0 (the root entity), the
runner spreads the entire row's data into `nodeInputs` (`{...data}`) and
hands it to the stage handle as `inputValues`. The downstream filter in
`resolveVariableValues` / `buildCompletionInputRow` correctly drops keys
that aren't in the entity's input variables — when `variables` is non-
empty. But when the entity's input ports haven't resolved yet (entity
mid-hydration) or genuinely declares no input variables, that filter
falls back to "spread every key from the row", which is exactly the
window in which stale chat-shape keys (`messages`, `context`) leak into
a completion request.
Fix: Filter `data` at the runner against the entity's declared
`inputSchema.properties` BEFORE building `nodeInputs`. This applies to
both the first execution (line ~417) and the repetition retries (line
~689). When the entity has no resolvable input schema, the helper falls
back to `{...data}` so workflows that genuinely depend on free-form
input (e.g. `__rawBody` app workflows whose variables live in
`__meta.variables`) keep working.
The fix is safe for chat mode: chat strips `messages` separately at line
587 of `executionItems.ts` and rebuilds the conversation from
`chatHistory` via `messageIdsAtomFamily(loadableId)` — independent of
`inputValues`.
Defense-in-depth: this complements the existing
`resolveVariableValues` filter rather than replacing it.
The evaluator info notice in SingleLayout rendered with hardcoded light-mode colors (bg-blue-50, text-gray-700) and was unreadable against the dark UI. Add dark: variants to background, border, icon, body text, and dismiss button to match the existing dark:bg-blue-900/* pattern used elsewhere in the app.
Previous attempt used dark:text-gray-200 which conflicted with the themeAwareColors CSS-variable layer — the gray scale is role-inverted in dark mode, so dark:text-gray-200 resolved to a dark shade against the dark callout background. Switch overrides to the blue scale (not theme-flipped): dark:text-blue-50 for body text, dark:text-blue-300 for the icon, and dark:text-blue-200 for the dismiss button. All readable against dark:bg-blue-900/20.
The first #4525 fix only covered the depth=0 (root entity) path. In the LLM-as-a-judge evaluator playground the chained app sits at depth>0, where input construction goes through resolveChainInputs (spreads testcaseData on the no-mapping branch) or buildEvaluatorExecutionInputs (spreads testcaseData when the schema allows additionalProperties). Both paths re-leak the stale `messages` field from a previous chat app into the current target entity's request body. Add stripChatTransportForEntity — a targeted strip of known chat- transport keys (currently just `messages`) that runs unless the target entity's input schema explicitly declares them. Applied: - depth=0 path: as a defense-in-depth pass after the strict filterDataToEntityInputSchema, so the spread fallback (taken while the new app's schema is mid-hydration) can't leak the stale field either. - depth=0 repetition path: same. - depth>0 path: pre-filters `data` before chain / evaluator input construction. Uses a targeted strip (rather than the strict schema filter) so evaluators that legitimately depend on additionalProperties: true spread of testcase columns keep receiving them. The helper short-circuits to the input reference when no chat transport keys are present, so there's no allocation in the common path.
…chat keys Diagnostic telemetry for #4525 / AGE-3793 — three console.warn signals in executionRunner so we can tell which layer is actually rescuing the request body during a chat→completion swap: 1. filterDataToEntityInputSchema schema-not-resolved fallback — the strict allow-list can't run because workflowMolecule.selectors .ioSchemas returned no inputSchema.properties. Logs the entityId, the reason (no-properties vs properties-not-object), the data keys present, and whether `messages` is among them. 2. filterDataToEntityInputSchema empty-properties fallback — schema resolved but Object.keys(properties).length === 0. Same payload. 3. stripChatTransportForEntity strip — emits only when a chat-transport key was actually dropped, with which keys and whether the entity schema was resolved at the time of the strip. All three are warn-level so they're visible in production console without code changes, and gated to the unusual paths so the happy path stays quiet.
…4525) Move the stale-key fix from execution-time stripping to the layer where it belongs: the testcase row store, on swap of the primary entity. The testcaseMolecule is shared across loadables, so when the user swaps the chained app in the LLM-as-a-judge playground (anchor positional swap in setEntityIdsAtom), the same rows now carry every key the previous primary populated — `messages` from a prior chat app, completion variables from a prior completion app, etc. Reconciliation strategy (decided with the user): - Closed schema (additionalProperties: false): drop any row key not declared by the new entity's inputSchema.properties. Drops silently — no toast, no confirm modal. Matches what the user typed for the new app and nothing more. - Open schema (additionalProperties not set or true): only strip the CHAT_TRANSPORT_KEYS set (currently `messages`). Evaluator workflows that legitimately depend on additionalProperties spread keep receiving their extra testcase columns. - Schema not resolved: skip. The execution-time strip in executionRunner.ts is the fallback during this hydration window — it will be removed in a follow-up commit once the row-layer fix is verified end-to-end and a reactive deferred reconciliation handles the hydration race. Mutation goes through testcaseMolecule.actions.batchUpdate with stale keys set to `undefined` (the store's update reducer interprets that as a delete). Drafts are created per affected row. A console.warn is emitted in two cases: - schema-not-resolved on swap (so we can verify the hydration race surface area in practice). - one summary per swap that lists which keys were dropped per row and the schema mode (closed vs open).
…chema (#4525) Root cause of the `context` leak that survived the prior fixes: both the row prune and the runtime filter read the allow-list from `workflowMolecule.selectors.ioSchemas(entityId).inputSchema.properties`, which is EMPTY for completion apps. Completion apps express their variables as prompt template placeholders surfaced through `inputPorts`, not through the static input schema. So the filter degraded to its empty-properties fallback (keep everything) and only the hardcoded chat-transport strip removed `messages` — `context` (a real chat template var, stale on the row) sailed through. Diagnostic confirmation from the repro console: [executionRunner.filter] empty-properties fallback {entityId, dataKeys: ['messages','context','country'], hasMessagesKey: true} Fix: new shared helper `state/helpers/entityInputContract.ts` that resolves the allow-list the SAME way executionItems builds request `variables`: variablesFromInputPorts = inputPorts[].key variablesFromPayload = requestPayload.__meta.variables ?? requestPayload.variables ?? [] variables = inputPorts.length > 0 ? inputPorts : payload (+ `messages` when executionMode === 'chat') `reconcileRowDataForEntity` applies the policy: - app with resolved contract → strict allow-list (drops context+messages) - evaluator → chat-transport-only strip (preserves additionalProperties spread of extra testcase columns) - unresolved contract → chat-transport-only safety strip Both consumers now delegate to it: - playgroundController.pruneTestcaseRowsForEntity (swap-time, primary fix) - executionRunner.reconcileEntityInputData (exec-time hydration safety net) This collapses the three ad-hoc helpers (filterDataToEntityInputSchema, stripChatTransportForEntity, getEntityInputSchema) into one correct source-of-truth resolution and removes the now-misleading empty-properties / schema-not-resolved diagnostics.
…nputs (#4525) The prior fix stripped stale keys from the APP's request inputs, but the trace still showed `context` because it surfaces in the downstream EVALUATOR's {inputs, outputs} envelope. The evaluator reads the SAME shared testcase row, and the evaluator policy (chat-transport-only) intentionally preserves non-`messages` keys — so `context` survived there. The UI row also still showed it. Also: the swap-time prune in setEntityIdsAtom never fired for this flow — the evaluator playground selects the app via add/remove node actions in ConfigureEvaluator, not a setEntityIds positional swap (no [playgroundController.prune] log appeared in the repro). Fix: reconcile the shared testcase row against the ROOT entity's input contract in webWorkerIntegration, right before execution — path-agnostic, fires on every run regardless of how the app was selected. The cleaned row is: - passed to the runner (so app request AND evaluator envelope are clean), - written back via loadableController.actions.updateRow (so the UI and future runs reflect it; undefined values delete the keys). Evaluator-referenced columns are protected: collectDownstreamReferencedColumns gathers testcase columns named by downstream evaluator `<input>_key` settings (e.g. correct_answer_key → ground_truth) and passes them as protectedKeys, so a strict clean against the app contract never drops intentional evaluation inputs. reconcileRowDataForEntity gains an optional protectedKeys set; a key survives strict filtering when it's in the app allow-list OR protected. Emits [webWorker.reconcile] when keys are dropped, listing the strategy, dropped keys, and protected columns.
…4525) Two follow-ups now that the row-reconciliation fix is verified working: 1. Clean-on-swap: the evaluator playground selects an app via changePrimaryNode + connectDownstreamNode (ConfigureEvaluator), not a setEntityIds positional swap — so the previously-wired swap-time prune never fired there. Add playgroundController.actions.reconcileRowsToPrimary and call it from connectAppToEvaluatorAtom AFTER connectDownstreamNode, so the shared testcase row is cleaned the instant the app changes (not only at run time). Running after the downstream connect means the evaluator's referenced columns (correct_answer_key → ground_truth, etc.) are protected from the strict app-contract clean. pruneTestcaseRowsForEntity now: - collects downstream-evaluator protected columns, - returns a status ('acted' | 'noop' | 'unresolved'). reconcileRowsToPrimary handles the hydration race: if the new primary's inputPorts aren't resolved yet AND the entity isn't loaded, it subscribes to inputPorts and retries once, then unsubscribes. If the entity is loaded but genuinely has no variables, it doesn't subscribe (no dangling sub). The run-time reconciliation in webWorkerIntegration remains the backstop. 2. Remove diagnostic logs added while tracing the bug: [executionRunner.filter], [webWorker.reconcile], [playgroundController.prune]. The reconcile + writeback logic stays; only the console.warn telemetry is dropped.
317d6e6 to
de548da
Compare
Adds a 'Run on' control to the evaluator (LLM-as-a-judge) playground header so the first/empty state explains itself instead of leaving the user with two disconnected loaders. Three modes, each drawing its own data-flow: - Run directly on a test case (Data -> Evaluator -> Score) - Run on an app output (Data -> App -> Output -> Evaluator -> Score) - default - Run on a trace (Trace -> Evaluator -> Score) - disabled for now The mode is persisted per project; a connected app forces effective 'app' mode. In app mode with no app connected, the run panel hides the testcases and shows a centered 'Select an app' empty state (shared with the evaluator-creation drawer). All colors come from the antd theme token so it follows light/dark mode. Prompt playground is intentionally untouched.
feat(frontend): Run-on mode selector for the evaluator playground
… evals from switcher QA round 2 (2026-06-05): [A] LLM-as-judge chain runs could ship a `local-…` draft id as references.evaluator_revision.id, which the backend rejects as a non-UUID (422). buildEvaluatorSelfReferences now drops any id that isLocalDraftId() (workflow_id, variant_id, revision id); slugs/version are kept. NOTE: the critical drawer direct-invoke path (depth-0) is a separate reference builder owned by Mahmoud — same isLocalDraftId guard applies there. [C] The sidebar workflow switcher showed human (feedback) evaluators. The old filter `!w.flags?.is_feedback` ran on nonArchivedEvaluatorsAtom LIST records, which carry no `data.uri` and no `is_feedback`/`is_llm`/`is_code` flags — those live on the revision, not the parent artifact — so the filter never excluded anything. Switch to fullPagePlaygroundEvaluatorsAtom, the existing drop-in that resolves flags from each evaluator's latest revision and excludes human AND declarative-classifier evaluators (both route to an /apps/<id> destination the guard bounces back to /evaluators).
Temporary console.debug instrumentation to pin the QA bug where re-selecting the same app after disconnect connects nothing. Logs the connect-flow decision points: - handleAppSelect fired (+ whether it bails on null evaluatorNode) - connectApp entry (nodesBefore) - changePrimaryNode result (nodeId, nodesAfter) - connectDownstreamNode result (downstreamResult, nodesAfter) - connectApp done (finalNodes) To be reverted once root-caused. Filter the browser console by `[B-repro]` during repro.
QA round 2: in the evaluator playground, selecting an app → disconnecting → re-selecting the SAME app connected nothing in the UI (workflow selector + generation panel stayed on the 'Select an app' empty state). Root cause (pinned via runtime instrumentation): the node graph IS correct after reconnect — connectAppToEvaluatorAtom writes playgroundNodesAtom to [app, evaluator] and a follow-up read confirms 2 nodes in the single store. But on a disconnect→reconnect cycle jotai applies the two sequential playgroundNodesAtom writes (changePrimaryNode → connectDownstreamNode) WITHOUT notifying the mounted dependents, so selectedAppLabelAtom / hasAppConnectedAtom (and the package's generation-panel atoms) never recompute and the UI shows stale 'disconnected' state. First-connect and disconnect notify fine; only the reconnect drops the notification. Fix: after the graph mutations in connectAppToEvaluatorAtom, read the node-derived display atoms (selectedAppLabelAtom, hasAppConnectedAtom) to re-establish the dependency and flush the pending notification to their subscribers. Verified locally: reconnect now updates both the selector and the generation panel. Also removes the temporary [B-repro] diagnostics added while root-causing.
…QA critical) The critical QA bug: invoking an LLM-as-a-judge evaluator opened from the drawer 422'd because the request shipped references.evaluator_revision.id = "local-…" (an unsaved local-draft id), which the backend /invoke validator rejects as a non-UUID. buildEvaluatorSelfReferences (chain stage refs) was already guarded, but references can also arrive from the requestPayload builder and from trace-span extraction. Rather than chase each builder, add a single final sanitization at the one chokepoint where the request body's references are assembled (buildExecutionItem, after all sources are merged): drop any reference id that is a local-draft or placeholder id, keep slug/version, and drop a slot that ends up empty. Path-agnostic — covers the drawer direct-invoke, the chained evaluator playground, and any future reference source.
Summary
PR #4384 disabled EVALUATOR_FULL_PAGE_NAV_ENABLED because the app-style playground was a regression for evaluators (lost the upstream-app connection) and app-scoped observability defaulted to "invocation" instead of "annotation" for evaluator workflows. This change addresses both blockers and re-enables the flow by default.
Playground
Observability
Other
closes #4525
QA follow-up
Demo
Checklist
Contributor Resources