fix(telemetry): Telemetry command-run instrumentation pass (#1446, #1447, #1448, #1439)#71
Draft
aidandaly24 wants to merge 1 commit into
Draft
fix(telemetry): Telemetry command-run instrumentation pass (#1446, #1447, #1448, #1439)#71aidandaly24 wants to merge 1 commit into
aidandaly24 wants to merge 1 commit into
Conversation
…39) + command_run instrumentation for config (1446), run-eval/pause/resume online-eval+online-insights/traces (1447), and the fetch-access TUI path (1448). fetch-access CLI and run.job/ab-test were already instrumented at HEAD.
Coverage Report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refs aws#1446
Refs aws#1447
Refs aws#1448
Refs aws#1439
Issues
agentcore configcommand (get/set/list of global CLI config) emits no telemetry, so AWS is blind to its failures and usage. This is an internal observability gap, not a user-facing functional bug -- the command works correctly for end users; only failure visibility is missing.fetch accesscommand works correctly; what is missing is Amazon-side usage telemetry —cli.command_runis never recorded when users fetch access info for a gateway or agent, so the team has no analytics on this command's adoption or failure rate. This is a labeled enhancement/telemetry gap, not a defect.Root cause
verified above
Feature request to instrument eval/observability commands. Telemetry is only emitted via withCommandRunTelemetry/runCliCommand -> recordCommandRun -> client.emit('cli.command_run',...) (src/cli/telemetry/cli-command-run.ts:41). COMMAND_SCHEMAS already defines shapes (src/cli/telemetry/schemas/command-run.ts: 'run.eval':245, 'pause.online-eval'/'resume.online-eval':258-259, 'pause.online-insights'/'resume.online-insights':260-261, 'traces.list'/'traces.get':262-263), but handlers were never wired: run/command.tsx:170 calls handleRunEval directly; TUI RunEvalFlow.tsx:149 unwrapped; pause/command.tsx registerOnlineEvalSubcommand(l.50)/registerOnlineInsightsSubcommand(l.141) call handlePauseResume directly (only registerABTestSubcommand l.85 records); traces/command.tsx:35,107 call handlers directly. logs/command.tsx:35,64 already wrap, so 'logs' is done. No commit/PR references aws#1447.
Telemetry is opt-in per command; the fetch.access schema (command-run.ts:173,255) was scaffolded but neither emit site (CLI command.tsx action lines 21-95, TUI useFetchAccessFlow.ts) was ever wired, so cli.command_run is never recorded for fetch. Verified: no telemetry import/call in either file, no global Commander hook in cli.ts, and 'fetch.access' is never passed to any emit/record/wrapper outside the schema definition.
process.exit in CLI command actions bypasses main finally that runs shutdown.
The fix
Add a
configentry to COMMAND_SCHEMAS in src/cli/telemetry/schemas/command-run.ts -- a small ConfigAttrs with aconfig_actionenum ('get'|'list'|'set') derived from key/value presence (do NOT log key or value: PII/secret risk). Then wrap the action in src/cli/commands/config/command.ts:26-30 asconst result = await withCommandRunTelemetry('config', attrs, () => resolveAction(key,value)());keeping the existing printResult()/process.exit(1). No design decision is actually needed: withCommandRunTelemetry already handles {success:false} Results natively (cli-command-run.ts:107-118 routes !result.success through classifyError(result.error) to record exit_reason='failure'), and the config handlers in actions.ts all return ConfigResult unions and never throw -- so the failure path is captured automatically. Mirror the status command which uses this exact pattern (status/command.tsx:95-98). This satisfies both DoD items (schema + instrumentation).Mechanical wiring; schemas already exist. (1) run/command.tsx run-eval action (~l.170): wrap handleRunEval in withCommandRunTelemetry('run.eval', {evaluator_count, ref_type, has_assertions, has_expected_trajectory, has_expected_response}) derived from options; mirror in RunEvalFlow.tsx:149 for the TUI path. (2) pause/command.tsx + resume/command.tsx: wrap handlePauseResume in registerOnlineEvalSubcommand/registerOnlineInsightsSubcommand with withCommandRunTelemetry('pause.online-eval'|'resume.online-eval'|'pause.online-insights'|'resume.online-insights', {ref_type}) — both files share these factory functions, so one edit covers pause+resume. (3) traces/command.tsx list/get (~l.35, l.107): wrap handleTracesList/handleTracesGet in withCommandRunTelemetry('traces.list'|'traces.get', {}), optionally upgrade NoAttrs (command-run.ts:262-263) to a small shape (has_runtime/has_since/limit). 'logs' is already done. Design decision: confirm with the author whether 'logs' refers to logs/logs.evals (already instrumented) and whether traces warrants richer attrs than NoAttrs.
Instrument both fetch entry points with the existing helpers, mirroring validate (src/cli/commands/validate/command.tsx:14 uses withCommandRunTelemetry('validate', {}, ...)). (1) CLI: in src/cli/commands/fetch/command.tsx wrap the handleFetchAccess flow in runCliCommand('fetch.access', !!options.json, async () => { ...; return { resource_type: standardize(ResourceType, options.type ?? 'gateway') }; }) — runCliCommand fits because the action owns its own process.exit calls (lines 34, 63, 69); note the action's current early-return-without-exit on JSON success (lines 68-69) means the body needs minor restructuring to return attrs through the wrapper rather than falling through. (2) TUI: in src/cli/tui/screens/fetch-access/useFetchAccessFlow.ts wrap the fetch operation in withCommandRunTelemetry('fetch.access', { resource_type: standardize(ResourceType, resource.resourceType) }, ...) at the fetch call site (~lines 106-133). standardize is exported at common-shapes.ts:20. No schema change needed — fetch.access/FetchAccessAttrs already exist.
Move audit message into flush and add shared finalize before process.exit.
Files touched: src/cli/commands/config/command.ts (action at lines 26-30, wrap resolveAction in withCommandRunTelemetry); src/cli/telemetry/schemas/command-run.ts (add a ConfigAttrs near line 209 and a
config:key into COMMAND_SCHEMAS ~215-302); src/cli/telemetry/schemas/common-shapes.ts (add a ConfigAction enum used by ConfigAttrs)src/cli/commands/run/command.tsx (run-eval action ~line 170); src/cli/tui/screens/run-eval/RunEvalFlow.tsx (~line 149); src/cli/commands/pause/command.tsx (registerOnlineEvalSubcommand ~line 50, registerOnlineInsightsSubcommand ~line 141 — shared with resume via the same factory functions); src/cli/commands/resume/command.tsx (registerResume reuses those factories); src/cli/commands/traces/command.tsx (list ~line 35, get ~line 107); optional attr-shape upgrade in src/cli/telemetry/schemas/command-run.ts lines 262-263. All schema slots already exist.
src/cli/commands/fetch/command.tsx (the .action handler, lines 21-95 — add runCliCommand wrapper, restructure to return resource_type attrs); src/cli/tui/screens/fetch-access/useFetchAccessFlow.ts (add withCommandRunTelemetry wrapper at the fetch-operation call site, ~lines 106-133). No change needed in src/cli/telemetry/schemas/command-run.ts — fetch.access/FetchAccessAttrs already exist (lines 173, 255).
src/cli/telemetry/sinks/filesystem-sink.ts and src/cli/telemetry/cli-command-run.ts
Validation evidence
The fix was verified by reproducing the original symptom and re-running after the change:
(1) Finalize ordering / dropped tail output, on the runCliCommand path
agentcore feedback "test feedback acfix-1446-1447" --json: ORIGINAL build wrote the cli.command_run event into the audit .jsonl (attrs: command=feedback, exit_reason=failure) but printed NO[audit mode] Telemetry written to ...line to stdout, because process.exit(1) in runCliCommand fired before TelemetryClientAccessor.shutdown(). FIXED build records the SAME event AND prints[audit mode] Telemetry written to /tmp/audit-fix-fb/.agentcore/telemetry/feedback-...jsonlbefore exiting; exit code preserved. The shutdown+notices now run inside finalizeAndExit() which executes prior to process.exit (cli.ts registers postCommandFinalize; runCliCommand/config/fetch/pause/run/traces all return finalizeAndExit(code)).(2) Coverage: ORIGINAL
agentcore config telemetry.enabled falsecreated NO telemetry dir at all (zero cli.command_run). FIXED records exactly one event per invocation: config (no key) -> config_action=list, exit_reason=success; config telemetry.enabled (unset get) -> config_action=get, exit_reason=failure; config telemetry.enabled false -> config_action=set, exit_reason=success. Inspected attrs: only the derived config_action is recorded, NO key/value PII (PII keys present: []). Success/failurTest suite: green.
Staged on the fork as a draft for human review. Promote to aws/agentcore-cli after vetting.