fix(agent): observe and unblock silent Slack inbox relay drops#2772
fix(agent): observe and unblock silent Slack inbox relay drops#2772posthog[bot] wants to merge 2 commits into
Conversation
Inbox/Slack notifications are posted server-side off the agent relay path, so a silent drop reads to the user as an empty inbox rather than a broken pipe (the reported symptom). This addresses the second, unidentified cause of dropped notifications that #2765 did not fully fix. - Emit stably-named attempt/delivered/dropped telemetry on `relayAgentResponse`, replacing the debug-only logs that swallowed every failure. Drops (relay request failure, no agent message) are now warn-level so they surface in agent logs instead of only as user reports. - Widen `PR_CREATION_RECENCY_MS` from 15 minutes to 24 hours. Notification delivery is gated on attributing a freshly created PR to the run, and the 15-minute window failed closed on any run where the PR surfaced later (multi-turn runs, late/coalesced terminal flushes), so the PR was never attributed and the notification never fired. Generated-By: PostHog Code Task-Id: c18c5b19-a5e4-4c98-95f2-13a68a331e4b
|
React Doctor found no issues in the changed files. 🎉 Reviewed by React Doctor for commit |
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/agent/src/server/agent-server.test.ts:1345-1379
**Prefer parameterised tests**
The three `Slack inbox relay telemetry` cases share the same harness (`setup → relayAgentResponse → inspect logger`) and exercise a finite set of `(input, expected outcome)` combinations. Per the project's stated preference, these should be expressed as a single `it.each` block rather than three separate `it` calls. Each row would carry `message`, an optional `relayMessage` factory, and the expected `info`/`warn` assertion data, letting the table surface the full funnel in one glance.
### Issue 2 of 2
packages/agent/src/server/agent-server.ts:2289-2305
**Flush failure loses root cause in the `slack_inbox_relay` telemetry shape**
When `flush` throws, the raw `logger.warn("Failed to flush logs before Slack relay", …)` sits outside the `slack_inbox_relay` structured shape. If the flush failure then causes `getFullAgentResponse` to return `undefined`, the funnel records `dropped: no_agent_message` — hiding the real upstream cause. An engineer querying `telemetry:slack_inbox_relay` will see a `no_agent_message` drop and have no signal that a flush failure preceded it. Passing `reason: "flush_failed"` (and the serialised error) into `recordRelayTelemetry` when the flush throws would keep the root cause inside the queryable shape.
Reviews (1): Last reviewed commit: "fix(agent): observe and unblock silent S..." | Re-trigger Greptile |
| it("records attempt then delivered on a successful relay", async () => { | ||
| const s = setup({ message: "hello" }); | ||
| await s.relayAgentResponse(payload); | ||
| expect(s.posthogAPI.relayMessage).toHaveBeenCalledWith("t", "r", "hello"); | ||
| expect(outcomesFrom(s.logger.info)).toEqual(["attempt", "delivered"]); | ||
| expect(s.logger.warn).not.toHaveBeenCalled(); | ||
| }); | ||
|
|
||
| it("records a warn-level dropped event when the relay request throws", async () => { | ||
| const relayMessage = vi.fn(async () => { | ||
| throw new Error("boom"); | ||
| }); | ||
| const s = setup({ message: "hello", relayMessage }); | ||
| await s.relayAgentResponse(payload); | ||
| expect(s.logger.warn).toHaveBeenCalledTimes(1); | ||
| const event = s.logger.warn.mock.calls[0][1] as { | ||
| outcome: string; | ||
| reason: string; | ||
| }; | ||
| expect(event.outcome).toBe("dropped"); | ||
| expect(event.reason).toBe("relay_request_failed"); | ||
| }); | ||
|
|
||
| it("records a dropped event when there is no agent message to relay", async () => { | ||
| const s = setup({ message: undefined }); | ||
| await s.relayAgentResponse(payload); | ||
| expect(s.posthogAPI.relayMessage).not.toHaveBeenCalled(); | ||
| expect(s.logger.warn).toHaveBeenCalledTimes(1); | ||
| const event = s.logger.warn.mock.calls[0][1] as { | ||
| outcome: string; | ||
| reason: string; | ||
| }; | ||
| expect(event.outcome).toBe("dropped"); | ||
| expect(event.reason).toBe("no_agent_message"); | ||
| }); |
There was a problem hiding this comment.
The three Slack inbox relay telemetry cases share the same harness (setup → relayAgentResponse → inspect logger) and exercise a finite set of (input, expected outcome) combinations. Per the project's stated preference, these should be expressed as a single it.each block rather than three separate it calls. Each row would carry message, an optional relayMessage factory, and the expected info/warn assertion data, letting the table surface the full funnel in one glance.
Context Used: Do not attempt to comment on incorrect alphabetica... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/agent/src/server/agent-server.test.ts
Line: 1345-1379
Comment:
**Prefer parameterised tests**
The three `Slack inbox relay telemetry` cases share the same harness (`setup → relayAgentResponse → inspect logger`) and exercise a finite set of `(input, expected outcome)` combinations. Per the project's stated preference, these should be expressed as a single `it.each` block rather than three separate `it` calls. Each row would carry `message`, an optional `relayMessage` factory, and the expected `info`/`warn` assertion data, letting the table surface the full funnel in one glance.
**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| try { | ||
| await this.session.logWriter.flush(payload.run_id, { coalesce: true }); | ||
| } catch (error) { | ||
| this.logger.debug("Failed to flush logs before Slack relay", { | ||
| this.logger.warn("Failed to flush logs before Slack relay", { | ||
| taskId: payload.task_id, | ||
| runId: payload.run_id, | ||
| error, | ||
| error: serializeError(error), | ||
| }); | ||
| } | ||
|
|
||
| const message = this.session.logWriter.getFullAgentResponse(payload.run_id); | ||
| if (!message) { | ||
| this.logger.debug("No agent message found for Slack relay", { | ||
| taskId: payload.task_id, | ||
| runId: payload.run_id, | ||
| this.recordRelayTelemetry("dropped", payload, { | ||
| reason: "no_agent_message", | ||
| sessionRegistered: this.session.logWriter.isRegistered(payload.run_id), | ||
| }); | ||
| return; |
There was a problem hiding this comment.
Flush failure loses root cause in the
slack_inbox_relay telemetry shape
When flush throws, the raw logger.warn("Failed to flush logs before Slack relay", …) sits outside the slack_inbox_relay structured shape. If the flush failure then causes getFullAgentResponse to return undefined, the funnel records dropped: no_agent_message — hiding the real upstream cause. An engineer querying telemetry:slack_inbox_relay will see a no_agent_message drop and have no signal that a flush failure preceded it. Passing reason: "flush_failed" (and the serialised error) into recordRelayTelemetry when the flush throws would keep the root cause inside the queryable shape.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/agent/src/server/agent-server.ts
Line: 2289-2305
Comment:
**Flush failure loses root cause in the `slack_inbox_relay` telemetry shape**
When `flush` throws, the raw `logger.warn("Failed to flush logs before Slack relay", …)` sits outside the `slack_inbox_relay` structured shape. If the flush failure then causes `getFullAgentResponse` to return `undefined`, the funnel records `dropped: no_agent_message` — hiding the real upstream cause. An engineer querying `telemetry:slack_inbox_relay` will see a `no_agent_message` drop and have no signal that a flush failure preceded it. Passing `reason: "flush_failed"` (and the serialised error) into `recordRelayTelemetry` when the flush throws would keep the root cause inside the queryable shape.
How can I resolve this? If you propose a fix, please make it concise.- Fix the failing `quality` check: run biome formatter on the relay telemetry test (the line that broke `biome ci`). - Keep flush failures inside the `slack_inbox_relay` telemetry shape via a new `flush_failed` outcome, and tag the subsequent `no_agent_message` drop with `flushFailed` so a cascaded drop's root cause stays queryable instead of reading as a bare drop (Greptile issue 2). - Express the three relay funnel cases as a parameterised `it.each` table per the project's test convention; keep the flush-cascade case separate since its setup/assertions differ (Greptile issue 1). Generated-By: PostHog Code Task-Id: c18c5b19-a5e4-4c98-95f2-13a68a331e4b
Problem
The PostHog team building the Tuesday 2026-06-23 Slack launch is losing personal inbox notifications silently — they just don't arrive, so users assume a quiet inbox rather than a broken pipe ("thought there was something wrong with my inbox being empty"). #2765 root-caused and fixed one cause (the Bash-only PR-detection gate), but a second cause remained and there was zero delivery-side observability to find it.
Why: Notification delivery is posted server-side off the agent relay path and is gated on attributing a freshly created PR to the run. Two mechanisms drop notifications with no observable signal.
Changes
relayAgentResponsepreviously swallowed everyrelayMessagefailure atdebuglevel and early-returned on a missing message with no signal. It now emits a stably-namedslack_inbox_relaytelemetry shape capturing the fullattempt → delivered/droppedfunnel. Drops (relay request failure, no agent message) arewarn-level so they surface in agent logs instead of only as user reports.PR_CREATION_RECENCY_MSfrom 15 minutes to 24 hours. The 15-minute window failed closed on any run where the PR surfaced later than 15 minutes (multi-turn runs, late/coalesced terminal flushes), so the PR was never attributed and the attribution-gated notification never fired. The window still excludes genuinely stale PRs the agent merely views.How did you test this?
pnpm --filter @posthog/agent test pr-url-detector— 11 pass, including new coverage that a PR created an hour earlier is still attributed under the default window.POSTHOG_ALLOW_UNSIGNED_GIT=1 pnpm --filter @posthog/agent test agent-server.test— added 3 relay-telemetry tests (attempt→delivered, warn-level drop on relay failure, drop on missing message); all pass. The 3 pre-existingbuildCloudSystemPromptfailures are unrelated (they fail identically on cleanmain).pnpm --filter @posthog/agent typecheckand Biome lint on changed files — clean.Automatic notifications
Created with PostHog Code from an inbox report.