Skip to content

fix(agent): observe and unblock silent Slack inbox relay drops#2772

Draft
posthog[bot] wants to merge 2 commits into
mainfrom
posthog-code/observe-slack-inbox-relay-drops
Draft

fix(agent): observe and unblock silent Slack inbox relay drops#2772
posthog[bot] wants to merge 2 commits into
mainfrom
posthog-code/observe-slack-inbox-relay-drops

Conversation

@posthog

@posthog posthog Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Problem

The PostHog team building the Tuesday 2026-06-23 Slack launch is losing personal inbox notifications silently — they just don't arrive, so users assume a quiet inbox rather than a broken pipe ("thought there was something wrong with my inbox being empty"). #2765 root-caused and fixed one cause (the Bash-only PR-detection gate), but a second cause remained and there was zero delivery-side observability to find it.

Why: Notification delivery is posted server-side off the agent relay path and is gated on attributing a freshly created PR to the run. Two mechanisms drop notifications with no observable signal.

Changes

  • Make relay drops observable. relayAgentResponse previously swallowed every relayMessage failure at debug level and early-returned on a missing message with no signal. It now emits a stably-named slack_inbox_relay telemetry shape capturing the full attempt → delivered/dropped funnel. Drops (relay request failure, no agent message) are warn-level so they surface in agent logs instead of only as user reports.
  • Unblock attribution on real runs. Widened PR_CREATION_RECENCY_MS from 15 minutes to 24 hours. The 15-minute window failed closed on any run where the PR surfaced later than 15 minutes (multi-turn runs, late/coalesced terminal flushes), so the PR was never attributed and the attribution-gated notification never fired. The window still excludes genuinely stale PRs the agent merely views.

How did you test this?

  • pnpm --filter @posthog/agent test pr-url-detector — 11 pass, including new coverage that a PR created an hour earlier is still attributed under the default window.
  • POSTHOG_ALLOW_UNSIGNED_GIT=1 pnpm --filter @posthog/agent test agent-server.test — added 3 relay-telemetry tests (attempt→delivered, warn-level drop on relay failure, drop on missing message); all pass. The 3 pre-existing buildCloudSystemPrompt failures are unrelated (they fail identically on clean main).
  • pnpm --filter @posthog/agent typecheck and Biome lint on changed files — clean.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Created with PostHog Code from an inbox report.

Inbox/Slack notifications are posted server-side off the agent relay path, so a silent drop reads to the user as an empty inbox rather than a broken pipe (the reported symptom). This addresses the second, unidentified cause of dropped notifications that #2765 did not fully fix.

- Emit stably-named attempt/delivered/dropped telemetry on `relayAgentResponse`, replacing the debug-only logs that swallowed every failure. Drops (relay request failure, no agent message) are now warn-level so they surface in agent logs instead of only as user reports.
- Widen `PR_CREATION_RECENCY_MS` from 15 minutes to 24 hours. Notification delivery is gated on attributing a freshly created PR to the run, and the 15-minute window failed closed on any run where the PR surfaced later (multi-turn runs, late/coalesced terminal flushes), so the PR was never attributed and the notification never fired.

Generated-By: PostHog Code
Task-Id: c18c5b19-a5e4-4c98-95f2-13a68a331e4b
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown

React Doctor found no issues in the changed files. 🎉

Reviewed by React Doctor for commit cd04536.

@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/agent/src/server/agent-server.test.ts:1345-1379
**Prefer parameterised tests**

The three `Slack inbox relay telemetry` cases share the same harness (`setup → relayAgentResponse → inspect logger`) and exercise a finite set of `(input, expected outcome)` combinations. Per the project's stated preference, these should be expressed as a single `it.each` block rather than three separate `it` calls. Each row would carry `message`, an optional `relayMessage` factory, and the expected `info`/`warn` assertion data, letting the table surface the full funnel in one glance.

### Issue 2 of 2
packages/agent/src/server/agent-server.ts:2289-2305
**Flush failure loses root cause in the `slack_inbox_relay` telemetry shape**

When `flush` throws, the raw `logger.warn("Failed to flush logs before Slack relay", …)` sits outside the `slack_inbox_relay` structured shape. If the flush failure then causes `getFullAgentResponse` to return `undefined`, the funnel records `dropped: no_agent_message` — hiding the real upstream cause. An engineer querying `telemetry:slack_inbox_relay` will see a `no_agent_message` drop and have no signal that a flush failure preceded it. Passing `reason: "flush_failed"` (and the serialised error) into `recordRelayTelemetry` when the flush throws would keep the root cause inside the queryable shape.

Reviews (1): Last reviewed commit: "fix(agent): observe and unblock silent S..." | Re-trigger Greptile

Comment on lines +1345 to +1379
it("records attempt then delivered on a successful relay", async () => {
const s = setup({ message: "hello" });
await s.relayAgentResponse(payload);
expect(s.posthogAPI.relayMessage).toHaveBeenCalledWith("t", "r", "hello");
expect(outcomesFrom(s.logger.info)).toEqual(["attempt", "delivered"]);
expect(s.logger.warn).not.toHaveBeenCalled();
});

it("records a warn-level dropped event when the relay request throws", async () => {
const relayMessage = vi.fn(async () => {
throw new Error("boom");
});
const s = setup({ message: "hello", relayMessage });
await s.relayAgentResponse(payload);
expect(s.logger.warn).toHaveBeenCalledTimes(1);
const event = s.logger.warn.mock.calls[0][1] as {
outcome: string;
reason: string;
};
expect(event.outcome).toBe("dropped");
expect(event.reason).toBe("relay_request_failed");
});

it("records a dropped event when there is no agent message to relay", async () => {
const s = setup({ message: undefined });
await s.relayAgentResponse(payload);
expect(s.posthogAPI.relayMessage).not.toHaveBeenCalled();
expect(s.logger.warn).toHaveBeenCalledTimes(1);
const event = s.logger.warn.mock.calls[0][1] as {
outcome: string;
reason: string;
};
expect(event.outcome).toBe("dropped");
expect(event.reason).toBe("no_agent_message");
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Prefer parameterised tests

The three Slack inbox relay telemetry cases share the same harness (setup → relayAgentResponse → inspect logger) and exercise a finite set of (input, expected outcome) combinations. Per the project's stated preference, these should be expressed as a single it.each block rather than three separate it calls. Each row would carry message, an optional relayMessage factory, and the expected info/warn assertion data, letting the table surface the full funnel in one glance.

Context Used: Do not attempt to comment on incorrect alphabetica... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/agent/src/server/agent-server.test.ts
Line: 1345-1379

Comment:
**Prefer parameterised tests**

The three `Slack inbox relay telemetry` cases share the same harness (`setup → relayAgentResponse → inspect logger`) and exercise a finite set of `(input, expected outcome)` combinations. Per the project's stated preference, these should be expressed as a single `it.each` block rather than three separate `it` calls. Each row would carry `message`, an optional `relayMessage` factory, and the expected `info`/`warn` assertion data, letting the table surface the full funnel in one glance.

**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines 2289 to 2305
try {
await this.session.logWriter.flush(payload.run_id, { coalesce: true });
} catch (error) {
this.logger.debug("Failed to flush logs before Slack relay", {
this.logger.warn("Failed to flush logs before Slack relay", {
taskId: payload.task_id,
runId: payload.run_id,
error,
error: serializeError(error),
});
}

const message = this.session.logWriter.getFullAgentResponse(payload.run_id);
if (!message) {
this.logger.debug("No agent message found for Slack relay", {
taskId: payload.task_id,
runId: payload.run_id,
this.recordRelayTelemetry("dropped", payload, {
reason: "no_agent_message",
sessionRegistered: this.session.logWriter.isRegistered(payload.run_id),
});
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Flush failure loses root cause in the slack_inbox_relay telemetry shape

When flush throws, the raw logger.warn("Failed to flush logs before Slack relay", …) sits outside the slack_inbox_relay structured shape. If the flush failure then causes getFullAgentResponse to return undefined, the funnel records dropped: no_agent_message — hiding the real upstream cause. An engineer querying telemetry:slack_inbox_relay will see a no_agent_message drop and have no signal that a flush failure preceded it. Passing reason: "flush_failed" (and the serialised error) into recordRelayTelemetry when the flush throws would keep the root cause inside the queryable shape.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/agent/src/server/agent-server.ts
Line: 2289-2305

Comment:
**Flush failure loses root cause in the `slack_inbox_relay` telemetry shape**

When `flush` throws, the raw `logger.warn("Failed to flush logs before Slack relay", …)` sits outside the `slack_inbox_relay` structured shape. If the flush failure then causes `getFullAgentResponse` to return `undefined`, the funnel records `dropped: no_agent_message` — hiding the real upstream cause. An engineer querying `telemetry:slack_inbox_relay` will see a `no_agent_message` drop and have no signal that a flush failure preceded it. Passing `reason: "flush_failed"` (and the serialised error) into `recordRelayTelemetry` when the flush throws would keep the root cause inside the queryable shape.

How can I resolve this? If you propose a fix, please make it concise.

- Fix the failing `quality` check: run biome formatter on the relay telemetry test (the line that broke `biome ci`).
- Keep flush failures inside the `slack_inbox_relay` telemetry shape via a new `flush_failed` outcome, and tag the subsequent `no_agent_message` drop with `flushFailed` so a cascaded drop's root cause stays queryable instead of reading as a bare drop (Greptile issue 2).
- Express the three relay funnel cases as a parameterised `it.each` table per the project's test convention; keep the flush-cascade case separate since its setup/assertions differ (Greptile issue 1).

Generated-By: PostHog Code
Task-Id: c18c5b19-a5e4-4c98-95f2-13a68a331e4b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants