
feat: add webhook-dx-audit skill (webhooks and event destinations) #67

Open
leggetter wants to merge 31 commits into main from feat/webhook-dx-audit-skill

Collaborator

@leggetter leggetter commented Jun 2, 2026

Summary

Adds webhook-dx-audit, a meta-skill that reviews the developer experience of any platform that sends outbound webhooks or delivers to event destinations (HTTP, SQS, RabbitMQ, Pub/Sub, EventBridge, Kafka, Azure Event Grid) and produces a scored written review with prioritized recommendations.

Event destinations cover the broader case where a platform delivers events to user-chosen destinations beyond HTTP, so the audit applies to webhook senders and full event-destinations platforms equally.

The skill scores 12 categories (signup, onboarding, docs, event catalog, security/signing, delivery semantics, setup surfaces, SDKs, consumer self-serve, observability, local dev, and agent/AI readiness), weighted by where integrating developers lose the most time and trust. Recommendations point to matching Hookdeck offerings (Console test URLs, CLI, Event Gateway source types, generated webhook skills) as options, not pitches.

Aligned with the Event Destinations initiative

The rubric is benchmarked against the Event Destinations initiative spec at https://eventdestinations.org, not just a Hookdeck-internal opinion. The industry terminology is in flux: Stripe popularized "event destinations" and now delivers directly to Amazon EventBridge and Azure Event Grid alongside webhooks; Shopify is rolling out "Event Subscriptions"; others still call the whole thing "webhooks". The skill scores against the broader concept regardless of the platform's chosen label.

Specifically:

  • Cat 5 (Security & authentication) splits into webhook-specific signing criteria (signature, replay protection, secret rotation, source IP) plus a new "Destination-native auth" criterion covering IAM, service accounts, managed identities, and SASL/mTLS for non-HTTP destinations. Webhook criteria degrade to Not assessed for queue-only platforms.
  • Cat 6 (Delivery semantics & reliability) gains a "Destination type breadth" criterion — the spec's central required capability (at least two destination types including webhooks).
  • Cat 8 (SDKs & verification libraries) stays webhook-focused on purpose, since webhooks remain the most common destination and hand-rolled HMAC is where integrators get burned. Non-HTTP destinations score under Cat 5's native-auth criterion instead.
  • methodology.md adds a step 0 to identify destination types up front, since that determines which criteria apply.
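
For context on what the Cat 5 webhook-signing criteria look for on the consumer side, here is a minimal Python sketch of HMAC-SHA256 verification with a timestamp window for replay protection. The signed-payload format (`<timestamp>.<body>`), the 300-second tolerance, and the function name are illustrative assumptions, not any specific platform's scheme.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(secret: bytes, body: bytes, timestamp: str, signature_hex: str,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Return True if the signature matches and the timestamp is recent.

    Assumed (hypothetical) scheme: hex HMAC-SHA256 over b"<timestamp>.<body>".
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Replay protection: reject deliveries outside the tolerance window.
    if abs(now - sent_at) > tolerance_s:
        return False
    signed = timestamp.encode() + b"." + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

Non-HTTP destinations would not use this path at all; they authenticate through the destination's own mechanism, which is why the rubric scores them under the native-auth criterion.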

Why it's a different category

The rest of the repo helps you consume webhooks (provider skills, handler patterns) or operate webhook infrastructure (Event Gateway, Outpost). This skill is meta: it audits how a third-party platform exposes outbound webhooks or event destinations to its own customers. It produces a Markdown review, not a handler.

It has no examples/ directory (no code to ship), matching the references-only shape of webhook-handler-patterns.

Placement

Added a new top-level README section, Webhook & Event Destinations DX Audit Skills, after the existing Webhook Infrastructure Skills section. None of the Provider, Handler Pattern, or Infrastructure tables fits cleanly: those are about building or running webhook integrations, while this skill evaluates them. A dedicated section keeps the boundary obvious for discovery and leaves room for future evaluation/tooling meta-skills.

What's included

  • skills/webhook-dx-audit/SKILL.md — frontmatter conformed to repo conventions (license, metadata, multi-line description); scope paragraph references the Event Destinations spec
  • skills/webhook-dx-audit/references/rubric.md, methodology.md, scoring.md, program-mapping.md
  • skills/webhook-dx-audit/assets/report-template.md — output template the skill fills in
  • README.md — new "Webhook & Event Destinations DX Audit Skills" section; bundle count updated to 38
  • .claude-plugin/marketplace.json — new plugin entry (category development) with event-destinations keyword, and bundle inclusion

Notes for reviewers

  • Frontmatter conformed to repo conventions; the rubric and methodology were updated to align with the Event Destinations initiative (commit ac81c52). Writing style is developer-to-developer, no em-dashes.
  • No providers.yaml entry (not a provider) and no scripts/test-agent-scenario.sh entry (no runnable code path to test). Mirrors the pattern of webhook-handler-patterns.
  • The program-mapping.md reference points findings at existing Hookdeck offerings (Console, CLI, EG, generated webhook skills) so audits double as a discovery path back into this repo.

Test plan

  • Frontmatter validates (name matches directory, description under 1024 chars, license MIT)
  • marketplace.json parses (verified locally with python3 -c "import json; json.load(open('.claude-plugin/marketplace.json'))")
  • README section reads cleanly and the bundle still totals 38 skills
  • Skill triggers correctly on prompts like "review Stripe's webhook DX", "audit Acme's event destinations", "score Shopify's event subscriptions", or "score Twilio's outbound event delivery"
  • On a multi-destination platform (e.g. Stripe with webhooks + EventBridge + Event Grid), the skill identifies destination types in step 0, scores Cat 6 destination-type breadth correctly, and applies Cat 5 native-auth + webhook-signing criteria appropriately
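
The frontmatter checks in the first test-plan item could be scripted along these lines. This is a sketch only: it assumes the SKILL.md frontmatter has already been parsed into a dict by a YAML parser, and `check_frontmatter` is a hypothetical helper, not something in the repo.

```python
from typing import Dict, List


def check_frontmatter(meta: Dict[str, str], skill_dir: str) -> List[str]:
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    # The skill name must match its directory name.
    if meta.get("name") != skill_dir:
        problems.append("name does not match directory")
    # The description must be present and under 1024 characters.
    description = meta.get("description", "")
    if not description or len(description) >= 1024:
        problems.append("description missing or not under 1024 chars")
    # The license field must be MIT.
    if meta.get("license") != "MIT":
        problems.append("license is not MIT")
    return problems
```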

leggetter added 2 commits June 2, 2026 11:42
Adds a meta/audit skill that reviews the developer experience of any
platform sending outbound webhooks and produces a scored review with
prioritized recommendations across signing, retries, event catalog,
observability, local dev, and agent readiness.

This is a different category from the existing repo skills (which
receive, send, or verify webhooks): the audit evaluates how other
platforms expose their webhook DX. README adds a new "Webhook DX
& Audit Skills" section to make the distinction clear, and the
bundle includes the new skill.

The skill audits webhooks AND event destinations (SQS, RabbitMQ,
Pub/Sub, EventBridge, Kafka), not just webhooks. Surface the
distinction in the README section title and marketplace metadata,
and add an event-destinations keyword for discovery.
@leggetter leggetter changed the title from "feat: add webhook-dx-audit skill" to "feat: add webhook-dx-audit skill (webhooks and event destinations)" on Jun 2, 2026
leggetter and others added 27 commits June 2, 2026 11:54
The scope is webhooks AND event destinations, not just webhooks.
Industry terminology is shifting (Stripe "event destinations" with
direct EventBridge/Event Grid delivery, Shopify "Event
Subscriptions"). Update the rubric and SKILL.md to assess the
broader concept and benchmark against the Event Destinations
initiative (https://eventdestinations.org).

Changes:
- SKILL.md: add scope paragraph referencing the spec and the
  terminology shift; clarify that the audit applies regardless of
  what the platform calls it.
- rubric.md intro: add the spec reference and a summary of its
  required/recommended capabilities.
- rubric category 5 (Security & authentication): split into
  webhook-specific signing criteria and a new "Destination-native
  auth" criterion covering IAM, service accounts, managed
  identities, and SASL/mTLS for non-HTTP destinations. Webhook
  criteria become Not assessed for queue-only platforms.
- rubric category 6 (Delivery semantics & reliability): add a
  "Destination type breadth" criterion - the spec's central
  required capability.
- rubric category 8 (SDKs & verification libraries): clarify that
  this stays webhook-focused because webhooks remain the most
  common destination and hand-rolled HMAC is where integrators
  get burned; non-HTTP destinations use native SDK auth and are
  scored under category 5.
- methodology.md: add a step 0 to identify destination types up
  front, since that determines which criteria apply.
…t run

Test-drove the skill against Stripe (Pass 1, public surface only;
84/B). Stripe's three-destination story (webhooks + EventBridge +
Event Grid) surfaced anchor ambiguities and methodology gaps. This
commit applies all 22 ranked fixes from that test, batched because
they cohere and each is small.

rubric.md (10 fixes):
- Cat 4: machine-readable spec anchors name OpenAPI 3.1 webhooks
  block, AsyncAPI, and per-event JSON Schema explicitly; a single
  polymorphic event envelope scores 1, not 2.
- Cat 5: add explicit "if both webhooks AND native destinations
  are offered, score all six criteria" rule at the top.
- Cat 5 destination-auth-options: clarify it scores configurable
  bearer/headers/OAuth2/mTLS independently of the signature scheme.
- Cat 6 failure handling & auto-disable: split anchors into the
  two distinct gaps (post-retry behavior docs, auto-disable
  feature with reactivation).
- Cat 6 failure alerting: limit scoring to push channels; dashboard
  widgets count under cat 10 (observability) instead.
- Cat 6 manual replay: anchors recognize partial coverage
  (sandbox-only, UI-only, CLI-only) at level 1.
- Cat 7 IaC: split community-vs-official cliff; community provider
  with current coverage can score 1, vendor-maintained scores 2.
- Cat 10 latency: spell out the three signals (attempt count,
  next-retry time, per-attempt response latency).
- Cat 12 push-to-agent: add a 1 anchor for partial coverage.
- Cat 12 MCP: 1 anchor names "agent SDK or function-calling
  toolkit"; 2 requires MCP or a deliberate scoped surface.

methodology.md (5 fixes):
- Read-what-a-human-reads: distinguish evidence collection (any
  source) from scoring (HTML page).
- Step 0 destination types: expand search-term list (endpoint,
  partner event source, stream, etc.).
- Step 2 specs: name OpenAPI 3.1 webhooks block, AsyncAPI,
  per-event JSON Schema as the three things to look for.
- Step 9 agent readiness: define "scoped sensibly" for llms.txt.
- What good looks like: handle calibration circularity when the
  audit subject is itself a reference platform (calibrate against
  the broader Event Destinations bar instead).

SKILL.md (2 fixes):
- Add evidence-vs-scoring distinction for .md exports.
- Document the Pass-1-only exit path: skip the human checklist,
  mark gated criteria Not assessed, proceed to scoring.

scoring.md (2 fixes):
- Add a second worked example with a Not-Assessed exclusion.
- Add the renormalization formula and a worked example for when
  a category is fully dropped.
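
The commit names a renormalization formula without quoting it. One plausible reading, sketched here as an assumption, is a weighted average whose denominator only includes the categories that still have a score. The category names and weights in the usage note are made up for illustration; the real ones live in scoring.md.

```python
from typing import Dict, Optional


def renormalized_overall(pct: Dict[str, Optional[float]],
                         weights: Dict[str, float]) -> float:
    """Weighted average over the categories that still have a score.

    A category whose percentage is None was fully dropped; its weight is
    left out of the denominator so the remaining weights are rescaled.
    """
    kept = {name: value for name, value in pct.items() if value is not None}
    total_weight = sum(weights[name] for name in kept)
    if total_weight == 0:
        raise ValueError("no scored categories")
    return sum(weights[name] * value for name, value in kept.items()) / total_weight
```

With hypothetical weights of 20, 15, and 5 and category scores of 80, 60, and a dropped third category, the overall is (20*80 + 15*60) / 35 rather than dividing by the full 40.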

report-template.md (3 fixes):
- Access field examples signal Pass 1 vs Pass 1+2.
- Caption under the scorecard clarifies Overall is weight-adjusted.
- Findings section: always list every criterion, mark unreached
  ones Not assessed inline.

program-mapping.md (3 fixes):
- New row: endpoint health (auto-disable, alerting, reactivation)
  -> Hookdeck Event Gateway in front of the consumer endpoint.
  Addresses the highest-impact cat 6 gap most platforms have.
- New row: OpenAPI lacks webhooks block -> webhook skill in
  hookdeck/webhook-skills as an agent-shaped substitute.
- Broaden the existing webhook-skill row to acknowledge it can
  also surface cat 3/6 docs gaps to agent consumers, not just cat 12.

Test artifacts (Stripe audit + findings doc) saved in /tmp;
not committed.

Investigation of Stripe, Shopify, and Paddle revealed a real
maturity differentiator the rubric was not capturing:

- Paddle ships named "Scenarios" (subscription_creation = 12
  events, renewal = 7, etc.) that fire curated lifecycle sequences
  in one trigger.
- Stripe has implicit prerequisite chaining (firing
  payment_intent.succeeded also fires payment_intent.created)
  plus CLI fixtures for scripted multi-step composition.
- Shopify's webhook trigger is explicitly single-event only with
  fixed payload, recommending real Shopify actions for end-to-end
  tests.

Add a "Workflow / scenario simulation" criterion under category 11
(local dev / testing) as a sibling to test/sandbox parity. Framed
as a maturity differentiator, not a baseline: 0 is acceptable for
most platforms; 1 covers Stripe-style fixtures or implicit chains;
2 covers Paddle-style named lifecycle scenarios.

Update methodology step 8 with search terms (scenario, fixture,
lifecycle, workflow, trigger sequence) and the three platform
patterns as calibration anchors.

The rubric was collapsing three different states into one "Not assessed"
label, which caused Pass-1-only grades to inflate (the Stripe re-audit
went from 84/B to 85/A largely because of this). Split the states and
adjust the math so the labels mean what they say.

Three states (rubric.md):
- Not Supported: capability should exist but doesn't. Score 0;
  numerator 0, full weight in denominator. (Existing 0 behavior;
  the label clarifies intent in evidence.)
- Not Applicable: a logical rule excludes the criterion *as a
  concept* (e.g. Cat 5 destination-native auth on a webhook-only
  platform — there are no non-HTTP destinations to score auth
  for). Drop from both numerator and denominator. Critically NOT
  for "the platform should have this but doesn't" cases — those
  are Not Supported = 0.
- Not Assessed: should assess but cannot reach right now (HITL
  gap, gated dashboard). Treated differently across the two
  roll-ups below; signals HITL would lift the score.

Two roll-ups from the same per-criterion data (scoring.md):
- Public-scope grade. "How good are the parts we could see?"
  Drops both Not Applicable and Not Assessed from numerator and
  denominator. Honest score over what was reachable.
- Provisional minimum. "What's the floor if HITL never runs?"
  Drops Not Applicable only. Treats Not Assessed as 0 in
  numerator with full weight in denominator. HITL Pass 2 can
  only raise this number.

When HITL completes (no Not Assessed criteria remain), the two
scores converge on a single final grade.
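
The arithmetic of the three states and two roll-ups can be sketched for a single category as below. This assumes every criterion carries equal weight on the 0/1/2 scale; the state strings and the `rollups` name are illustrative, not taken from scoring.md.

```python
from typing import Iterable, Optional, Tuple

MAX_POINTS = 2  # each criterion is scored 0, 1, or 2


def rollups(criteria: Iterable[Tuple[str, Optional[int]]]
            ) -> Tuple[Optional[float], Optional[float]]:
    """Return (public_scope_pct, provisional_minimum_pct).

    Each item is (state, score). state is one of "scored", "not_supported",
    "not_applicable", "not_assessed"; score is only read for "scored".
    """
    pub_num = pub_den = prov_num = prov_den = 0
    for state, score in criteria:
        if state == "not_applicable":
            continue  # excluded from both roll-ups
        if state == "not_assessed":
            prov_den += MAX_POINTS  # counts as 0 in the provisional minimum only
            continue
        points = score if state == "scored" else 0  # not_supported scores 0
        pub_num += points
        pub_den += MAX_POINTS
        prov_num += points
        prov_den += MAX_POINTS

    def pct(num: int, den: int) -> Optional[float]:
        return round(100 * num / den, 1) if den else None

    return pct(pub_num, pub_den), pct(prov_num, prov_den)
```

For six criteria scored 2, 1, 1, Not Supported, 2, Not Assessed, this gives a public-scope grade of 60.0 (6 of 10) and a provisional minimum of 50.0 (6 of 12). Once the Not Assessed criterion is scored, both numbers are computed over the same 12 points and converge.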

Report template (assets/report-template.md):
- Scorecard now shows both columns per category and overall.
- Header shows the headline number twice: Public-scope leads
  when no HITL is planned; Provisional minimum leads when HITL
  is planned (conservative bound the customer can rely on).
- Coverage line under the scorecard counts how many criteria
  landed in each state.
- Optional "Context" line in the frontmatter for audits that
  are existing-customer deliverables.
- Recommendations template encourages "Concrete change (platform
  side) / Hookdeck offering (already available or in path)"
  framing for existing customers.

Cat 5 (rubric.md):
- Header now spells out three branches (webhook-only, non-HTTP-
  only, multi-destination). Per-criterion N/A clauses encode the
  logical rules. Stripe and Shopify both cited as multi-
  destination examples (Shopify ships HTTP + EventBridge + Pub/
  Sub destinations).

Cat 12 CLI for agents:
- Was incorrectly allowed an N/A escape hatch in the prior draft.
  Reverted: a CLI is a recommended capability for any developer
  platform; absence is a gap (Not Supported = 0), not a logical
  exclusion. The 0 anchor language now makes this explicit.

Other updates:
- SKILL.md scope paragraph reflects Shopify is multi-destination,
  not webhook-only. Adds a one-paragraph summary of the three
  states + two roll-ups.
- methodology.md adds a "Pick the right label" note explaining
  when to use each state and why the arithmetic differs.

No customer-specific content; this work is generic to the skill.

Add an explicit N/A logic table after the Categories list. Apply
mechanically based on the destination types identified at
methodology step 0; do not re-derive N/A from per-criterion text.

The table lists the four possible step-0 facts and which criteria
become N/A for each. Currently seven criteria (Cat 5 x 6 + Cat 8
x 1) can be N/A, all driven by two boolean facts (offers webhooks?
offers non-HTTP destinations?).

To make the table the single source of truth:
- Stripped the redundant per-criterion "(Not Applicable if X)"
  clauses from Cat 5 (5 criteria) and Cat 8 (1 criterion).
- Trimmed the Cat 5 header: removed the three-branch list
  (webhook-only / non-HTTP-only / multi-destination) since that's
  now encoded in the table. Kept the security-philosophy paragraph
  because it explains why the criteria differ by destination type.
- Cat 8 verification helper 0 anchor reframed to acknowledge the
  upstream Cat 5 dependency (no signature scheme to verify ->
  the helper question is downstream of that gap).

New criteria with N/A conditions should add a row to the table
rather than introducing a new inline clause. Comments in the
commit message of any future change should reference the table
row affected.
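
A rough sketch of how the table can be applied mechanically from the two step-0 booleans. The criterion identifiers below are placeholders, and the assignment of the Cat 8 helper to the webhook-dependent set is an inference from the surrounding commits; rubric.md remains the source of truth.

```python
from typing import FrozenSet

# Hypothetical identifiers for criteria that only make sense with HTTP webhooks.
WEBHOOK_DEPENDENT = frozenset({
    "cat5.signature",
    "cat5.replay_protection",
    "cat5.secret_rotation",
    "cat5.source_ip",
    "cat5.destination_auth_options",
    "cat8.verification_helper",
})
# Hypothetical identifier for the criterion that needs a non-HTTP destination.
NON_HTTP_DEPENDENT = frozenset({"cat5.destination_native_auth"})


def not_applicable(offers_webhooks: bool, offers_non_http: bool) -> FrozenSet[str]:
    """Return the criteria that become Not Applicable for the step-0 facts."""
    result = set()
    if not offers_webhooks:
        result |= WEBHOOK_DEPENDENT
    if not offers_non_http:
        result |= NON_HTTP_DEPENDENT
    return frozenset(result)
```

A multi-destination platform gets an empty set, so all criteria are scored; a webhook-only platform drops only the native-auth criterion.
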
… Assessed

Add a deterministic table tagging which criteria require account-level
(L1) or active-usage (L2) access to score, alongside the existing N/A
logic table. Pass-1 audits at L0 (public only) now have a mechanical
rule for which criteria become Not Assessed; the agent does not have
to derive it per-criterion.

Three access levels (rubric.md):
- L0: public docs, SDK source, machine specs, llms.txt
- L1: logged-in session; can read dashboard, settings, account-gated
  docs
- L2: L1 plus at least one delivered event observed; delivery logs,
  retries, alerting visible in practice

How L1 or L2 was obtained does not matter to the rubric. The auditor
may have signed up themselves, used agent-driven signup (e.g. Stripe
Projects, https://projects.dev), or been given access by the
platform's operator. Future-proof: as agent-signup capabilities
mature, more audits can declare L1/L2 without changing the rubric.

~12 criteria are tagged with required access levels (mostly Cat 1,
2, 7, 9, 10, 11). A few are "L1 or L0 if docs are thorough enough" -
those remain agent judgment within a tighter frame.

Other changes:
- Report template's Access line dropped the "customer-provided
  access" wording (that was an Outpost-audit context leak) and now
  uses the L0/L1/L2 levels directly. A note clarifies that the
  means of obtaining access does not matter, only the level
  reached.
- methodology.md "What good looks like" adds Stripe Projects
  (projects.dev) as an agent-driven provisioning calibration
  anchor for Cat 12 Action-layer scoring.

This makes Not Assessed deterministic at the level the framework
can reasonably enforce. The remaining agent judgment is limited to
the few criteria explicitly tagged "L1 or L0 if ..." in the table.
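
The mechanical rule can be sketched as a comparison between the level a criterion requires and the level the audit reached. The criterion names in the usage note are hypothetical; the real tags are in the rubric.md access-level table.

```python
from enum import IntEnum
from typing import Dict, List


class Access(IntEnum):
    L0 = 0  # public docs, SDK source, machine specs, llms.txt
    L1 = 1  # logged-in session
    L2 = 2  # L1 plus at least one delivered event observed


def not_assessed(required: Dict[str, Access], reached: Access) -> List[str]:
    """Criteria whose required access level is above the level the audit reached."""
    return sorted(name for name, level in required.items() if level > reached)
```

For example, with hypothetical tags `{"delivery_logs": Access.L2, "dashboard_setup": Access.L1, "verification_walkthrough": Access.L0}`, an L0 audit marks the first two Not Assessed, an L1 audit marks only `delivery_logs`, and an L2 audit marks none.
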
…n anchors

Hookdeck Outpost and Svix are webhook delivery products platforms
use to send events. Naming them as calibration anchors for
sender-DX scoring was the wrong reference frame: integrators
typically experience the *platform* (its docs, signing scheme,
dashboard), not the delivery infrastructure embedded behind it.
And this skill lives in hookdeck/webhook-skills, so naming Hookdeck
specifically as a benchmark would be a conflict of interest.

Use platforms integrators directly experience and benchmark
against: Stripe as the primary anchor; SendGrid (ECDSA signing),
GitHub (event taxonomy), Twilio (per-attempt status callbacks)
for specific features. The Event Destinations initiative
(eventdestinations.org) sets the broader floor.

Hookdeck Outpost stays in program-mapping.md as a gap-closing
recommendation for the platform side. Hookdeck Event Gateway
tools stay in the "Hookdeck tooling" section of methodology.md
as evidence-gathering aids during the audit (Console test URLs
for inspecting payloads, CLI for receiving on localhost) -
those are ingestion tools for the auditor, not benchmarks.
…m; modern docs platforms

Three small refinements surfaced by the customer audit re-run.

A1: rubric.md access-level table now explicitly authorizes scoring
from L0 absence-of-documentation. If public docs are completely
silent on a capability tagged L1 or L2, score 0 (Not Supported)
from L0 rather than Not Assessed. The access-level requirement is
for VERIFICATION of a documented capability; confirming
non-existence is an L0 finding. Removes the only judgment call I
had to make by interpretation during the re-run.

A2: report-template.md scorecard now surfaces "HITL headroom: NN
points" prominently between the table and the renormalization
caption. Small headroom means HITL won't materially change the
grade; large headroom means HITL is load-bearing. Easier to see
than the gap in the dual-score columns.

A3: Cat 12 push-to-agent criterion now defaults to Not Assessed
(not 0) for docs hosted on modern platforms (Mintlify 2025+,
Docusaurus 3+, GitBook, ReadMe) where Copy-as-Markdown and
Open-in-X are typically JS-rendered. A non-browser fetch may not
see the buttons; the right call is to defer to HITL rather than
score 0 from rendering blindness.

The first customer audit had to interpret all three of these rules;
the framework now encodes them.

All surfaced by the HITL Pass 2 of a real customer audit. Each
addresses a real ambiguity or editorial leak in the rubric.

scoring.md grade bands:
- Dropped the editorial "Reading" column entirely. Grade letters
  alone; the "band is a headline, not the point" note already
  carried the framing. Per audit feedback that "painful or
  risky"-style language doesn't belong in audit output.
- Added explicit "do not write qualitative judgments of the grade
  into the audit report" line.
- Added boundary-zone note for 28-32 (F/D) and 83-87 (B/A) — these
  are sanity-check zones where rounding shifts the band.
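
A small sketch of the boundary-zone check, using only the two ranges named above. Treating the endpoints as inclusive is an assumption, and the helper name is illustrative.

```python
from typing import Optional

# Score ranges where rounding can move the grade across a band boundary.
BOUNDARY_ZONES = {"F/D": (28, 32), "B/A": (83, 87)}


def boundary_zone(score: float) -> Optional[str]:
    """Return the name of the boundary zone the score falls in, if any."""
    for name, (low, high) in BOUNDARY_ZONES.items():
        if low <= score <= high:
            return name
    return None
```

A score of 84 or 85 lands in the B/A zone, which is the kind of one-point shift that should prompt a second look at the per-criterion scores before the band is reported.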

Cat 4 payload shape guidance:
- Relaxed the 2 anchor. Was "explicit thin-vs-fat rationale, OR
  standard envelope like CloudEvents". Now "envelope is consistent
  across all event types and documented". CloudEvents alignment
  and thin-vs-fat rationale moved to bonus signals worth citing
  in evidence but not required for 2. Most platforms with strong
  event catalogs don't formally address the meta-framing; the
  prior anchor over-penalized them.

Cat 1 free/test access:
- Reworked anchors to handle two underlying questions (does free
  tier reach config? are test deliveries free?) as a sliding
  scale. 1 covers the partial case (e.g. paid plan required for
  config but test deliveries free once configured, the audited
  customer's shape) which the prior binary 0/2 anchor missed.

Cat 5 destination auth options:
- Requires auth framing in docs for any score above 0. A platform
  shipping an arbitrary header passthrough field without
  documenting it as an auth mechanism now correctly scores 0;
  previously a strict reading allowed 1 for mere field existence.

Audience scoping (new N/A logic Table 2):
- Two audiences: developer-platform (default, where integrators
  are software engineers) and no-code-saas (where integrators are
  power users in a UI). For no-code-saas, Cat 7 IaC and Cat 11
  workflow simulation and local-to-production transition become
  N/A. The third option "mixed" defaults to scoring all criteria
  unless the platform clearly serves one exclusively.
- Audience declared at methodology step 0 and in report
  frontmatter alongside Access level.
- Existing destination-type N/A logic becomes Table 1; audience
  becomes Table 2.

Methodology audit voice guidance:
- Explicit "stay factual, no editorial" rule. Per-category prose
  describes observation; reactions and synthesis go in the summary
  and recommendations. Examples cited: don't use "surprising",
  "impressive", "disappointing", "painful" in per-criterion or
  per-category text.

These changes are derived from real auditor experience; the
customer audit itself stays at /tmp (not committed).
…ess model

The previous Cat 1 "Free/test access" criterion conflated two
distinct concerns: (a) business model (is the platform/feature
free to access), and (b) DX (can you test webhooks without
producing real production activity). Per repeated audit feedback,
business model is not a DX question and shouldn't penalize
platforms that offer webhooks behind a paid plan. The testability
question is already covered by Cat 2 Test event / trigger and Cat
11 Test / sandbox parity.

Cat 1 changes:
- Removed "Free/test access" criterion entirely.
- Reframed "Signup friction to webhook config" as "In-product
  discoverability of webhook configuration". Explicitly handles
  plan-gating: plan-gated features are fine as long as the
  configuration surface is visible in product navigation.
- Findability of webhook docs criterion now distinguishes deep-nav
  (1) from top-level (2) explicitly.

Cat 1 now has 2 criteria, both focused on discoverability:
1. Can a developer find the webhook docs from the top-level docs
   or product nav? (the "docs side" of discovery)
2. From a signed-in account on any tier, can a user discover that
   the platform offers webhooks and find where they would be
   configured? (the "in-product side" of discovery)

Also added a note clarifying what is NOT scored in Cat 1: pre-
purchase evaluation (business model) and production-data isolation
(covered in Cat 2 and Cat 11).

Updated the access-level table to reflect the criterion name
change and removal. Total rubric criterion count drops by one.

Derived from real auditor experience on a paid-plan-gated platform
where the prior rubric incorrectly penalized the business model.

Cat 3 Documentation quality:
- Added "Idempotency guidance" criterion. Scores whether the docs
  (a) identify the unique delivery ID developers should dedupe on
  (a top-level event ID in the payload, a webhook-id-style header,
  or equivalent), and (b) explain the high-level dedup pattern
  (check ID -> process -> store ID -> return success for
  duplicates). 0/1/2.
- Removed "idempotency" from the Best-practices coverage anchor
  list since it now has its own criterion. Best-practices now
  covers: out-of-order delivery, consumer-side retries, timeouts.
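
The dedup pattern the new criterion looks for (check ID -> process -> store ID -> return success for duplicates) can be sketched as below. The in-memory set and the `handle_delivery` name are assumptions for illustration; a real consumer would use a durable store with an atomic check.

```python
from typing import Callable, Dict, Set


def handle_delivery(delivery_id: str, payload: Dict, seen: Set[str],
                    process: Callable[[Dict], None]) -> int:
    """Process one delivery at most once and return an HTTP-style status."""
    # Check ID: a delivery we already handled is acknowledged, not reprocessed.
    if delivery_id in seen:
        return 200
    # Process: if this raises, the ID is not stored and a retry can succeed.
    process(payload)
    # Store ID only after processing completed.
    seen.add(delivery_id)
    return 200
```

Calling it twice with the same `delivery_id` runs `process` once and returns 200 both times, which is what lets a sender retry safely.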

Cat 4 Event catalog & schema:
- Added "Per-event unique ID" criterion. Scores whether the
  platform delivers a documented per-event unique delivery ID —
  in the payload, in headers (e.g. webhook-id, X-GitHub-Delivery,
  x-outpost-event-id), or equivalent. Distinct from any domain
  ID inside the payload (e.g. post.id is not a delivery ID).
  0: none. 1: ID delivered but docs don't identify it as the dedup
  key. 2: clearly documented as the dedup key.

The two criteria are explicitly linked: Cat 4 scores whether the
ID exists in the schema; Cat 3 scores whether the docs teach how
to use it. A platform can ship the ID (Cat 4 = 1) without
documenting it (Cat 3 = 0) — exactly the pattern surfaced by the
customer audit, where Outpost ships x-outpost-event-id on every
delivery but the customer didn't surface this to their integrators.

Net effect: rubric grows by two criteria. Platforms that document
idempotency at a high level (single mention) but don't identify
the dedup ID will now score 1 instead of 2 on the new Cat 3
criterion, surfacing a specific actionable finding.

HITL is used 16+ times across SKILL.md, rubric.md, methodology.md,
scoring.md, and report-template.md without ever being expanded or
defined. Readers outside AI/ML circles can struggle to parse it.

Expand to "human-in-the-loop (HITL)" on the first occurrence in each
file so the abbreviation has a definition before subsequent uses.
Subsequent uses stay as HITL once defined.
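
The per-file edit amounts to replacing only the first occurrence. A sketch, assuming plain string replacement is enough (no attempt to skip code fences or headings); `expand_first_hitl` is a hypothetical helper:

```python
EXPANDED = "human-in-the-loop (HITL)"


def expand_first_hitl(text: str) -> str:
    """Expand the first "HITL" in a file's text; leave later uses as-is."""
    # Skip text that already contains the expansion so reruns are no-ops.
    if EXPANDED in text:
        return text
    return text.replace("HITL", EXPANDED, 1)
```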

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

"Documentation quality" reads as a sweeping judgment on the
platform's docs, but the 5 criteria the category scores
(verification walkthrough, processing & handler guidance,
idempotency guidance, best-practices coverage, accuracy & freshness)
all measure implementation-guidance content for integrators
consuming webhooks. The event catalog and API reference are scored
separately under Cat 4 "Event catalog & schema".

A platform with a comprehensive event catalog but no handler
patterns or signing walkthroughs scores 0% on Cat 3, which reads
confusingly because their webhook docs do exist. The new name
makes it clear that Cat 3 scores integration-implementation
content specifically.

Sweep applied via replace_all to rubric.md (category list and
section heading), scoring.md (weight table), and report-template.md
(scorecard row). 4 files, 4 lines net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After renaming Cat 3 from "Documentation quality" to "Implementation
guidance", the previous intro ("The webhook section as a developer
reads it, not the marketing page") no longer fit. The contrast with
marketing was meaningful when the name was generic; under the new
name, implementation guidance is obviously not marketing.

The new intro is explicit about scope. Cat 3's 5 criteria are
webhook-specific in practice (HMAC verification, 2xx HTTP handler
patterns, dedup ID delivered with HTTP webhooks). Non-HTTP
destinations (SQS, Pub/Sub, RabbitMQ, etc.) rely on destination-
native SDKs; their integration-guidance equivalents are scored under
Cat 5 (destination-native auth) and Cat 6 (delivery semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cat 2 Onboarding & first event: drop "verified" from "I received a
verified event" since verification (HMAC) is a webhook-only concept.
For non-HTTP destinations, the event is just received, not verified
in the same sense.

Cat 5 Security & authentication: replace the editorial intro
("The capability most often weak and most consequential") with a
scope description that mirrors the rest of the rubric: HTTP webhooks
(signing, replay protection, secret rotation) and non-HTTP
destinations (destination-native auth). Weight note kept.

Cat 7 Setup surfaces: "webhooks" -> "webhooks and event destinations"
to match the audit's full scope (the category's criteria already
cover both).

Cat 11 Local dev: drop the vague "The program calls this out
explicitly" trailing sentence (unclear what "the program" referenced).
Replace with an explicit scope note: criteria focus on HTTP webhooks
(localhost tunnels and replay); non-HTTP destinations rely on
cloud-provider emulators (LocalStack, GCP Pub/Sub emulator, Azure
Service Bus emulator) as equivalents.

The other 8 categories' intros were already clean or were updated
previously (Cat 3 just landed in 933b724 and 37761ff).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Methodology step 3: "Read the webhook docs properly" was webhook-only
in framing but the step's scope covers Categories 3, 4, 5, 6.
Categories 4 (event catalog), 5 (security including non-HTTP
destination auth), and 6 (delivery semantics across all destination
types) cover event destinations beyond webhooks. Broaden the step
title and add destination-type-breadth and per-destination-native-
auth as evidence to capture.

Methodology step 5: "API endpoints for webhook CRUD" and "Terraform
provider and whether it covers webhooks" narrowed Category 7 to
webhooks only. Cat 7's intro now reads "webhooks and event
destinations"; the step now matches: webhook and destination CRUD,
and Terraform coverage of webhooks and destinations.

Scoring Example 1: "Security has 5 criteria" was stale; Cat 5 has
6 criteria (the destination-native-auth criterion was added but
Example 1 was never updated). Examples 2-4 already use 6 criteria.
Example 1 now matches: 6 criteria, score 2/1/1/0/2/2, sum 8, max
12, both roll-ups 67%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Line 10 summary instruction said "the platform's webhook DX" but the
audit's scope is webhooks AND event destinations (SQS, Pub/Sub,
RabbitMQ, EventBridge, Kafka, Azure Event Grid). A literal reader
might omit non-HTTP destination coverage. Broaden to "webhook and
event-destination DX".

Line 60 referenced "(see program-mapping)" without a file extension
or backticks, reading as a placeholder. Line 43 already references
`rubric.md` with backticks; match that pattern: "(see
`program-mapping.md`)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two spots survived the original "Documentation quality" sweep because
they used lowercase or paraphrased forms.

SKILL.md line 45: agent-responsibilities list said "documentation
quality" (lowercase) which the title-case sweep missed. Rename to
"implementation guidance" to match the new category name.

program-mapping.md line 16: "category 3/6 documentation gaps" was
ambiguous after the rename. Replace with "category 3 implementation-
guidance and category 6 delivery-semantics gaps" so both category
references are explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Summary instruction told the writer to summarize the platform's
webhook DX but did not say what counts as a webhook-surface feature.
Audit agents reading the instruction listed positive platform signals
they noticed (OpenAPI specs, MCP servers, CLIs) without distinguishing
which ones actually apply to the webhook and event-destination
surface.

This produced misleading Summaries where, for example, an OpenAPI 3.1
spec without a `webhooks` block was listed as evidence of a working
webhook surface even though the spec does not carry webhook payload
contracts (which scores 1 under Cat 4 for that exact reason). The
customer reads the listed item as a strength, then later finds the
caveat that excludes it.

The instruction now scopes the list: include only items that
contribute to the webhook and event-destination surface. An OpenAPI
spec without a `webhooks` block, an MCP server without webhook tools,
or a CLI that does not manage webhook configuration are platform
features that do not apply in the Summary; they belong in their
respective category findings, with the scores that reflect their
limitations.
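
As an illustration of the scoping rule, a rough check for whether a parsed OpenAPI 3.1 document carries webhook contracts might look like the sketch below. The helper name `has_webhook_contracts` and the sample documents are hypothetical; the only fact relied on is that OpenAPI 3.1 defines a top-level `webhooks` map.

```python
def has_webhook_contracts(openapi_doc: dict) -> bool:
    """True if the parsed OpenAPI document declares at least one webhook."""
    webhooks = openapi_doc.get("webhooks")
    return isinstance(webhooks, dict) and len(webhooks) > 0


# Hypothetical spec with paths only: a platform feature, not a Summary item.
api_only = {"openapi": "3.1.0", "paths": {"/posts": {}}}

# Hypothetical spec that also describes one webhook payload.
with_webhooks = {
    "openapi": "3.1.0",
    "paths": {"/posts": {}},
    "webhooks": {"post.published": {"post": {}}},
}

print(has_webhook_contracts(api_only))       # False
print(has_webhook_contracts(with_webhooks))  # True
```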

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cat 12's Action-layer scoring had three issues:

1. CLI and MCP were scored as separate criteria, requiring each
   surface to exist for full credit. In practice an agent only needs
   one agent-shaped interface beyond the raw API; either suffices.

2. The MCP criterion accepted "an MCP server exists" without
   requiring webhook scope, so a platform-wide MCP that excludes
   webhook management could score 2 even though Cat 12 measures
   webhook agent-readiness. The Ordinal audit hit this tension
   (hosted MCP for the core API but no webhook tools, scored 2).

3. The foundational layer (whether the webhook configuration API
   is publicly callable by an agent) was implicit, scattered across
   Cat 7 API configuration and Cat 4 machine-readable spec. The
   agent-readiness view of the API was not captured as its own
   signal in Cat 12.

The Action layer now has two criteria:

- API access for agents: foundational. Documented public HTTP API
  for webhook configuration. Overlaps with Cat 7 / Cat 4 but
  captures the agent's-eye view distinctly. 0 if dashboard-only or
  undocumented, 1 if SDK-only, 2 if documented HTTP API.

- CLI or MCP for the webhook surface: higher-leverage. CLI or MCP
  (either suffices) covering webhook management with structured
  output / agent-friendly tools. 0 if neither covers webhooks
  (explicitly including platform-wide MCPs without webhook tools),
  1 if partial coverage, 2 if full.
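
The two criteria above can be sketched as scoring functions. This is an illustration only: the function names and the "none" / "partial" / "full" coverage labels are hypothetical simplifications of the rubric text, and the structured-output requirement is folded into the coverage judgment.

```python
COVERAGE_RANK = {"none": 0, "partial": 1, "full": 2}


def score_api_access(documented_http_api: bool, sdk_only: bool) -> int:
    """API access for agents: 2 documented HTTP API, 1 SDK-only, 0 otherwise."""
    if documented_http_api:
        return 2
    if sdk_only:
        return 1
    return 0  # dashboard-only or undocumented


def score_cli_or_mcp(cli_webhook_coverage: str, mcp_webhook_coverage: str) -> int:
    """CLI or MCP for the webhook surface: either suffices, so take the better one.

    Coverage is judged against webhook management only; a platform-wide MCP
    with no webhook tools counts as "none".
    """
    return max(COVERAGE_RANK[cli_webhook_coverage], COVERAGE_RANK[mcp_webhook_coverage])


# A hosted MCP for the core API with no webhook tools, and no CLI:
print(score_cli_or_mcp("none", "none"))  # 0
# A CLI that fully manages webhooks, no MCP:
print(score_cli_or_mcp("full", "none"))  # 2
```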

Methodology step 9 Action sentence updated to walk the new
criteria.

Cat 12 still has 6 criteria total; weight unchanged. Existing
audits that scored MCP at 2 because a platform-wide MCP exists
should re-evaluate under the new combined criterion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audits that only have docs evidence produce conditional recommendations
("in default mode the header is X; in Standard Webhooks mode it's Y").
A single actual delivery payload (request headers + body) lets the
auditor recommend directly: name the specific signature header, dedup
ID, timestamp format, and custom headers that are actually in use.

The Ordinal audit hit this exact case: the audit framed signing
conditionally because HITL had not shared an example delivery. Once
Phil shared a screenshot of a real delivery (Standard Webhooks mode
active; webhook-signature, webhook-timestamp, webhook-id headers
present; x-api-key set via the custom-headers feature), the
recommendation became concrete: "document the webhook-signature
you're already sending" rather than "add a signature scheme".

Two updates to the audit skill:

Roles section: add a "Critical HITL capture: an example delivery
payload" paragraph explaining what to capture and why. Whenever the
human fires a test event or observes a real delivery, they capture
and paste back the full delivery payload (all request headers and
the body) so the auditor can score signing, idempotency, event
schema, and destination-auth criteria against the actual delivery
shape rather than docs alone.

How an audit runs step 3 (the HITL checklist examples): add a third
example to the checklist, phrased as "capture and paste back the
full request payload of one real delivery, including all headers
and the body, so I can name the actual signature header, dedup ID,
and any custom headers in the recommendations".

Future audits should now produce concrete signature/dedupe/auth
recommendations whenever HITL is available, since the checklist
specifically requests the payload capture.
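
A minimal sketch of how a pasted delivery could be summarized for the recommendations. The helper `summarize_delivery`, the transport-header ignore list, and the header values are hypothetical; the three `webhook-*` header names are the ones observed in the captured delivery described above.

```python
# Header names seen in the captured delivery, mapped to the role each plays.
KNOWN_HEADERS = {
    "webhook-signature": "signature",
    "webhook-timestamp": "timestamp",
    "webhook-id": "dedup_id",
}

# Assumption: ordinary transport headers are not worth naming in an audit.
TRANSPORT_HEADERS = {
    "host", "content-type", "content-length", "user-agent",
    "accept", "accept-encoding", "connection",
}


def summarize_delivery(headers: dict[str, str]) -> dict:
    """Name the signature, timestamp, dedup-ID, and custom headers in a delivery."""
    summary = {"signature": None, "timestamp": None, "dedup_id": None, "custom_headers": []}
    for name in headers:
        key = name.lower()  # HTTP header names are case-insensitive
        if key in KNOWN_HEADERS:
            summary[KNOWN_HEADERS[key]] = key
        elif key not in TRANSPORT_HEADERS:
            summary["custom_headers"].append(key)
    summary["custom_headers"].sort()
    return summary


# Placeholder values; only the header names matter here.
captured = {
    "Content-Type": "application/json",
    "Webhook-Id": "<id>",
    "Webhook-Timestamp": "<timestamp>",
    "Webhook-Signature": "<signature>",
    "X-Api-Key": "<redacted>",
}
print(summarize_delivery(captured))
```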

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The audience declaration drives the audit's N/A logic (Table 2 in
the rubric) but the methodology had it as a brief read-and-judge
step with `developer-platform` as a silent default. In practice
agents either took the default without verification or relied on
HITL Pass 2 to set the designation, leading to misframed audits
when the default didn't match reality.

The Ordinal audit hit this: HITL Pass 2 declared no-code-saas
without site verification; the no-code designation triggered the
Cat 11 audience-N/A logic; later correction to mixed required
re-scoring two criteria. A site-verified audience designation at
audit start would have produced the correct framing from Pass 1.

Three updates:

Methodology step 0: explicit checklist of signals to verify the
designation against (hero copy, nav structure, testimonials,
pricing tiers, API prominence, onboarding CTA framing). Requires
citing at least three signals with quoted marketing copy. `mixed`
listed as a first-class option, with guidance to prefer it when
the platform clearly serves more than one audience. The
`developer-platform` default is allowed only as a Pass-1 fallback
when the homepage cannot be reached; Pass 2 must verify.

SKILL.md "Audience matters" paragraph: `mixed` named as one of
three options (not just a fallback). Adds the verification
requirement and points at the methodology checklist. Notes that
mixed audiences score by judgment per criterion.

report-template Audience header: now requires inline citations of
the signals that informed the designation (e.g. "mixed (primary
marketing teams per hero copy 'X'; secondary agencies via 'Y' nav;
tertiary developers via mid-page API mention)"). The bare
designation alone is no longer sufficient.
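
The verification requirement could be checked mechanically along these lines. This is a sketch under assumptions: the `Signal` type and `designation_problems` helper are hypothetical, the quotes are placeholders, and the three audience values are the ones named in this change.

```python
from dataclasses import dataclass

AUDIENCES = {"developer-platform", "no-code-saas", "mixed"}
MIN_SIGNALS = 3  # methodology step 0: cite at least three signals


@dataclass
class Signal:
    source: str  # e.g. "hero copy", "nav structure", "pricing tiers"
    quote: str   # quoted marketing copy backing the signal


def designation_problems(audience: str, signals: list[Signal],
                         homepage_reachable: bool = True) -> list[str]:
    """Return reasons the audience designation is not yet acceptable."""
    problems = []
    if audience not in AUDIENCES:
        problems.append(f"unknown audience: {audience}")
    quoted = [s for s in signals if s.quote.strip()]
    if len(quoted) < MIN_SIGNALS:
        # The default is tolerated only as a Pass-1 fallback when the
        # homepage cannot be reached; Pass 2 must still verify it.
        fallback_ok = audience == "developer-platform" and not homepage_reachable
        if not fallback_ok:
            problems.append(f"need at least {MIN_SIGNALS} quoted signals, got {len(quoted)}")
    return problems


signals = [Signal("hero copy", "X"), Signal("nav structure", "Y"),
           Signal("mid-page API mention", "Z")]
print(designation_problems("mixed", signals))    # []
print(designation_problems("no-code-saas", []))  # one problem reported
```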

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cat 3 "Processing & handler guidance" was scored against a generic
anchor ("covers the handler lifecycle"). Platforms that mention
"respond fast" and "process async" in passing could score 2 without
teaching the actual ingest-verify-queue pattern, naming their
response timeout, or pointing integrators at concrete architectures.

This is the criterion most directly tied to Hookdeck Event Gateway's
value prop (and to cloud-native alternatives like AWS EventBridge +
API Gateway, GCP Pub/Sub + serverless function), but the rubric
didn't surface that connection.

Three updates:

rubric.md Cat 3 Processing & handler guidance: criterion text now
spells out the ingest-verify-queue pattern as the production-traffic
contract integrators need: acknowledge quickly with 2xx, verify the
signature, queue work to a background processor so burst traffic
and slow downstream work do not exceed the timeout. 2-anchor now
requires the platform to (a) name the timeout window, (b) explain
the pattern, and (c) point at concrete reference architectures
(Hookdeck Event Gateway, cloud-native EventBridge+API Gateway or
Pub/Sub+serverless function, or queue+worker on the integrator's
own infrastructure). 1-anchor covers partial coverage.
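
For reference, the ingest-verify-queue pattern can be sketched in a few lines. Everything specific here is an assumption for illustration: the hex HMAC-SHA256 signature scheme, the in-process `queue.Queue`, and the function names. A production integrator would use the platform's documented signature scheme and a durable queue.

```python
import hashlib
import hmac
import queue

# Assumption: an in-memory queue stands in for a durable queue or event gateway.
work_queue: "queue.Queue[bytes]" = queue.Queue()


def handle_delivery(body: bytes, signature_hex: str, secret: bytes) -> int:
    """Verify, enqueue, and acknowledge; return the HTTP status to send."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison of the expected and received signatures.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401
    work_queue.put(body)  # defer slow work so the response beats the timeout
    return 202            # 2xx acknowledgement, sent before any processing


def process_next(handler) -> None:
    """Background worker step: take one queued delivery and process it."""
    body = work_queue.get()
    try:
        handler(body)
    finally:
        work_queue.task_done()


secret = b"example-secret"  # hypothetical shared secret
body = b'{"type": "example.event"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(handle_delivery(body, signature, secret))  # 202
process_next(lambda payload: print("processed", len(payload), "bytes"))
```

The handler does no downstream work before responding, so burst traffic only grows the queue rather than the response time.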

methodology.md step 3 (Read the webhook docs): explicit prompt to
look for the response timeout window, the ingest-verify-queue
pattern, and architecture references. Tactics search-term list adds
"timeout", "respond", "async", "queue", "EventBridge", "Pub/Sub",
"Event Gateway", "ingest".

program-mapping.md: new row mapping the ingest-at-scale gap to
Hookdeck Event Gateway as the integrator's ingest layer (or cloud-
native alternatives EventBridge+API Gateway, Pub/Sub+serverless
function). Distinguished from the existing endpoint-health row:
that one is about reliability for an existing handler, this one is
about teaching integrators the pattern itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pattern matters at any volume, not just at scale. A 5-second
timeout kills a delivery whether the integrator is handling 1 req/sec
or 1000. 'Reliably ingest' captures the goal (don't time out, don't
lose deliveries) better than 'at scale', which implies high volume
specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the v2 pass: migrate audit format to structured YAML
(primary driver: cloud agent + public website for URL-submitted
audits), consolidate v1's rubric and methodology learnings, and
preserve Ordinal's HITL Pass 2 evidence so it does not need to be
re-collected.

Eight phases sketched: schema design, consolidate v1 learnings,
migrate audit template to YAML, update SKILL.md and methodology,
preserve and port Ordinal HITL evidence, decide downstream
backwards-compat path, re-run Ordinal under v2, cascade to
downstream skill.

Includes a complete inventory of HITL-derived facts to carry
forward (active usage observations, signing and delivery shape
from the captured payload, audience verification, scoring
decisions). Cross-check this list at every phase boundary.

Plan is for review before execution; commit per phase once
execution starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per direction: upstream skill emits YAML only. No Markdown audit
output, no renderer, no Markdown template, no transitional
backwards-compat phase. The downstream outpost-customer-audit-report
skill is updated in lockstep to consume YAML.

Changes from the previous PLAN-v2 draft:

- Target layout drops renderers/ and assets/report-template.md
- Phase 2 simplifies: produce assets/report-template.yaml and delete
  the Markdown template; no renderer
- Phase 3 SKILL.md update: explicit YAML-only output
- Phase 5 (was "decide on backwards-compatibility") removed; no
  decision to make - downstream cascades in lockstep
- Phase 5 (new, was Phase 6) re-runs Ordinal; produces audit.yaml
  only; v1 audit.md gets archived to customers/ordinal/archive/
- Phase 6 (new, was Phase 7) cascades to downstream skill in
  hookdeck-skills-internal: input becomes YAML, customer report
  stays Markdown (still the customer-facing deliverable)

Customer report format kept as Markdown for now since it is the
customer-facing artifact. Open question for review: if the cloud
agent's website ends up rendering customer reports as well, that
decision can flip to YAML in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leggetter and others added 2 commits June 4, 2026 11:57
Customer report stays Markdown. The cloud agent has no current plan
to render customer reports; the customer-facing artifact is sent or
shared as a file. Decision settled, not an open question.

Open Question 3 ("Customer report format") removed from the open
list and added to a new "Resolved decisions" section at the top,
which records the choices that v2 execution should not relitigate
(upstream YAML-only, customer Markdown, downstream lockstep).

Remaining open questions renumbered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndations

Three additions to make PLAN-v2 self-sufficient for a fresh agent
picking up the v2 work cold:

Phase 1 consolidation list: each item now has the v1 commit hash
inline so the rationale is one git show away. Editorial qualifier
rules also annotated as downstream-only with a pointer at the
internal repo's methodology.

Schema sketch (illustrative): inline YAML showing the rough shape of
audit.yaml and hitl-evidence.yaml. Field names, nesting, status
enums, and scoring decision records all present. Marked as a
starting point that Phase 0 refines against the schema linter; not
authoritative.

Open question recommendations: each open question now has a
"Recommendation:" line so a fresh agent has a default to push back
against rather than picking from scratch (schema tooling, YAML lib
and lint config, cloud-agent field reservation, archive location;
re-audit timing already had one). Open questions remain genuine
questions; the recommendations are starting points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>