feat: add webhook-dx-audit skill (webhooks and event destinations) #67
Open
leggetter wants to merge 31 commits into
Adds a meta/audit skill that reviews the developer experience of any platform sending outbound webhooks and produces a scored review with prioritized recommendations across signing, retries, event catalog, observability, local dev, and agent readiness. This is a different category from the existing repo skills (which receive, send, or verify webhooks): the audit evaluates how other platforms expose their webhook DX. README adds a new "Webhook DX & Audit Skills" section to make the distinction clear, and the bundle includes the new skill.
The skill audits webhooks AND event destinations (SQS, RabbitMQ, Pub/Sub, EventBridge, Kafka), not just webhooks. Surface the distinction in the README section title and marketplace metadata, and add an event-destinations keyword for discovery.
The scope is webhooks AND event destinations, not just webhooks. Industry terminology is shifting (Stripe "event destinations" with direct EventBridge/Event Grid delivery, Shopify "Event Subscriptions"). Update the rubric and SKILL.md to assess the broader concept and benchmark against the Event Destinations initiative (https://eventdestinations.org).
Changes:
- SKILL.md: add scope paragraph referencing the spec and the terminology shift; clarify that the audit applies regardless of what the platform calls it.
- rubric.md intro: add the spec reference and a summary of its required/recommended capabilities.
- rubric category 5 (Security & authentication): split into webhook-specific signing criteria and a new "Destination-native auth" criterion covering IAM, service accounts, managed identities, and SASL/mTLS for non-HTTP destinations. Webhook criteria become Not assessed for queue-only platforms.
- rubric category 6 (Delivery semantics & reliability): add a "Destination type breadth" criterion, the spec's central required capability.
- rubric category 8 (SDKs & verification libraries): clarify that this stays webhook-focused because webhooks remain the most common destination and hand-rolled HMAC is where integrators get burned; non-HTTP destinations use native SDK auth and are scored under category 5.
- methodology.md: add a step 0 to identify destination types up front, since that determines which criteria apply.
…t run
Test-drove the skill against Stripe (Pass 1, public surface only; 84/B). Stripe's three-destination story (webhooks + EventBridge + Event Grid) surfaced anchor ambiguities and methodology gaps. This commit applies all 22 ranked fixes from that test, batched because they cohere and each is small.
rubric.md (10 fixes):
- Cat 4: machine-readable spec anchors name OpenAPI 3.1 webhooks block, AsyncAPI, and per-event JSON Schema explicitly; a single polymorphic event envelope scores 1, not 2.
- Cat 5: add explicit "if both webhooks AND native destinations are offered, score all six criteria" rule at the top.
- Cat 5 destination-auth-options: clarify it scores configurable bearer/headers/OAuth2/mTLS independently of the signature scheme.
- Cat 6 failure handling & auto-disable: split anchors into the two distinct gaps (post-retry behavior docs, auto-disable feature with reactivation).
- Cat 6 failure alerting: limit scoring to push channels; dashboard widgets count under cat 10 (observability) instead.
- Cat 6 manual replay: anchors recognize partial coverage (sandbox-only, UI-only, CLI-only) at level 1.
- Cat 7 IaC: split community-vs-official cliff; community provider with current coverage can score 1, vendor-maintained scores 2.
- Cat 10 latency: spell out the three signals (attempt count, next-retry time, per-attempt response latency).
- Cat 12 push-to-agent: add a 1 anchor for partial coverage.
- Cat 12 MCP: 1 anchor names "agent SDK or function-calling toolkit"; 2 requires MCP or a deliberate scoped surface.
methodology.md (5 fixes):
- Read-what-a-human-reads: distinguish evidence collection (any source) from scoring (HTML page).
- Step 0 destination types: expand search-term list (endpoint, partner event source, stream, etc.).
- Step 2 specs: name OpenAPI 3.1 webhooks block, AsyncAPI, per-event JSON Schema as the three things to look for.
- Step 9 agent readiness: define "scoped sensibly" for llms.txt.
- What good looks like: handle calibration circularity when the audit subject is itself a reference platform (calibrate against the broader Event Destinations bar instead).
SKILL.md (2 fixes):
- Add evidence-vs-scoring distinction for .md exports.
- Document the Pass-1-only exit path: skip the human checklist, mark gated criteria Not assessed, proceed to scoring.
scoring.md (2 fixes):
- Add a second worked example with a Not-Assessed exclusion.
- Add the renormalization formula and a worked example for when a category is fully dropped.
report-template.md (3 fixes):
- Access field examples signal Pass 1 vs Pass 1+2.
- Caption under the scorecard clarifies Overall is weight-adjusted.
- Findings section: always list every criterion, mark unreached ones Not assessed inline.
program-mapping.md (3 fixes):
- New row: endpoint health (auto-disable, alerting, reactivation) -> Hookdeck Event Gateway in front of the consumer endpoint. Addresses the highest-impact cat 6 gap most platforms have.
- New row: OpenAPI lacks webhooks block -> webhook skill in hookdeck/webhook-skills as an agent-shaped substitute.
- Broaden the existing webhook-skill row to acknowledge it can also surface cat 3/6 docs gaps to agent consumers, not just cat 12.
Test artifacts (Stripe audit + findings doc) saved in /tmp; not committed.
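The renormalization mentioned for scoring.md (dropping a category entirely and keeping Overall weight-adjusted) can be pictured as a weighted average taken over only the categories that remain. The sketch below assumes that reading; the function name, category names, and weights are hypothetical placeholders and are not taken from the skill's actual weight table.

```python
def overall_score(category_pct, weights):
    """Weighted average of category percentages, renormalized over scored categories.

    category_pct maps a category name to its percentage (0-100), or to None
    when the whole category was dropped from the audit.
    weights maps a category name to its weight (illustrative values only).
    """
    scored = {name: pct for name, pct in category_pct.items() if pct is not None}
    total_weight = sum(weights[name] for name in scored)
    if total_weight == 0:
        return None  # no category could be scored
    weighted_sum = sum(weights[name] * pct for name, pct in scored.items())
    return weighted_sum / total_weight


# Hypothetical three-category rubric in which "sdks" was fully dropped.
weights = {"security": 20, "delivery": 20, "sdks": 10}
percentages = {"security": 50.0, "delivery": 75.0, "sdks": None}
print(overall_score(percentages, weights))  # 62.5
```

Dividing by the sum of the remaining weights, rather than by the full weight total, keeps a dropped category from pulling the overall number toward zero.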
Investigation of Stripe, Shopify, and Paddle revealed a real maturity differentiator the rubric was not capturing: - Paddle ships named "Scenarios" (subscription_creation = 12 events, renewal = 7, etc.) that fire curated lifecycle sequences in one trigger. - Stripe has implicit prerequisite chaining (firing payment_intent.succeeded also fires payment_intent.created) plus CLI fixtures for scripted multi-step composition. - Shopify's webhook trigger is explicitly single-event only with fixed payload, recommending real Shopify actions for end-to-end tests. Add a "Workflow / scenario simulation" criterion under category 11 (local dev / testing) as a sibling to test/sandbox parity. Framed as a maturity differentiator, not a baseline: 0 is acceptable for most platforms; 1 covers Stripe-style fixtures or implicit chains; 2 covers Paddle-style named lifecycle scenarios. Update methodology step 8 with search terms (scenario, fixture, lifecycle, workflow, trigger sequence) and the three platform patterns as calibration anchors.
The rubric was collapsing three different states into one "Not assessed" label, which caused Pass-1-only grades to inflate (the Stripe re-audit went from 84/B to 85/A largely because of this). Split the states and adjust the math so the labels mean what they say.
Three states (rubric.md):
- Not Supported: capability should exist but doesn't. Score 0; numerator 0, full weight in denominator. (Existing 0 behavior; the label clarifies intent in evidence.)
- Not Applicable: a logical rule excludes the criterion *as a concept* (e.g. Cat 5 destination-native auth on a webhook-only platform: there are no non-HTTP destinations to score auth for). Drop from both numerator and denominator. Critically NOT for "the platform should have this but doesn't" cases; those are Not Supported = 0.
- Not Assessed: should assess but cannot reach right now (HITL gap, gated dashboard). Treated differently across the two roll-ups below; signals HITL would lift the score.
Two roll-ups from the same per-criterion data (scoring.md):
- Public-scope grade. "How good are the parts we could see?" Drops both Not Applicable and Not Assessed from numerator and denominator. Honest score over what was reachable.
- Provisional minimum. "What's the floor if HITL never runs?" Drops Not Applicable only. Treats Not Assessed as 0 in numerator with full weight in denominator. HITL Pass 2 can only raise this number.
When HITL completes (no Not Assessed criteria remain), the two scores converge on a single final grade.
Report template (assets/report-template.md):
- Scorecard now shows both columns per category and overall.
- Header shows the headline number twice: Public-scope leads when no HITL is planned; Provisional minimum leads when HITL is planned (conservative bound the customer can rely on).
- Coverage line under the scorecard counts how many criteria landed in each state.
- Optional "Context" line in the frontmatter for audits that are existing-customer deliverables.
- Recommendations template encourages "Concrete change (platform side) / Hookdeck offering (already available or in path)" framing for existing customers.
Cat 5 (rubric.md):
- Header now spells out three branches (webhook-only, non-HTTP-only, multi-destination). Per-criterion N/A clauses encode the logical rules. Stripe and Shopify both cited as multi-destination examples (Shopify ships HTTP + EventBridge + Pub/Sub destinations).
Cat 12 CLI for agents:
- Was incorrectly allowed an N/A escape hatch in the prior draft. Reverted: a CLI is a recommended capability for any developer platform; absence is a gap (Not Supported = 0), not a logical exclusion. The 0 anchor language now makes this explicit.
Other updates:
- SKILL.md scope paragraph reflects Shopify is multi-destination, not webhook-only. Adds a one-paragraph summary of the three states + two roll-ups.
- methodology.md adds a "Pick the right label" note explaining when to use each state and why the arithmetic differs.
No customer-specific content; this work is generic to the skill.
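The arithmetic behind the two roll-ups can be sketched as follows. This is a minimal illustration of the rules as described in this commit message, assuming every criterion carries equal weight and a maximum score of 2; the `State` enum and `roll_ups` function are hypothetical names, not code from the skill.

```python
from enum import Enum

MAX_SCORE = 2  # each criterion is scored 0, 1, or 2


class State(Enum):
    SCORED = "scored"
    NOT_SUPPORTED = "not supported"
    NOT_APPLICABLE = "not applicable"
    NOT_ASSESSED = "not assessed"


def roll_ups(criteria):
    """Return (public_scope_pct, provisional_minimum_pct) for (state, score) pairs."""
    public_num = public_den = minimum_num = minimum_den = 0
    for state, score in criteria:
        if state is State.NOT_APPLICABLE:
            continue  # dropped from both roll-ups
        if state is State.NOT_ASSESSED:
            minimum_den += MAX_SCORE  # counts as 0 in the provisional minimum only
            continue
        points = 0 if state is State.NOT_SUPPORTED else score
        public_num += points
        public_den += MAX_SCORE
        minimum_num += points
        minimum_den += MAX_SCORE

    def pct(num, den):
        return round(100 * num / den) if den else None

    return pct(public_num, public_den), pct(minimum_num, minimum_den)


example = [
    (State.SCORED, 2),
    (State.SCORED, 1),
    (State.NOT_SUPPORTED, 0),
    (State.NOT_APPLICABLE, None),
    (State.NOT_ASSESSED, None),
    (State.NOT_ASSESSED, None),
]
print(roll_ups(example))  # (50, 30)
```

In the example, the public-scope grade is 3 of 6 reachable points, while the provisional minimum is the same 3 points over 10 because the two Not Assessed criteria stay in the denominator. Once no Not Assessed entries remain, both values come out equal.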
Add an explicit N/A logic table after the Categories list. Apply mechanically based on the destination types identified at methodology step 0; do not re-derive N/A from per-criterion text. The table lists the four possible step-0 facts and which criteria become N/A for each. Currently seven criteria (Cat 5 x 6 + Cat 8 x 1) can be N/A, all driven by two boolean facts (offers webhooks? offers non-HTTP destinations?). To make the table the single source of truth: - Stripped the redundant per-criterion "(Not Applicable if X)" clauses from Cat 5 (5 criteria) and Cat 8 (1 criterion). - Trimmed the Cat 5 header: removed the three-branch list (webhook-only / non-HTTP-only / multi-destination) since that's now encoded in the table. Kept the security-philosophy paragraph because it explains why the criteria differ by destination type. - Cat 8 verification helper 0 anchor reframed to acknowledge the upstream Cat 5 dependency (no signature scheme to verify -> the helper question is downstream of that gap). New criteria with N/A conditions should add a row to the table rather than introducing a new inline clause. Comments in the commit message of any future change should reference the table row affected.
… Assessed Add a deterministic table tagging which criteria require account-level (L1) or active-usage (L2) access to score, alongside the existing N/A logic table. Pass-1 audits at L0 (public only) now have a mechanical rule for which criteria become Not Assessed; the agent does not have to derive it per-criterion. Three access levels (rubric.md): - L0: public docs, SDK source, machine specs, llms.txt - L1: logged-in session; can read dashboard, settings, account-gated docs - L2: L1 plus at least one delivered event observed; delivery logs, retries, alerting visible in practice How L1 or L2 was obtained does not matter to the rubric. The auditor may have signed up themselves, used agent-driven signup (e.g. Stripe Projects, https://projects.dev), or been given access by the platform's operator. Future-proof: as agent-signup capabilities mature, more audits can declare L1/L2 without changing the rubric. ~12 criteria are tagged with required access levels (mostly Cat 1, 2, 7, 9, 10, 11). A few are "L1 or L0 if docs are thorough enough" - those remain agent judgment within a tighter frame. Other changes: - Report template's Access line dropped the "customer-provided access" wording (that was an Outpost-audit context leak) and now uses the L0/L1/L2 levels directly. A note clarifies that the means of obtaining access does not matter, only the level reached. - methodology.md "What good looks like" adds Stripe Projects (projects.dev) as an agent-driven provisioning calibration anchor for Cat 12 Action-layer scoring. This makes Not Assessed deterministic at the level the framework can reasonably enforce. The remaining agent judgment is limited to the few criteria explicitly tagged "L1 or L0 if ..." in the table.
…n anchors Hookdeck Outpost and Svix are webhook delivery products platforms use to send events. Naming them as calibration anchors for sender-DX scoring was the wrong reference frame: integrators typically experience the *platform* (its docs, signing scheme, dashboard), not the delivery infrastructure embedded behind it. And this skill lives in hookdeck/webhook-skills, so naming Hookdeck specifically as a benchmark would be a conflict of interest. Use platforms integrators directly experience and benchmark against: Stripe as the primary anchor; SendGrid (ECDSA signing), GitHub (event taxonomy), Twilio (per-attempt status callbacks) for specific features. The Event Destinations initiative (eventdestinations.org) sets the broader floor. Hookdeck Outpost stays in program-mapping.md as a gap-closing recommendation for the platform side. Hookdeck Event Gateway tools stay in the "Hookdeck tooling" section of methodology.md as evidence-gathering aids during the audit (Console test URLs for inspecting payloads, CLI for receiving on localhost) - those are ingestion tools for the auditor, not benchmarks.
…m; modern docs platforms Three small refinements surfaced by the customer audit re-run. A1: rubric.md access-level table now explicitly authorizes scoring from L0 absence-of-documentation. If public docs are completely silent on a capability tagged L1 or L2, score 0 (Not Supported) from L0 rather than Not Assessed. The access-level requirement is for VERIFICATION of a documented capability; confirming non-existence is an L0 finding. Removes the only judgment call I had to make by interpretation during the re-run. A2: report-template.md scorecard now surfaces "HITL headroom: NN points" prominently between the table and the renormalization caption. Small headroom means HITL won't materially change the grade; large headroom means HITL is load-bearing. Easier to see than the gap in the dual-score columns. A3: Cat 12 push-to-agent criterion now defaults to Not Assessed (not 0) for docs hosted on modern platforms (Mintlify 2025+, Docusaurus 3+, GitBook, ReadMe) where Copy-as-Markdown and Open-in-X are typically JS-rendered. A non-browser fetch may not see the buttons; the right call is to defer to HITL rather than score 0 from rendering blindness. The first customer audit had to interpret all three of these rules; the framework now encodes them.
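Rule A1 combines with the L0/L1/L2 access levels into a small decision procedure. The sketch below is one way to express it, assuming each criterion is tagged with a required access level; `Access` and `pre_score_label` are hypothetical names, and the A3 exception for JS-rendered docs is deliberately left out.

```python
from enum import IntEnum


class Access(IntEnum):
    L0 = 0  # public docs, SDK source, machine specs, llms.txt
    L1 = 1  # logged-in session
    L2 = 2  # L1 plus at least one delivered event observed


def pre_score_label(required, reached, publicly_documented):
    """Decide how to treat a criterion before scoring it.

    Returns "score" when the auditor has enough access, "Not Supported" when
    public docs are silent on a gated capability (scored 0 from L0), and
    "Not Assessed" when the capability is documented but cannot be verified.
    """
    if reached >= required:
        return "score"
    if not publicly_documented:
        return "Not Supported"
    return "Not Assessed"


print(pre_score_label(Access.L2, Access.L0, publicly_documented=False))  # Not Supported
print(pre_score_label(Access.L1, Access.L0, publicly_documented=True))   # Not Assessed
```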
All surfaced by the HITL Pass 2 of a real customer audit. Each addresses a real ambiguity or editorial leak in the rubric.
scoring.md grade bands:
- Dropped the editorial "Reading" column entirely. Grade letters alone; the "band is a headline, not the point" note already carried the framing. Per audit feedback that "painful or risky"-style language doesn't belong in audit output.
- Added explicit "do not write qualitative judgments of the grade into the audit report" line.
- Added boundary-zone note for 28-32 (F/D) and 83-87 (B/A); these are sanity-check zones where rounding shifts the band.
Cat 4 payload shape guidance:
- Relaxed the 2 anchor. Was "explicit thin-vs-fat rationale, OR standard envelope like CloudEvents". Now "envelope is consistent across all event types and documented". CloudEvents alignment and thin-vs-fat rationale moved to bonus signals worth citing in evidence but not required for 2. Most platforms with strong event catalogs don't formally address the meta-framing; the prior anchor over-penalized them.
Cat 1 free/test access:
- Reworked anchors to handle two underlying questions (does free tier reach config? are test deliveries free?) as a sliding scale. 1 covers the partial case (e.g. paid plan required for config but test deliveries free once configured, the audited customer's shape) which the prior binary 0/2 anchor missed.
Cat 5 destination auth options:
- Requires auth framing in docs for any score above 0. A platform shipping an arbitrary header passthrough field without documenting it as an auth mechanism now correctly scores 0; previously a strict reading allowed 1 for mere field existence.
Audience scoping (new N/A logic Table 2):
- Two audiences: developer-platform (default, where integrators are software engineers) and no-code-saas (where integrators are power users in a UI). For no-code-saas, Cat 7 IaC and Cat 11 workflow simulation and local-to-production transition become N/A. The third option "mixed" defaults to scoring all criteria unless the platform clearly serves one exclusively.
- Audience declared at methodology step 0 and in report frontmatter alongside Access level.
- Existing destination-type N/A logic becomes Table 1; audience becomes Table 2.
Methodology audit voice guidance:
- Explicit "stay factual, no editorial" rule. Per-category prose describes observation; reactions and synthesis go in the summary and recommendations. Examples cited: don't use "surprising", "impressive", "disappointing", "painful" in per-criterion or per-category text.
These changes are derived from real auditor experience; the customer audit itself stays at /tmp (not committed).
…ess model
The previous Cat 1 "Free/test access" criterion conflated two distinct concerns: (a) business model (is the platform/feature free to access), and (b) DX (can you test webhooks without producing real production activity). Per repeated audit feedback, business model is not a DX question and shouldn't penalize platforms that offer webhooks behind a paid plan. The testability question is already covered by Cat 2 Test event / trigger and Cat 11 Test / sandbox parity.
Cat 1 changes:
- Removed "Free/test access" criterion entirely.
- Reframed "Signup friction to webhook config" as "In-product discoverability of webhook configuration". Explicitly handles plan-gating: plan-gated features are fine as long as the configuration surface is visible in product navigation.
- Findability of webhook docs criterion now distinguishes deep-nav (1) from top-level (2) explicitly.
Cat 1 now has 2 criteria, both focused on discoverability:
1. Can a developer find the webhook docs from the top-level docs or product nav? (the "docs side" of discovery)
2. From a signed-in account on any tier, can a user discover that the platform offers webhooks and find where they would be configured? (the "in-product side" of discovery)
Also added a note clarifying what is NOT scored in Cat 1: pre-purchase evaluation (business model) and production-data isolation (covered in Cat 2 and Cat 11). Updated the access-level table to reflect the criterion name change and removal. Total rubric criterion count drops by one.
Derived from real auditor experience on a paid-plan-gated platform where the prior rubric incorrectly penalized the business model.
Cat 3 Documentation quality: - Added "Idempotency guidance" criterion. Scores whether the docs (a) identify the unique delivery ID developers should dedupe on (a top-level event ID in the payload, a webhook-id-style header, or equivalent), and (b) explain the high-level dedup pattern (check ID -> process -> store ID -> return success for duplicates). 0/1/2. - Removed "idempotency" from the Best-practices coverage anchor list since it now has its own criterion. Best-practices now covers: out-of-order delivery, consumer-side retries, timeouts. Cat 4 Event catalog & schema: - Added "Per-event unique ID" criterion. Scores whether the platform delivers a documented per-event unique delivery ID — in the payload, in headers (e.g. webhook-id, X-GitHub-Delivery, x-outpost-event-id), or equivalent. Distinct from any domain ID inside the payload (e.g. post.id is not a delivery ID). 0: none. 1: ID delivered but docs don't identify it as the dedup key. 2: clearly documented as the dedup key. The two criteria are explicitly linked: Cat 4 scores whether the ID exists in the schema; Cat 3 scores whether the docs teach how to use it. A platform can ship the ID (Cat 4 = 1) without documenting it (Cat 3 = 0) — exactly the pattern surfaced by the customer audit, where Outpost ships x-outpost-event-id on every delivery but the customer didn't surface this to their integrators. Net effect: rubric grows by two criteria. Platforms that document idempotency at a high level (signal mention) but don't identify the dedup ID will now score 1 instead of 2 on the new Cat 3 criterion, surfacing a specific actionable finding.
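The dedup pattern that the new Cat 3 criterion looks for (check ID -> process -> store ID -> return success for duplicates) can be sketched as below. This is an illustration only: `DedupingHandler` is a hypothetical name, the in-memory set stands in for whatever durable store an integrator would use, and `delivery_id` is assumed to come from the payload's event ID or a header such as webhook-id.

```python
class DedupingHandler:
    """Check ID -> process -> store ID -> return success for duplicates."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # in-memory stand-in for a durable store

    def handle(self, delivery_id, payload):
        if delivery_id in self._seen:
            return 200  # duplicate: acknowledge without reprocessing
        self._process(payload)
        self._seen.add(delivery_id)  # stored only after processing succeeds
        return 200


processed = []
handler = DedupingHandler(processed.append)
handler.handle("evt_1", {"type": "post.created"})
handler.handle("evt_1", {"type": "post.created"})  # redelivery of the same event
print(len(processed))  # 1
```

Storing the ID only after processing succeeds means a failed attempt can be retried by the sender; a domain ID inside the payload (such as post.id) would not work as the key because distinct events can share it.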
HITL is used 16+ times across SKILL.md, rubric.md, methodology.md, scoring.md, and report-template.md without ever being expanded or defined. Readers outside AI/ML circles can struggle to parse it. Expand to "human-in-the-loop (HITL)" on the first occurrence in each file so the abbreviation has a definition before subsequent uses. Subsequent uses stay as HITL once defined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"Documentation quality" reads as a sweeping judgment on the platform's docs, but the 5 criteria the category scores (verification walkthrough, processing & handler guidance, idempotency guidance, best-practices coverage, accuracy & freshness) all measure implementation-guidance content for integrators consuming webhooks. The event catalog and API reference are scored separately under Cat 4 "Event catalog & schema". A platform with a comprehensive event catalog but no handler patterns or signing walkthroughs scores 0% on Cat 3, which reads confusingly because their webhook docs do exist. The new name makes it clear that Cat 3 scores integration-implementation content specifically. Sweep applied via replace_all to rubric.md (category list and section heading), scoring.md (weight table), and report-template.md (scorecard row). 4 files, 4 lines net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After renaming Cat 3 from "Documentation quality" to "Implementation
guidance", the previous intro ("The webhook section as a developer
reads it, not the marketing page") no longer fit. The contrast with
marketing was meaningful when the name was generic; under the new
name, implementation guidance is obviously not marketing.
The new intro is explicit about scope. Cat 3's 5 criteria are
webhook-specific in practice (HMAC verification, 2xx HTTP handler
patterns, dedup ID delivered with HTTP webhooks). Non-HTTP
destinations (SQS, Pub/Sub, RabbitMQ, etc.) rely on destination-
native SDKs; their integration-guidance equivalents are scored under
Cat 5 (destination-native auth) and Cat 6 (delivery semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cat 2 Onboarding & first event: drop "verified" from "I received a
verified event" since verification (HMAC) is a webhook-only concept.
For non-HTTP destinations, the event is just received, not verified
in the same sense.
Cat 5 Security & authentication: replace the editorial intro
("The capability most often weak and most consequential") with a
scope description that mirrors the rest of the rubric: HTTP webhooks
(signing, replay protection, secret rotation) and non-HTTP
destinations (destination-native auth). Weight note kept.
Cat 7 Setup surfaces: "webhooks" -> "webhooks and event destinations"
to match the audit's full scope (the category's criteria already
cover both).
Cat 11 Local dev: drop the vague "The program calls this out
explicitly" trailing sentence (unclear what "the program" referenced).
Replace with an explicit scope note: criteria focus on HTTP webhooks
(localhost tunnels and replay); non-HTTP destinations rely on
cloud-provider emulators (LocalStack, GCP Pub/Sub emulator, Azure
Service Bus emulator) as equivalents.
The other 8 categories' intros were already clean or were updated
previously (Cat 3 just landed in 933b724 and 37761ff).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Methodology step 3: "Read the webhook docs properly" was webhook-only in framing but the step's scope covers Categories 3, 4, 5, 6. Categories 4 (event catalog), 5 (security including non-HTTP destination auth), and 6 (delivery semantics across all destination types) cover event destinations beyond webhooks. Broaden the step title and add destination-type-breadth and per-destination-native-auth as evidence to capture.
Methodology step 5: "API endpoints for webhook CRUD" and "Terraform provider and whether it covers webhooks" narrowed Category 7 to webhooks only. Cat 7's intro now reads "webhooks and event destinations"; the step now matches: webhook and destination CRUD, and Terraform coverage of webhooks and destinations.
Scoring Example 1: "Security has 5 criteria" was stale; Cat 5 has 6 criteria (the destination-native-auth criterion was added but Example 1 was never updated). Examples 2-4 already use 6 criteria. Example 1 now matches: 6 criteria, score 2/1/1/0/2/2, sum 8, max 12, both roll-ups 67%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Line 10 summary instruction said "the platform's webhook DX" but the audit's scope is webhooks AND event destinations (SQS, Pub/Sub, RabbitMQ, EventBridge, Kafka, Azure Event Grid). A literal reader might omit non-HTTP destination coverage. Broaden to "webhook and event-destination DX". Line 60 referenced "(see program-mapping)" without a file extension or backticks, reading as a placeholder. Line 43 already references `rubric.md` with backticks; match that pattern: "(see `program-mapping.md`)". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two spots survived the original "Documentation quality" sweep because they used lowercase or paraphrased forms.
SKILL.md line 45: agent-responsibilities list said "documentation quality" (lowercase) which the title-case sweep missed. Rename to "implementation guidance" to match the new category name.
program-mapping.md line 16: "category 3/6 documentation gaps" was ambiguous after the rename. Replace with "category 3 implementation-guidance and category 6 delivery-semantics gaps" so both category references are explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Summary instruction told the writer to summarize the platform's webhook DX but did not say what counts as a webhook-surface feature. Audit agents reading the instruction listed positive platform signals they noticed (OpenAPI specs, MCP servers, CLIs) without distinguishing which ones actually apply to the webhook and event-destination surface. This produced misleading Summaries where, for example, an OpenAPI 3.1 spec without a `webhooks` block was listed as evidence of a working webhook surface even though the spec does not carry webhook payload contracts (which scores 1 under Cat 4 for that exact reason). The customer reads the listed item as a strength, then later finds the caveat that excludes it. The instruction now scopes the list: include only items that contribute to the webhook and event-destination surface. An OpenAPI spec without a `webhooks` block, an MCP server without webhook tools, or a CLI that does not manage webhook configuration are platform features that do not apply in the Summary; they belong in their respective category findings, with the scores that reflect their limitations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cat 12's Action-layer scoring had three issues:
1. CLI and MCP were scored as separate criteria, requiring each surface to exist for full credit. In practice an agent only needs one agent-shaped interface beyond the raw API; either suffices.
2. The MCP criterion accepted "an MCP server exists" without requiring webhook scope, so a platform-wide MCP that excludes webhook management could score 2 even though Cat 12 measures webhook agent-readiness. The Ordinal audit hit this tension (hosted MCP for the core API but no webhook tools, scored 2).
3. The foundational layer (whether the webhook configuration API is publicly callable by an agent) was implicit, scattered across Cat 7 API configuration and Cat 4 machine-readable spec. The agent-readiness view of the API was not captured as its own signal in Cat 12.
The Action layer now has two criteria:
- API access for agents: foundational. Documented public HTTP API for webhook configuration. Overlaps with Cat 7 / Cat 4 but captures the agent's-eye view distinctly. 0 if dashboard-only or undocumented, 1 if SDK-only, 2 if documented HTTP API.
- CLI or MCP for the webhook surface: higher-leverage. CLI or MCP (either suffices) covering webhook management with structured output / agent-friendly tools. 0 if neither covers webhooks (explicitly including platform-wide MCPs without webhook tools), 1 if partial coverage, 2 if full.
Methodology step 9 Action sentence updated to walk the new criteria. Cat 12 still has 6 criteria total; weight unchanged.
Existing audits that scored MCP at 2 because a platform-wide MCP exists should re-evaluate under the new combined criterion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audits that only have docs evidence produce conditional recommendations
("in default mode the header is X; in Standard Webhooks mode it's Y").
A single actual delivery payload (request headers + body) lets the
auditor recommend directly: name the specific signature header, dedup
ID, timestamp format, and custom headers that are actually in use.
The Ordinal audit hit this exact case: the audit framed signing
conditionally because HITL had not shared an example delivery. Once
Phil shared a screenshot of a real delivery (Standard Webhooks mode
active; webhook-signature, webhook-timestamp, webhook-id headers
present; x-api-key set via the custom-headers feature), the
recommendation became concrete: "document the webhook-signature
you're already sending" rather than "add a signature scheme".
Two updates to the audit skill:
Roles section: add a "Critical HITL capture: an example delivery
payload" paragraph explaining what to capture and why. Whenever the
human fires a test event or observes a real delivery, they capture
and paste back the full delivery payload (all request headers and
the body) so the auditor can score signing, idempotency, event
schema, and destination-auth criteria against the actual delivery
shape rather than docs alone.
How an audit runs step 3 (the HITL checklist examples): add a third
example to the checklist, phrased as "capture and paste back the
full request payload of one real delivery, including all headers
and the body, so I can name the actual signature header, dedup ID,
and any custom headers in the recommendations".
Future audits should now produce concrete signature/dedupe/auth
recommendations whenever HITL is available, since the checklist
specifically requests the payload capture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The audience declaration drives the audit's N/A logic (Table 2 in the rubric) but the methodology had it as a brief read-and-judge step with `developer-platform` as a silent default. In practice agents either took the default without verification or relied on HITL Pass 2 to set the designation, leading to misframed audits when the default didn't match reality. The Ordinal audit hit this: HITL Pass 2 declared no-code-saas without site verification; the no-code designation triggered the Cat 11 audience-N/A logic; later correction to mixed required re-scoring two criteria. A site-verified audience designation at audit start would have produced the correct framing from Pass 1. Three updates: Methodology step 0: explicit checklist of signals to verify the designation against (hero copy, nav structure, testimonials, pricing tiers, API prominence, onboarding CTA framing). Requires citing at least three signals with quoted marketing copy. `mixed` listed as a first-class option, with guidance to prefer it when the platform clearly serves more than one audience. The `developer-platform` default is allowed only as a Pass-1 fallback when the homepage cannot be reached; Pass 2 must verify. SKILL.md "Audience matters" paragraph: `mixed` named as one of three options (not just a fallback). Adds the verification requirement and points at the methodology checklist. Notes that mixed audiences score by judgment per criterion. report-template Audience header: now requires inline citations of the signals that informed the designation (e.g. "mixed (primary marketing teams per hero copy 'X'; secondary agencies via 'Y' nav; tertiary developers via mid-page API mention)"). The bare designation alone is no longer sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cat 3 "Processing & handler guidance" was scored against a generic
anchor ("covers the handler lifecycle"). Platforms that mention
"respond fast" and "process async" in passing could score 2 without
teaching the actual ingest-verify-queue pattern, naming their
response timeout, or pointing integrators at concrete architectures.
This is the criterion most directly tied to Hookdeck Event Gateway's
value prop (and to cloud-native alternatives like AWS EventBridge +
API Gateway, GCP Pub/Sub + serverless function), but the rubric
didn't surface that connection.
Three updates:
rubric.md Cat 3 Processing & handler guidance: criterion text now
spells out the ingest-verify-queue pattern as the production-traffic
contract integrators need: acknowledge quickly with 2xx, verify the
signature, queue work to a background processor so burst traffic
and slow downstream work do not exceed the timeout. 2-anchor now
requires the platform to (a) name the timeout window, (b) explain
the pattern, and (c) point at concrete reference architectures
(Hookdeck Event Gateway, cloud-native EventBridge+API Gateway or
Pub/Sub+serverless function, or queue+worker on the integrator's
own infrastructure). 1-anchor covers partial coverage.
methodology.md step 3 (Read the webhook docs): explicit prompt to
look for the response timeout window, the ingest-verify-queue
pattern, and architecture references. Tactics search-term list adds
"timeout", "respond", "async", "queue", "EventBridge", "Pub/Sub",
"Event Gateway", "ingest".
program-mapping.md: new row mapping the ingest-at-scale gap to
Hookdeck Event Gateway as the integrator's ingest layer (or cloud-
native alternatives EventBridge+API Gateway, Pub/Sub+serverless
function). Distinguished from the existing endpoint-health row:
that one is about reliability for an existing handler, this one is
about teaching integrators the pattern itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
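For readers unfamiliar with the ingest-verify-queue pattern named in this commit, a minimal integrator-side sketch follows. It uses only the Python standard library. The secret, the hex HMAC-SHA256 signature scheme, and the names `handle_delivery`, `sign`, and `worker` are illustrative assumptions, not any platform's actual contract; a real handler would sit behind an HTTP framework and use a durable queue.

```python
import hashlib
import hmac
import json
import queue
import threading

# Hypothetical shared secret; real platforms define their own signing scheme.
SECRET = b"whsec_example"

work_queue = queue.Queue()  # in-memory stand-in for a durable queue
processed = []              # event ids handled by the background worker


def sign(raw_body: bytes, secret: bytes = SECRET) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def handle_delivery(raw_body: bytes, signature: str) -> int:
    """Verify the signature, enqueue the event, and return a status code.

    No slow work happens here, so the response stays inside the sender's
    timeout window even under burst traffic.
    """
    expected = sign(raw_body)
    # Constant-time comparison to avoid leaking signature bytes via timing.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    work_queue.put(json.loads(raw_body))
    return 200


def worker() -> None:
    """Background processor: slow downstream work happens here."""
    while True:
        event = work_queue.get()
        if event is None:  # sentinel used to stop the worker
            break
        processed.append(event["id"])


if __name__ == "__main__":
    body = json.dumps({"id": "evt_1", "type": "example.created"}).encode()
    print(handle_delivery(body, sign(body)))  # 200
    print(handle_delivery(body, "0" * 64))    # 401
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    work_queue.put(None)
    t.join()
    print(processed)  # ['evt_1']
```

The handler returns as soon as the event is queued, which is why the pattern protects deliveries at any volume: the response time depends on verification and enqueueing only, not on downstream processing.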
The pattern matters at any volume, not just at scale. A 5-second timeout kills a delivery whether the integrator is handling 1 req/sec or 1000. 'Reliably ingest' captures the goal (don't time out, don't lose deliveries) better than 'at scale', which implies high volume specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the v2 pass: migrate audit format to structured YAML (primary driver: cloud agent + public website for URL-submitted audits), consolidate v1's rubric and methodology learnings, and preserve Ordinal's HITL Pass 2 evidence so it does not need to be re-collected.
Eight phases sketched: schema design, consolidate v1 learnings, migrate audit template to YAML, update SKILL.md and methodology, preserve and port Ordinal HITL evidence, decide downstream backwards-compat path, re-run Ordinal under v2, cascade to downstream skill.
Includes a complete inventory of HITL-derived facts to carry forward (active usage observations, signing and delivery shape from the captured payload, audience verification, scoring decisions). Cross-check this list at every phase boundary.
Plan is for review before execution; commit per phase once execution starts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per direction: upstream skill emits YAML only. No Markdown audit output, no renderer, no Markdown template, no transitional backwards-compat phase. The downstream outpost-customer-audit-report skill is updated in lockstep to consume YAML.
Changes from the previous PLAN-v2 draft:
- Target layout drops renderers/ and assets/report-template.md
- Phase 2 simplifies: produce assets/report-template.yaml and delete the Markdown template; no renderer
- Phase 3 SKILL.md update: explicit YAML-only output
- Phase 5 (was "decide on backwards-compatibility") removed; no decision to make, since downstream cascades in lockstep
- Phase 5 (new, was Phase 6) re-runs Ordinal; produces audit.yaml only; v1 audit.md gets archived to customers/ordinal/archive/
- Phase 6 (new, was Phase 7) cascades to downstream skill in hookdeck-skills-internal: input becomes YAML, customer report stays Markdown (still the customer-facing deliverable)
Customer report format kept as Markdown for now since it is the customer-facing artifact. Open question for review: if the cloud agent's website ends up rendering customer reports as well, that decision can flip to YAML in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Customer report stays Markdown. The cloud agent has no current plan
to render customer reports; the customer-facing artifact is sent or
shared as a file. Decision settled, not an open question.
Open Question 3 ("Customer report format") removed from the open
list and added to a new "Resolved decisions" section at the top,
which lists the resolved choices that v2 execution should not
relitigate (upstream YAML-only, customer Markdown, downstream lockstep).
Remaining open questions renumbered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndations
Three additions to make PLAN-v2 self-sufficient for a fresh agent picking up the v2 work cold:
Phase 1 consolidation list: each item now has the v1 commit hash inline so the rationale is one git show away. Editorial qualifier rules also annotated as downstream-only with a pointer at the internal repo's methodology.
Schema sketch (illustrative): inline YAML showing the rough shape of audit.yaml and hitl-evidence.yaml. Field names, nesting, status enums, and scoring decision records all present. Marked as a starting point that Phase 0 refines against the schema linter; not authoritative.
Open question recommendations: each open question now has a "Recommendation:" line so a fresh agent has a default to push back against rather than picking from scratch (schema tooling, YAML lib and lint config, cloud-agent field reservation, archive location; re-audit timing already had one). Open questions remain genuine questions; the recommendations are starting points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `webhook-dx-audit`, a meta-skill that reviews the developer experience of any platform sending outbound webhooks or event destinations (HTTP, SQS, RabbitMQ, Pub/Sub, EventBridge, Kafka, Azure Event Grid) and produces a scored written review with prioritized recommendations.
Event destinations cover the broader case where a platform delivers events to user-chosen destinations beyond HTTP, so the audit applies to webhook senders and full event-destinations platforms equally.
The skill scores 12 categories (signup, onboarding, docs, event catalog, security/signing, delivery semantics, setup surfaces, SDKs, consumer self-serve, observability, local dev, and agent/AI readiness), weighted by where integrating developers lose the most time and trust. Recommendations point to matching Hookdeck offerings (Console test URLs, CLI, Event Gateway source types, generated webhook skills) as options, not pitches.
Aligned with the Event Destinations initiative
The rubric is benchmarked against the Event Destinations initiative spec at https://eventdestinations.org, not just a Hookdeck-internal opinion. The industry terminology is in flux: Stripe popularized "event destinations" and now delivers directly to Amazon EventBridge and Azure Event Grid alongside webhooks; Shopify is rolling out "Event Subscriptions"; others still call the whole thing "webhooks". The skill scores against the broader concept regardless of the platform's chosen label.
Specifically:
Why it's a different category
The rest of the repo helps you consume webhooks (provider skills, handler patterns) or operate webhook infrastructure (Event Gateway, Outpost). This skill is meta: it audits how a third-party platform exposes outbound webhooks or event destinations to its own customers. It produces a Markdown review, not a handler.
It has no `examples/` directory (no code to ship), matching the references-only shape of `webhook-handler-patterns`.
Placement
Added a new top-level README section, Webhook & Event Destinations DX Audit Skills, after the existing Webhook Infrastructure Skills section. None of the Provider, Handler Pattern, or Infrastructure tables fit cleanly: those are about building or running webhook integrations, while this skill evaluates them. A dedicated section keeps the boundary obvious for discovery and leaves room for future evaluation/tooling meta-skills.
What's included
- `skills/webhook-dx-audit/SKILL.md`: frontmatter conformed to repo conventions (license, metadata, multi-line description); scope paragraph references the Event Destinations spec
- `skills/webhook-dx-audit/references/`: `rubric.md`, `methodology.md`, `scoring.md`, `program-mapping.md`
- `skills/webhook-dx-audit/assets/report-template.md`: output template the skill fills in
- `.claude-plugin/marketplace.json`: new plugin entry (category `development`) with `event-destinations` keyword, and bundle inclusion
Notes for reviewers
- No `providers.yaml` entry (not a provider) and no `scripts/test-agent-scenario.sh` entry (no runnable code path to test). Mirrors the pattern of `webhook-handler-patterns`.
- `program-mapping.md` reference points findings at existing Hookdeck offerings (Console, CLI, EG, generated webhook skills) so audits double as a discovery path back into this repo.
Test plan
- `marketplace.json` parses (verified locally with `python3 -c "import json; json.load(open('.claude-plugin/marketplace.json'))"`)