A complete guide to building, operating, and extending FlakeGuard with Claude Code (MAX). It covers why and what to build, exact prompts, repo layout, CI/CD, GitHub App wiring, quality gates, ops runbooks, and upgrade paths.
FlakeGuard automatically detects intermittently failing (“flaky”) tests from CI signals, quarantines them safely (without masking true regressions), files repair tickets with context, and tracks “deflake” SLAs.
Goals
- Cut time-to-green for PRs by auto-recognizing known flakes and gating on only deterministic failures.
- Reduce false red builds and rebuild churn by ≥30% within 30 days.
- Provide actionable context: failure clustering, recent pass/fail streaks, environment diffs, and code owners.
- Offer safe quarantine: mark as “flakes under remediation” with clear expirations + mandatory follow-ups (no permanent ignore).
Non-goals (v1)
- Full test observability platform; we focus on flake lifecycle (detect → quarantine → fix).
- Universal CI support; v1: GitHub Actions first. (Jenkins/GitLab optional adapters later.)
Leading indicators
- % of red pipelines auto-reclassified as “flake suspected”.
- Mean re-run count per PR (target ↓).
- Number of quarantined tests resolved per week.
DORA alignment
We report impact using the four DORA metrics (deployment frequency, lead time, change failure rate, MTTR). Improving flake handling typically reduces change failure rate and MTTR. (Google Cloud)
Core pieces
- Ingestion: Webhooks from GitHub check_suite, workflow_job, and check_run, plus downloading job logs/artifacts for parsing.
- Classifier: Rules + heuristics + optional LLM summary to label failures (flake candidate vs deterministic).
- Quarantine engine: Opens/updates PR comments and sets a Check Run with up to three interactive buttons (e.g., “Re-run”, “Quarantine 7d”, “Escalate”). The Checks API allows at most 3 requested actions. (GitHub Docs)
- Registry: SQLite/Postgres storing test keys, flake counters, owners, last-seen env.
- Integrations
  - GitHub App (checks:write, pull_requests:read, contents:read minimally).
  - Optional Slack notifier for escalations (interactions must be acknowledged within 3 seconds). (Slack API)
- Observability: OpenTelemetry for Node.js service + Fastify auto-instrumentation. (OpenTelemetry, npm)
Event flow (GitHub Actions)
- Workflow finishes → GitHub sends webhook → Ingestion fetches job metadata and short-lived artifact URLs (download URL expires quickly; treat as ephemeral). (GitHub Docs)
- Parser derives test-case keys (suite::name::file::seed), error signatures, env hash (os, image, Node version).
- Classifier consults registry (confidence threshold) → sets Check Run with status + buttons. (3 buttons max.) (GitHub Docs)
- If user clicks “Quarantine 7d” → bot opens/updates tracking issue, updates quarantine list config, posts CODEOWNERS-routed ping. (CODEOWNERS precedence matters.) (GitHub Docs)
- Dual-window burn-rate alerts for flake backlog (operational SLOs). (Google SRE, Datadog)
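The key derivation in the parser step can be sketched as follows. This is a minimal illustration, not FlakeGuard's actual parser: the interfaces, the `noseed` placeholder, the 16-character digest length, and the normalization rules in `errorSignature` are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shapes; the field names are assumptions for this sketch.
interface TestIdentity {
  suite: string;
  name: string;
  file: string;
  seed?: string; // random seed, when the test runner reports one
}

interface EnvInfo {
  os: string;
  image: string;
  nodeVersion: string;
}

// Builds a key in the suite::name::file::seed form described above.
function buildTestKey(t: TestIdentity): string {
  return [t.suite, t.name, t.file, t.seed ?? "noseed"].join("::");
}

// Hashes the environment fields in a fixed order so that equal
// environments always produce the same short digest.
function envHash(env: EnvInfo): string {
  const canonical = JSON.stringify([env.os, env.image, env.nodeVersion]);
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}

// Replaces volatile details (hex addresses, numbers) before hashing, so the
// same failure with a different timeout or port maps to one signature.
function errorSignature(message: string): string {
  const normalized = message
    .replace(/0x[0-9a-f]+/gi, "0xADDR")
    .replace(/\d+/g, "N")
    .trim();
  return createHash("sha256").update(normalized).digest("hex").slice(0, 16);
}
```

Normalizing before hashing is what lets the registry count "timeout after 5000ms" and "timeout after 5312ms" as the same failure.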
flakeguard/
  apps/
    api/              # Fastify + Checks API + Slack (optional)
    github-app/       # App manifest, permissions, webhook router
    cli/              # Local tools (triage, migrations, backfills)
  packages/
    core/             # Test parser, signatures, quarantine rules
    adapters/gha/     # GitHub Actions parser & artifact readers
    adapters/junit/   # JUnit/XML readers
  infra/
    docker/           # Images & compose for local
    ops/              # Helm/chart or Terraform later
  .github/
    workflows/        # CI pipelines, release
    CODEOWNERS
  SECURITY.md
  CONTRIBUTING.md
- Repo name: `flakeguard`
- GitHub description: “Detect, quarantine, and fix flaky tests for GitHub Actions: safe by default, actionable by design.”
- Topics: `flaky-tests`, `ci`, `github-actions`, `testing`, `devx`, `sre`, `observability`, `check-runs`, `sbom`, `security`
- License: choose MIT (permissive) or Apache-2.0 (patent grant). See guidance. (Choose a License, GitHub)
- .gitignore: base on official Node & macOS/Linux templates from GitHub. (conventionalcommits.org)
- Enable required status checks before merge (protect `main`), so FlakeGuard’s checks must pass to merge. (GitHub Docs)
- Keep GITHUB_TOKEN least-privilege using workflow/job-level `permissions:` overrides (org default: restricted). (GitHub Docs)
- CODEOWNERS rules: note last-matching pattern wins. (GitHub Docs)
- Security scans: Gitleaks (secrets), Syft (SBOM), Grype (vuln scan). (GitHub)
- Node.js 20 LTS, pnpm, TypeScript, Fastify, Zod, Vitest.
- OpenTelemetry SDK + `@opentelemetry/auto-instrumentations-node` (adds HTTP/Express/Fastify spans out of the box). (OpenTelemetry)
- Create the GitHub App and set minimal permissions:
  - Repository: Contents: Read, Pull requests: Read, Checks: Read & Write, Actions: Read.
  - Webhooks: subscribe to check_suite, check_run, pull_request, workflow_job.
  - Rationale: enables Check Runs UI + reads CI outputs without broad write access (least-privilege). (GitHub Docs)
- Configure webhook secret + App private key; deploy Fastify receiver (ngrok or public URL).
- Install App on target repo(s) with repository-level access.
- In repo Settings → Branches: add rule: require FlakeGuard check + CI unit tests. (GitHub Docs)
Note (Checks UI): Keep interactive actions ≤ 3 buttons (e.g., “Re-run failing jobs”, “Quarantine 7d”, “Escalate to owners”). (GitHub Docs)
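The webhook secret configured in the setup steps is what the receiver uses to check the `X-Hub-Signature-256` header, which GitHub sends as `sha256=` followed by the hex HMAC-SHA256 of the raw request body. A minimal sketch of that check follows; the function name `verifySignature` is hypothetical, and how the raw body is captured in Fastify is left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only when the header matches the HMAC-SHA256 of the raw body.
// `verifySignature` is a hypothetical helper name for this sketch.
function verifySignature(
  secret: string,
  rawBody: string | Buffer,
  signatureHeader: string | undefined,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith("sha256=")) {
    return false;
  }
  const expected =
    "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const received = Buffer.from(signatureHeader, "utf8");
  const wanted = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === wanted.length && timingSafeEqual(received, wanted);
}
```

The HMAC has to be computed over the bytes exactly as received; re-serializing parsed JSON can change whitespace or key order and make valid deliveries fail verification.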
Workflows
- `ci.yml`: install, build, unit tests, type-check.
- `e2e.yml`: matrix against Node versions & OS.
- `security.yml`: gitleaks → syft (SBOM) → grype (scan). (GitHub)
Permissions
- Default repo Workflow permissions: Read only, then elevate per job (e.g., `checks: write` for release bot); this aligns with least-privilege guidance. (GitHub Docs)
Artifacts
- When the API fetches job artifacts/logs, respect ephemeral download URLs (re-request if expired). (GitHub Docs)
- Service SLO: Webhook ingest p95 < 1.0s; decision latency p95 < 20s.
- Flake backlog SLO: Burn-rate monitor using multi-window, multi-burn-rate alerting: e.g., (1h & 5m, BR 14.4) + (6h & 30m, BR 6). (Google SRE, SoundCloud Developers)
- Rationale and patterns are defined by Google SRE Workbook; multi-window reduces noise and resolves fast after a fix. (Google SRE)
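The multi-window rule can be sketched as a small function. Burn rate here is the observed bad-event ratio divided by the error budget (1 minus the SLO target), and a pair fires only when both its long and short windows are at or above the threshold. What counts as a "bad" event for the flake backlog (for example, a quarantined test past its SLA) is an assumption left to the caller, and the type and function names are hypothetical.

```typescript
// Counts of bad vs. total events observed in one time window.
interface WindowCounts {
  bad: number;
  total: number;
}

// One long/short window pair with its burn-rate threshold,
// e.g. (1h, 5m, 14.4) or (6h, 30m, 6).
interface BurnRatePair {
  long: WindowCounts;
  short: WindowCounts;
  threshold: number;
}

// Burn rate = observed error ratio / error budget.
function burnRate(w: WindowCounts, sloTarget: number): number {
  if (w.total === 0) return 0; // no traffic, nothing burned
  const errorBudget = 1 - sloTarget;
  return w.bad / w.total / errorBudget;
}

// Alert when any pair exceeds its threshold in BOTH windows. The short
// window makes the alert clear quickly once the problem is fixed.
function shouldAlert(pairs: BurnRatePair[], sloTarget: number): boolean {
  return pairs.some(
    (p) =>
      burnRate(p.long, sloTarget) >= p.threshold &&
      burnRate(p.short, sloTarget) >= p.threshold,
  );
}
```

With a 99% target, a window showing 20 bad events out of 100 burns at roughly 20 times the sustainable rate, which exceeds the 14.4 threshold.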
Use these verbatim in Claude Code’s chat. Keep them small and iterative; ask Claude Code to open/modify specific files and to produce diffs.
Role: Repository Bootstrapper
Goal: Initialize FlakeGuard monorepo with pnpm workspaces, TS strict, Fastify API, and packages/core.
Actions:
1) Create packages/core with tsconfig, Vitest config, and an exported parseFailure() placeholder.
2) Create apps/api (Fastify + health route /healthz, JSON logging).
3) Create apps/github-app with a Fastify plugin that handles POST /webhook and verifies X-Hub-Signature-256.
4) Add pnpm workspace root configs, eslint + prettier, and scripts: build, test, dev.
5) Generate .gitignore (Node + OS) and LICENSE (MIT). Keep diff small, compile & test.
Acceptance:
- `pnpm -w install && pnpm -w -r build` succeeds.
- `curl localhost:3000/healthz` returns 200 in dev.
- Show all file diffs and any new scripts.
Role: GitHub Checks Integrator
Goal: Implement create/update Check Runs with up to 3 "requested_action" buttons.
Constraints:
- Max 3 actions per check run (enforced by API).
- Include summary markdown with failure clusters and suspected flakes.
Tasks:
- Add github/rest client wrapper in apps/api/src/gh.ts.
- Implement POST /check-runs to create; PATCH /check-runs/:id to update status (queued, in_progress, completed).
- Wire webhook handlers for check_suite.completed and workflow_job.completed.
- Provide a pure function buildCheckRunPayload({status, conclusion, actions:[...]}) that never exceeds 3 actions.
Acceptance:
- Unit tests cover that supplying >3 actions is truncated with warning.
- Demo script creates a check on a fake SHA and prints the URL.
(Checks API 3-action limit) (GitHub Docs)
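For reference when reviewing what Claude Code produces for this prompt, the pure builder could look roughly like the sketch below. The output field names (`head_sha`, `output`, `actions` with `label`, `description`, `identifier`) follow the GitHub Checks API request body; the input type, the `neutral` default conclusion, and the injectable `warn` callback are assumptions of this example.

```typescript
interface RequestedAction {
  label: string;
  description: string;
  identifier: string;
}

type CheckStatus = "queued" | "in_progress" | "completed";
type CheckConclusion = "success" | "failure" | "neutral" | "action_required";

// Hypothetical input shape for this sketch.
interface CheckRunInput {
  name: string;
  headSha: string;
  status: CheckStatus;
  conclusion?: CheckConclusion;
  title: string;
  summary: string; // markdown: failure clusters, suspected flakes
  actions: RequestedAction[];
}

interface CheckRunPayload {
  name: string;
  head_sha: string;
  status: CheckStatus;
  conclusion?: CheckConclusion;
  output: { title: string; summary: string };
  actions: RequestedAction[];
}

const MAX_ACTIONS = 3;

// Pure function: never returns more than MAX_ACTIONS actions, and warns
// (instead of throwing) when extras are dropped.
function buildCheckRunPayload(
  input: CheckRunInput,
  warn: (message: string) => void = console.warn,
): CheckRunPayload {
  if (input.actions.length > MAX_ACTIONS) {
    const dropped = input.actions.length - MAX_ACTIONS;
    warn(`Dropping ${dropped} action(s); at most ${MAX_ACTIONS} are allowed.`);
  }
  const payload: CheckRunPayload = {
    name: input.name,
    head_sha: input.headSha,
    status: input.status,
    output: { title: input.title, summary: input.summary },
    actions: input.actions.slice(0, MAX_ACTIONS),
  };
  // A conclusion only makes sense once the run is completed.
  if (input.status === "completed") {
    payload.conclusion = input.conclusion ?? "neutral";
  }
  return payload;
}
```

Passing `warn` as a parameter keeps the function pure enough to unit-test the truncation case without capturing console output.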
Role: GHA Artifact Ingestion Engineer
Goal: Given a run_id, fetch artifacts & logs, parse JUnit XML to normalized failures.
Constraints:
- Artifact download URLs are short-lived; always request fresh URL, handle expiry by refetch.
- Don’t assume public visibility.
Tasks:
- Implement getWorkflowArtifacts(runId) -> [{name, expiresAt, downloadUrl}]
- Add retry/backoff if HTTP 403/Expired.
- Write parser: packages/adapters/junit: parseJUnit(xml) -> { testKey, messageHash, stack, file, line }
- For logs, add simple regex to extract "FAILED" blocks when JUnit absent.
Acceptance:
- Unit tests with recorded fixtures (small samples).
- A CLI command `fg ingest --run <id>` prints top failure signatures.
(Ephemeral artifact URLs) (GitHub Docs)
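The "always request a fresh URL, refetch on expiry" constraint can be sketched independently of any HTTP client. In this illustration both `getFreshUrl` and `download` are injected, hypothetical callbacks, HTTP 403 is treated as "URL expired", and the attempt count and delays are arbitrary defaults.

```typescript
// Hypothetical result shape returned by the injected download callback.
interface DownloadResult<T> {
  status: number;
  body: T | null;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Requests a new signed URL before every attempt, so an expired URL is
// never reused. Retries with exponential backoff only on HTTP 403.
async function downloadWithRefresh<T>(
  getFreshUrl: () => Promise<string>,
  download: (url: string) => Promise<DownloadResult<T>>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const url = await getFreshUrl();
    const result = await download(url);
    if (result.status === 200 && result.body !== null) {
      return result.body;
    }
    if (result.status !== 403) {
      // Anything other than "expired" is not retried here.
      throw new Error(`Download failed with HTTP ${result.status}`);
    }
    if (attempt < maxAttempts) {
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw new Error(`Download URL kept expiring after ${maxAttempts} attempts`);
}
```

Injecting the two callbacks keeps the retry logic testable with recorded fixtures, in line with the acceptance criteria of the prompt.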
Role: Flake Classifier
Goal: Implement flake scoring and quarantine action with reversible config.
Inputs:
- Past N outcomes for testKey, env hash, messageHash similarity.
Tasks:
- packages/core: scoreFlake({passStreak, failStreak, variability, envEntropy}) -> 0..1
- Quarantine config file in repo: .flakeguard/quarantine.yml with entries {testKey, until, reason, owner}.
- “Quarantine 7d” button updates config via PR and posts comment tagging CODEOWNERS.
Acceptance:
- If score >= 0.8 AND test previously passed on re-run, status = "flake suspected".
- Creating PR modifies .flakeguard/quarantine.yml and adds a TODO with expiration date.
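As a reference point for this prompt, the scoring and the acceptance rule might be sketched as below. The weights, the assumption that `variability` and `envEntropy` are already normalized to 0..1, and the helper names other than `scoreFlake` are illustrative guesses, not a tuned model.

```typescript
interface FlakeSignals {
  passStreak: number; // consecutive recent passes
  failStreak: number; // consecutive recent failures
  variability: number; // assumed 0..1: how often the outcome flips
  envEntropy: number; // assumed 0..1: how spread out failures are across envs
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Returns a score in 0..1; higher means more flake-like.
// The weights below are illustrative assumptions.
function scoreFlake(s: FlakeSignals): number {
  const streakTotal = s.passStreak + s.failStreak;
  // A long unbroken fail streak suggests a deterministic failure.
  const failDominance = streakTotal === 0 ? 0 : s.failStreak / streakTotal;
  const raw =
    0.5 * clamp01(s.variability) +
    0.2 * clamp01(s.envEntropy) +
    0.3 * (1 - failDominance);
  return clamp01(raw);
}

// Acceptance rule from the prompt: score >= 0.8 AND passed on re-run.
function classify(
  score: number,
  passedOnRerun: boolean,
): "flake suspected" | "deterministic" {
  return score >= 0.8 && passedOnRerun ? "flake suspected" : "deterministic";
}

// Mirrors an entry of .flakeguard/quarantine.yml.
interface QuarantineEntry {
  testKey: string;
  until: string; // ISO 8601 timestamp
  reason: string;
  owner: string;
}

// A quarantine only applies until its expiry; there is no permanent ignore.
function isQuarantined(
  entries: QuarantineEntry[],
  testKey: string,
  now: Date,
): boolean {
  return entries.some(
    (e) => e.testKey === testKey && new Date(e.until).getTime() > now.getTime(),
  );
}
```

Requiring a pass on re-run in addition to the score is what keeps a consistently failing test from being labeled a flake just because its history looks noisy.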
Role: Slack Integrator
Goal: Notify #ci-flakes with a compact message when quarantine exceeds SLA.
Constraints:
- Respond/ack to interactive actions within 3 seconds.
- Support Socket Mode for local dev.
Acceptance:
- Post a block kit message with testKey, owner, days in quarantine, button "Create Jira".
(3-second ack; Socket Mode basics) (Slack API)
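The 3-second constraint is usually met by acknowledging first and doing the slow work afterwards. The sketch below shows that pattern with a tiny in-memory queue; `DeferredQueue` and `acknowledgeThenProcess` are hypothetical names, no Slack SDK is involved, and a production setup would likely use a durable queue instead.

```typescript
type Job = () => Promise<void>;

// Minimal in-memory FIFO that runs jobs one at a time, after the
// current call stack has returned.
class DeferredQueue {
  private jobs: Job[] = [];
  private draining = false;

  enqueue(job: Job): void {
    this.jobs.push(job);
    if (!this.draining) {
      this.draining = true;
      // Defer to a later tick so the caller can respond first.
      setTimeout(() => {
        void this.drain();
      }, 0);
    }
  }

  private async drain(): Promise<void> {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      try {
        await job();
      } catch (err) {
        // One failing job must not stop the rest of the queue.
        console.error("deferred job failed", err);
      }
    }
    this.draining = false;
  }
}

// Returns the acknowledgment immediately; the real work runs later.
function acknowledgeThenProcess<P>(
  payload: P,
  queue: DeferredQueue,
  process: (payload: P) => Promise<void>,
): { status: 200 } {
  queue.enqueue(() => process(payload));
  return { status: 200 };
}
```

Because the handler does nothing except enqueue, its response time no longer depends on how long ticket creation or GitHub API calls take.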
Role: Telemetry Engineer
Goal: Enable OpenTelemetry auto-instrumentation for Fastify API with traces & metrics.
Tasks:
- Add @opentelemetry/sdk-node + auto-instrumentations-node, exporter OTLP/HTTP env-configurable.
- Create tracer provider bootstrap; span for webhook handlers; attributes: event_type, run_id.
- Add /metrics (prom-compatible) and latency histograms.
Acceptance:
- Local collector receives spans when POST /webhook invoked.
Role: CI Security Engineer
Goal: Add "security.yml" workflow for Gitleaks → Syft SBOM → Grype vuln scan.
Acceptance:
- security job fails on secrets found or critical CVEs.
- Upload the SBOM as a build artifact.
(GitHub)
Role: Repo Guardian
Goal: Enforce required checks and sane review rules.
Tasks:
- Add .github/CODEOWNERS with general-to-specific patterns (remember: last match wins, so specific rules go last).
- Turn on branch protection with "Require status checks" and FlakeGuard check required.
- Add commit lint for Conventional Commits and semver-calculated release notes.
Acceptance:
- PR cannot merge unless FlakeGuard + tests succeed and owners approve.
(GitHub Docs, Semantic Versioning, Keep a Changelog)
- Create GitHub App per §8. Save App ID, Client ID/secret, private key, webhook secret.
- Install App on target repos.
- Set repo protections: Require status checks (FlakeGuard; unit tests). (GitHub Docs)
- Set Actions → Workflow permissions: default Read; raise per job in YAML. (GitHub Docs)
- Choose license (MIT/Apache-2.0) and commit it as a `LICENSE` file. (Choose a License)
- Add .gitignore from GitHub templates (Node + OS). (conventionalcommits.org)
- Commits: Conventional Commits (`feat:`, `fix:`, `docs:` …) → enables auto-release & changelog. (Semantic Versioning)
- Versioning: SemVer 2.0.0. Breaking change → major. (Keep a Changelog)
- CHANGELOG: human-friendly per Keep a Changelog. (Ghinda)
- Unit tests added/updated for classifiers and adapters.
- Check Run shows ≤ 3 actions, labels + summary render correctly. (GitHub Docs)
- No secrets in diff (Gitleaks green). (GitHub)
- SBOM attached; no critical vulns (Grype). (GitHub)
- Branch is green on required checks; reviewers per CODEOWNERS approved. (GitHub Docs)
Webhooks failing (Slack/HTTP)
- Confirm the endpoint returns a 2xx within 3 seconds; add an async queue so the handler acks immediately and processes later. (Slack API)
Artifacts download 403
- Re-request signed URL; they can expire quickly. (GitHub Docs)
No check appears on PR
- App permissions missing (Checks: write), or branch protection blocking unknown checks.
Too many buttons in the Check
- Trim to ≤3; the API rejects extras. (GitHub Docs)
- Jenkins/GitLab adapters.
- Heuristics → ML: cluster by stacktrace embeddings, env diffs.
- Policy packs: org-wide quarantine SLA policy, auto-escalation to owners.
- UI: historical flake dashboard with SLO burn-rate panels (multi-window). (Google SRE)
Paste one task at a time; ask for diffs & tests each time.
- Workspace & API skeleton — as in §11.1
- Checks API wrapper & webhook routes — §11.2
- Artifact ingestion + JUnit parser + CLI — §11.3
- Classifier & quarantine PR flow — §11.4
- CODEOWNERS + protections + commitlint — §11.8
- Security workflow (gitleaks/syft/grype) — §11.7
- Telemetry (OTel) — §11.6
- Seed sample repo & fixtures (E2E)
- Slack escalation (opt-in) — §11.5
- Docs: SECURITY.md, CONTRIBUTING.md, README diagrams (include SLOs and burn-rate explainer). (Google SRE)
- Release train: weekly. `feat:` bumps minor, `fix:` bumps patch (SemVer). (Keep a Changelog)
- Changelog generated from Conventional Commits → publish GH release. (Ghinda)
- GitHub App listing: follow Marketplace guidelines if publishing broadly. (Omi AI)
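The commit-to-version mapping in the release train can be sketched as follows. This is a simplified illustration, assuming plain `MAJOR.MINOR.PATCH` versions with no pre-release tags; the function names are hypothetical, and a real pipeline would more likely rely on an existing release tool.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Classifies one Conventional Commit message.
// "!" after the type/scope or a BREAKING CHANGE footer means major.
function bumpForCommit(message: string): Bump {
  const header = message.split("\n")[0] ?? "";
  const match = /^(\w+)(\([^)]*\))?(!)?:\s/.exec(header);
  if (!match) return "none";
  if (match[3] === "!" || /^BREAKING[ -]CHANGE:/m.test(message)) return "major";
  if (match[1] === "feat") return "minor";
  if (match[1] === "fix") return "patch";
  return "none";
}

// Applies the largest bump found across all commits since the last release.
function nextVersion(current: string, messages: string[]): string {
  const rank: Bump[] = ["none", "patch", "minor", "major"];
  let bump: Bump = "none";
  for (const message of messages) {
    const candidate = bumpForCommit(message);
    if (rank.indexOf(candidate) > rank.indexOf(bump)) bump = candidate;
  }
  const parts = current.split(".").map(Number);
  const major = parts[0] ?? 0;
  const minor = parts[1] ?? 0;
  const patch = parts[2] ?? 0;
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  if (bump === "patch") return `${major}.${minor}.${patch + 1}`;
  return current; // docs:, chore:, etc. do not trigger a release
}
```

For example, a week containing one `fix:` and one `feat:` commit on top of 1.2.3 yields 1.3.0, since the minor bump outranks the patch bump.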
Minimal workflow permissions
# .github/workflows/ci.yml
on: pull_request
permissions:
  contents: read
  checks: write # only if this job updates checks
  actions: read
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # pnpm/action-setup reads the pnpm version from the packageManager
      # field in package.json unless a version input is given
      - uses: pnpm/action-setup@v4
      - run: pnpm i && pnpm -r build && pnpm -r test

(Principle of least privilege for GITHUB_TOKEN) (GitHub Docs)
CODEOWNERS precedence reminder
Last matching pattern wins; keep specific rules last. (GitHub Docs)
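The precedence rule is easy to see in a small resolver. The matcher below is deliberately simplified and supports only `*`, `*.ext`, and directory prefixes such as `/packages/core/`; real CODEOWNERS files accept a richer, gitignore-like syntax. The team names are made up for the example.

```typescript
interface OwnerRule {
  pattern: string;
  owners: string[];
}

// Simplified matcher (assumption): "*", "*.ext", "/dir/" prefixes,
// and exact file paths only.
function matchesPattern(pattern: string, path: string): boolean {
  if (pattern === "*") return true;
  if (pattern.startsWith("*.")) return path.endsWith(pattern.slice(1));
  const normalized = pattern.replace(/^\//, "");
  return normalized.endsWith("/")
    ? path.startsWith(normalized)
    : path === normalized;
}

// Walks the rules top to bottom; a later match overrides an earlier one.
function ownersFor(rules: OwnerRule[], path: string): string[] {
  let owners: string[] = [];
  for (const rule of rules) {
    if (matchesPattern(rule.pattern, path)) {
      owners = rule.owners;
    }
  }
  return owners;
}

// Hypothetical rules, ordered general to specific.
const exampleRules: OwnerRule[] = [
  { pattern: "*", owners: ["@org/everyone"] },
  { pattern: "/packages/core/", owners: ["@org/core"] },
  { pattern: "*.md", owners: ["@org/docs"] },
];
```

With these rules, `packages/core/README.md` resolves to `@org/docs` rather than `@org/core`, because the `*.md` rule appears later in the file.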
Burn-rate pairs (starter)
(1h & 5m, BR 14.4) and (6h & 30m, BR 6): low noise + quick resolution. (SoundCloud Developers)
README intro
FlakeGuard watches your CI, finds intermittent failures, and gives you buttons to re-run, quarantine, or escalate — without hiding real regressions. Integrates natively with GitHub Checks and your Slack.
Badges
- CI, Security Scan (gitleaks/grype), SBOM available, Coverage.