feat(flakeguard): A1–A5, B1–B3 Actions-only pipeline with soft gating

📋 What

This PR implements the complete FlakeGuard MVP as an Actions-only pipeline with soft gating:

A1-A5: Core FlakeGuard Engine Implementation

A1: Problem Matcher Author - GitHub Actions problem matchers for test/log error annotations
A2: Job Summary Author - Rich Markdown reports to $GITHUB_STEP_SUMMARY
A3: Auto-Rerun Engineer - Intelligent rerun for flaky-dominant failures with loop prevention
A4: Quarantine Governor & PR Notifier - TTL validation and owner notification system
A5: Artifact Retriever - Robust artifact downloads with URL expiration handling

B1-B3: Quality & Infrastructure

B1: Branch Protection Auditor - Automated workflow analysis and protection configuration
B2: Enhanced Workflows - Integrated FlakeGuard analysis with security permissions
B3: Comprehensive CLI Tools - Full command-line interface for all FlakeGuard operations

🎯 Why

Business Value:

Reduce False Positives: Flaky test failures no longer block legitimate code changes
Improve CI Reliability: Automatic rerun when at least 70% of a run's failures are known flakes
Cost Optimization: Smart detection prevents unnecessary workflow reruns
Developer Experience: Transparent decision making with clear audit trails

Technical Benefits:

Robust Loop Prevention: SHA-based lock files prevent infinite rerun cycles
Intelligent Analysis: Frequency × volatility scoring for flake confidence
Comprehensive Logging: Rich GitHub Step Summaries with actionable insights
Enterprise Security: Least-privilege permissions and secret scanning integration
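
As a rough illustration of the frequency × volatility idea, the sketch below scores a test from its recent pass/fail history. This PR does not define either term, so both definitions here are assumptions: frequency is taken as the share of failing runs and volatility as the share of adjacent runs whose outcome flipped. The names `RunOutcome`, `failureFrequency`, `volatility`, and `flakeScore` are hypothetical.

```typescript
// Hypothetical sketch: the definitions of frequency and volatility below
// are assumptions for illustration, not FlakeGuard's actual formulas.
type RunOutcome = "pass" | "fail";

// Share of runs that failed, in the range 0..1.
function failureFrequency(history: RunOutcome[]): number {
  if (history.length === 0) return 0;
  const failures = history.filter((o) => o === "fail").length;
  return failures / history.length;
}

// Share of adjacent run pairs whose outcome flipped, in the range 0..1.
function volatility(history: RunOutcome[]): number {
  if (history.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips / (history.length - 1);
}

// Product of the two signals: high only when a test fails often AND unstably.
function flakeScore(history: RunOutcome[]): number {
  return failureFrequency(history) * volatility(history);
}
```

Under these assumptions a test that fails on every run scores 0 because its outcome never flips, which keeps genuine breakage from being mistaken for flakiness.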

🔧 How to Test

1. FlakeGuard Problem Matchers & Annotations

# Run CI workflow and check PR Files tab for annotations
git push origin feature/flakeguard-mvp

# Verify problem matcher registration
grep -r "add-matcher" .github/workflows/

Expected: PR annotations appear on changed lines showing test failures and FlakeGuard analysis

2. Job Summary Generation

# Test summary generation with sample data
pnpm --filter @flakeguard/cli exec tsx src/summary.ts generate \
  --report .flakeguard/out/flaky-report.json \
  --quarantine .flakeguard/quarantine.yml

Expected: Rich Markdown summary with metrics tables, top flaky tests, and governance status

3. Auto-Rerun Logic

# Demo auto-rerun decision making
node demo-auto-rerun.js

# Test with different scenarios
pnpm --filter @flakeguard/cli exec tsx src/auto-rerun.ts analyze \
  --report .flakeguard/out/flaky-report.json \
  --quarantine .flakeguard/quarantine.yml \
  --sha "test-sha-123" \
  --run-id "99887766" \
  --dry-run

Expected:

Flaky-dominant failures (≥70%) trigger rerun
Lock files prevent multiple reruns per SHA
Clear logging shows decision reasoning
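
The expected behaviour above can be summarised as a small decision function. This is a sketch under assumptions rather than the engine's actual code: the `FailedTest` and `RerunDecision` shapes and the `knownFlaky` flag are hypothetical, since the report format is not shown in this PR.

```typescript
// Hypothetical input shape: the real flaky-report.json schema is not shown here.
interface FailedTest {
  testKey: string;
  knownFlaky: boolean; // assumed to be derived from the flaky report or quarantine list
}

interface RerunDecision {
  rerun: boolean;
  flakyRatio: number;
  reason: string; // logged so the decision is auditable
}

const FLAKY_DOMINANCE_THRESHOLD = 0.7; // the 70% threshold described in this PR

function decideRerun(failures: FailedTest[], lockExists: boolean): RerunDecision {
  if (failures.length === 0) {
    return { rerun: false, flakyRatio: 0, reason: "no failures" };
  }
  const flakyCount = failures.filter((f) => f.knownFlaky).length;
  const flakyRatio = flakyCount / failures.length;
  if (lockExists) {
    // A lock for this SHA means a rerun was already triggered once.
    return { rerun: false, flakyRatio, reason: "already rerun for this SHA" };
  }
  if (flakyRatio >= FLAKY_DOMINANCE_THRESHOLD) {
    return { rerun: true, flakyRatio, reason: "flaky-dominant failures" };
  }
  return { rerun: false, flakyRatio, reason: "failures are not flaky-dominant" };
}
```

Returning the ratio and a reason string alongside the boolean keeps the decision explainable in logs, which is the point of the third expectation.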

4. Quarantine Governance

# Test quarantine schema validation
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts validate .flakeguard/quarantine.yml

# Generate PR governance comment
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts report \
  .flakeguard/out/flaky-report.json .flakeguard/quarantine.yml

Expected:

Schema validation passes for quarantine config
PR comments show expiring quarantines and owner mentions
CODEOWNERS mapping works correctly
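
To make the CODEOWNERS expectation concrete, here is a deliberately simplified sketch of owner lookup. It relies on the GitHub rule that the last matching pattern in the file takes precedence. The matcher handles only `*`, `*.ext`, directory prefixes ending in `/`, and exact paths, which is an assumption made to keep the example short; real CODEOWNERS globbing is richer. All function names are hypothetical.

```typescript
interface OwnerRule {
  pattern: string;
  owners: string[];
}

// Parse non-empty, non-comment lines into pattern/owner pairs.
function parseCodeowners(text: string): OwnerRule[] {
  const rules: OwnerRule[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const [pattern, ...owners] = line.split(/\s+/);
    if (pattern !== undefined) rules.push({ pattern, owners });
  }
  return rules;
}

// Simplified matching (assumption): "*", "*.ext", "dir/" prefixes, exact paths.
function matchesPattern(pattern: string, path: string): boolean {
  if (pattern === "*") return true;
  if (pattern.startsWith("*.")) return path.endsWith(pattern.slice(1));
  const anchored = pattern.replace(/^\//, "");
  if (anchored.endsWith("/")) return path.startsWith(anchored);
  return path === anchored;
}

// Later rules override earlier ones, so the last match wins.
function ownersFor(rules: OwnerRule[], path: string): string[] {
  let owners: string[] = [];
  for (const rule of rules) {
    if (matchesPattern(rule.pattern, path)) owners = rule.owners;
  }
  return owners;
}
```

Scanning the whole file and keeping the final match mirrors the precedence rule without needing to sort or rank patterns.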

5. Artifact Retrieval with Retry Logic

# Test artifact listing and download
pnpm --filter @flakeguard/cli exec tsx src/fetch-artifact.ts list microsoft vscode

# Test bulk download with filtering
pnpm --filter @flakeguard/cli exec tsx src/bulk-artifact-download.ts download-filtered \
  microsoft vscode ./test-downloads "logs.*electron"

Expected:

Robust retry logic handles URL expiration (1-minute TTL)
Clear logging shows download status and retry attempts
Non-retryable errors correctly identified

6. Branch Protection Auditing

# Analyze workflow configuration
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts analyze

# Generate protection configuration
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts config

Expected:

14 required checks identified correctly
Clear separation of stable vs experimental checks
Ready-to-use branch protection configuration
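
For reference, a generated protection configuration could be shaped like the request body of GitHub's "Update branch protection" REST endpoint. The sketch below is an assumption about how the auditor's output might be assembled: the `CheckInfo` shape, the `stable` flag, and the choice of `strict: true` are all hypothetical.

```typescript
// Hypothetical shape for one audited check.
interface CheckInfo {
  name: string;
  stable: boolean; // experimental checks are left out of the required set
}

// Builds a body with the four top-level fields the endpoint requires.
function buildProtectionPayload(checks: CheckInfo[]) {
  const contexts = checks
    .filter((c) => c.stable)
    .map((c) => c.name)
    .sort();
  return {
    required_status_checks: { strict: true, contexts }, // strict is an assumed choice
    enforce_admins: false,
    required_pull_request_reviews: null,
    restrictions: null,
  };
}
```

Sorting the check names keeps the generated configuration stable across runs, so diffs only show real changes to the required set.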

🛡️ Security & Permissions

Explicit Least-Privilege Configuration

permissions:
  contents: read          # Repository checkout and config access
  pull-requests: write    # PR comments and annotations
  actions: read           # Artifact downloads and workflow logs
  checks: write           # FlakeGuard status check creation
  security-events: write  # SARIF upload for security scans

Secret Scanning Integration

TruffleHog and Gitleaks integration
No secrets exposed in logs or artifacts
Secure token handling for GitHub API calls

📊 Key Metrics & Acceptance Criteria

A1: Problem Matchers ✅

Regex patterns for Vitest, Jest, TypeScript, ESLint
FlakeGuard-specific patterns for [FLAKY], [FAILED], [QUARANTINED]
PR annotations appear on relevant lines
Workflow registration with ::add-matcher::
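
A FlakeGuard-specific matcher might look like the sketch below, expressed as a TypeScript object that serializes to the problem matcher JSON format. The log line layout (`[TAG] file:line:col - message`) and the owner id are assumptions, since this PR does not show the actual patterns.

```typescript
// Assumed log line format (not specified in this PR):
//   [FLAKY] src/foo.test.ts:12:5 - passed on retry
const FLAKEGUARD_REGEXP =
  "^\\[(FLAKY|FAILED|QUARANTINED)\\]\\s+(.+?):(\\d+):(\\d+)\\s+-\\s+(.*)$";

// Object form of a problem matcher file; numbers are regex capture groups.
const flakeGuardMatcher = {
  problemMatcher: [
    {
      owner: "flakeguard-sketch", // hypothetical owner id
      pattern: [
        { regexp: FLAKEGUARD_REGEXP, code: 1, file: 2, line: 3, column: 4, message: 5 },
      ],
    },
  ],
};

interface Annotation {
  code: string;
  file: string;
  line: number;
  column: number;
  message: string;
}

// Applies the same regex locally so patterns can be checked before CI runs.
function matchLine(logLine: string): Annotation | null {
  const m = new RegExp(FLAKEGUARD_REGEXP).exec(logLine);
  if (m === null) return null;
  const [, code = "", file = "", line = "0", column = "0", message = ""] = m;
  return { code, file, line: Number(line), column: Number(column), message };
}
```

Serializing `flakeGuardMatcher` with `JSON.stringify` yields a file that a workflow step can register by echoing `::add-matcher::` followed by the file's path.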

A2: Job Summaries ✅

Rich Markdown with tables, metrics, recommendations
Governance section linking to PR comments
Artifact URLs and environment analysis
Color-coded badges and urgency indicators
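
The table portion of such a summary could be rendered along these lines. This is a minimal sketch: the `FlakyTestRow` shape, the heading text, and the column choice are assumptions, not the actual report layout.

```typescript
// Hypothetical row shape for the "top flaky tests" table.
interface FlakyTestRow {
  testKey: string;
  failRate: number; // 0..1
  owner: string;
}

// Escape pipes so test names cannot break the Markdown table.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|");
}

function renderSummary(rows: FlakyTestRow[]): string {
  const lines = [
    "## FlakeGuard Summary",
    "",
    "| Test | Fail rate | Owner |",
    "| --- | --- | --- |",
    ...rows.map(
      (r) => `| ${escapeCell(r.testKey)} | ${(r.failRate * 100).toFixed(1)}% | ${escapeCell(r.owner)} |`,
    ),
  ];
  return lines.join("\n") + "\n";
}
```

In a workflow step, the returned string would be appended to the file whose path is given by the `GITHUB_STEP_SUMMARY` environment variable.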

A3: Auto-Rerun ✅

70% flaky-dominance threshold implementation
SHA-based lock files (.flakeguard/.rerun.lock.<sha>)
Dual execution (REST API + gh CLI)
Maximum 1 rerun per commit enforcement
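
The one-rerun-per-commit rule can be sketched as a lock acquisition keyed by SHA. To keep the example runnable without touching disk, storage is abstracted behind a hypothetical `LockStore` interface; only the lock path pattern comes from this PR.

```typescript
// Minimal storage abstraction (hypothetical) so the sketch needs no filesystem.
interface LockStore {
  has(path: string): boolean;
  add(path: string): void;
}

// Path pattern from this PR: .flakeguard/.rerun.lock.<sha>
function lockPathFor(sha: string): string {
  return `.flakeguard/.rerun.lock.${sha}`;
}

// Returns true only for the first caller for a given SHA.
function tryAcquireRerunLock(store: LockStore, sha: string): boolean {
  const path = lockPathFor(sha);
  if (store.has(path)) return false; // a rerun was already triggered for this commit
  store.add(path);
  return true;
}

// A Set<string> satisfies LockStore structurally and works as an in-memory store.
const inMemoryLocks: LockStore = new Set<string>();
```

The check-then-add sequence here is not atomic. On a real filesystem, creating the file with an exclusive flag (for example Node's `wx` open flag) would close that gap.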

A4: Quarantine Governor ✅

Zod schema validation for {testKey, owner, reason, until}
CODEOWNERS file parsing with precedence rules
Consolidated PR comments with owner mentions
TTL validation and expiration warnings
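
TTL validation and expiration warnings might reduce to a classification like the one below. The entry fields come from the schema named above; treating `until` as an ISO 8601 date string and using a 7-day warning window are assumptions, and `ttlStatus` is a hypothetical name.

```typescript
// Fields from the quarantine schema in this PR; `until` is assumed to be ISO 8601.
interface QuarantineEntry {
  testKey: string;
  owner: string;
  reason: string;
  until: string;
}

type TtlStatus = "invalid" | "expired" | "expiring" | "active";

const DAY_MS = 24 * 60 * 60 * 1000;

// Classifies an entry relative to `now`; warnDays is an assumed default.
function ttlStatus(entry: QuarantineEntry, now: Date, warnDays = 7): TtlStatus {
  const until = Date.parse(entry.until);
  if (Number.isNaN(until)) return "invalid";
  const remainingMs = until - now.getTime();
  if (remainingMs <= 0) return "expired";
  if (remainingMs <= warnDays * DAY_MS) return "expiring";
  return "active";
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, makes the classification deterministic and easy to test.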

A5: Artifact Retriever ✅

GitHub API integration with redirect handling
1-minute URL TTL refresh logic
Exponential backoff retry strategy
Bulk download with concurrency control
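
The retry behaviour could follow a pattern like this sketch, which separates the pure pieces (delay calculation, error classification) from the retry loop. The status classification, base delay, cap, and attempt count are assumptions, and all names are hypothetical.

```typescript
// Assumed classification: 408, 429, and 5xx are retryable; other statuses are not.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Exponential backoff: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Extracts a numeric `status` from an error-like value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const s = (err as { status: unknown }).status;
    return typeof s === "number" ? s : undefined;
  }
  return undefined;
}

// Retries fetchOnce until it succeeds, hits a non-retryable error,
// or exhausts maxAttempts. fetchOnce should request a fresh download
// URL on every call, since a signed URL may have expired in between.
async function withRetry<T>(
  fetchOnce: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      const status = statusOf(err);
      // Errors without a status (e.g. network failures) are treated as retryable.
      const retryable = status === undefined || isRetryableStatus(status);
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Re-resolving the download URL inside `fetchOnce` on each attempt is what lets a retry recover from an expired link instead of repeating the same failing request.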

B1: Branch Protection Auditor ✅

10 workflows analyzed with 44 total jobs
14 required vs 30 optional/conditional checks
Complete policy documentation generated
CLI tools for ongoing configuration management

🔄 Rollback Plan

Immediate Rollback (< 5 minutes)

# Disable FlakeGuard workflows
gh workflow disable flakeguard.yml
gh workflow disable pr-notifications.yml
gh workflow disable quarantine-pr.yml

# Remove from branch protection (if enabled)
# Manual: Settings → Branches → Edit protection rule → Remove FlakeGuard checks

Gradual Rollback (Recommended)

Week 1: Mark FlakeGuard checks as non-required in branch protection
Week 2: Disable auto-rerun functionality (add --dry-run flag)
Week 3: Disable quarantine notifications (comment out PR steps)
Week 4: Full workflow disabling if needed

Data Preservation

Quarantine configuration remains in .flakeguard/quarantine.yml
Auto-rerun lock files preserve rerun history
No data loss - only analysis and notifications stop

📝 Testing Checklist

Core Functionality

Problem Matchers: PR annotations appear for test failures
Job Summaries: Rich Markdown reports in Actions UI
Auto-Rerun: Flaky-dominant detection triggers rerun exactly once
Quarantine: TTL validation and owner notifications work
Artifacts: Retry logic handles URL expiration correctly
Branch Protection: Auditor identifies correct required checks

Security & Permissions

Least Privilege: Workflows use minimal required permissions
Secret Scanning: No secrets exposed in logs or artifacts
Token Security: GitHub tokens handled securely
CODEOWNERS: Owner mapping works with precedence rules

Integration & Reliability

Workflow Integration: All FlakeGuard steps run without errors
Error Handling: Graceful degradation when components fail
Performance: No significant impact on CI runtime
Compatibility: Works across different runner environments

Documentation & Usability

CLI Help: All commands provide clear usage information
Error Messages: Helpful error messages with troubleshooting steps
Demo Scripts: All demo scripts run successfully
Policy Docs: Branch protection policy is clear and actionable

🎯 Demo Commands

# Run complete FlakeGuard demos
node demo-auto-rerun.js
node demo-quarantine-governor.js
node demo-artifact-retrieval.js

# Test CLI tools
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts analyze
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts validate .flakeguard/quarantine.yml
pnpm --filter @flakeguard/cli exec tsx src/fetch-artifact.ts list microsoft vscode

🚀 Post-Merge Actions

Enable Branch Protection: Add required checks from BRANCH_PROTECTION.md
Monitor Auto-Reruns: Track rerun frequency and success rates
Quarantine Review: Weekly review of expiring quarantines
Performance Monitoring: Track CI runtime impact
Team Training: Share FlakeGuard CLI usage with team

📚 Documentation Added

BRANCH_PROTECTION.md - Complete branch protection configuration guide
REQUIRED_CHECKS_POLICY.md - Check lifecycle management policy
AUTO-RERUN-DOCUMENTATION.md - Auto-rerun engine technical documentation
Updated README.md with Required Checks Policy section
Comprehensive CLI help for all tools

Ready for Review: This PR implements the complete FlakeGuard MVP with robust testing, security, and documentation. Please validate using the testing checklist above.
