feat(flakeguard): A1–A5, B1–B3 action-only pipeline with soft gating

📋 What

This PR implements the complete FlakeGuard MVP as an Actions-only pipeline with soft gating:

A1-A5: Core FlakeGuard Engine Implementation

  • A1: Problem Matcher Author - GitHub Actions problem matchers for test/log error annotations
  • A2: Job Summary Author - Rich Markdown reports to $GITHUB_STEP_SUMMARY
  • A3: Auto-Rerun Engineer - Intelligent rerun for flaky-dominant failures with loop prevention
  • A4: Quarantine Governor & PR Notifier - TTL validation and owner notification system
  • A5: Artifact Retriever - Robust artifact downloads with URL expiration handling

B1-B3: Quality & Infrastructure

  • B1: Branch Protection Auditor - Automated workflow analysis and protection configuration
  • B2: Enhanced Workflows - Integrated FlakeGuard analysis with security permissions
  • B3: Comprehensive CLI Tools - Full command-line interface for all FlakeGuard operations

🎯 Why

Business Value:

  • Reduce False Positives: Flaky test failures no longer block legitimate code changes
  • Improve CI Reliability: Failed runs are rerun automatically when at least 70% of the failures come from known flaky tests
  • Cost Optimization: Smart detection prevents unnecessary workflow reruns
  • Developer Experience: Transparent decision making with clear audit trails

Technical Benefits:

  • Robust Loop Prevention: SHA-based lock files prevent infinite rerun cycles
  • Intelligent Analysis: Frequency × volatility scoring for flake confidence
  • Comprehensive Logging: Rich GitHub Step Summaries with actionable insights
  • Enterprise Security: Least-privilege permissions and secret scanning integration
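To make the "frequency × volatility" idea concrete, here is a minimal sketch. The PR does not spell out either term, so both definitions below are assumptions: frequency is taken as the share of failing runs, and volatility as the share of consecutive runs whose outcome flipped. `flakeScore` is a hypothetical name.

```typescript
// Assumed definitions (not taken from the PR):
//   frequency  = failing runs / total runs
//   volatility = outcome flips between consecutive runs / (total runs - 1)
// history is chronological; true = pass, false = fail.
function flakeScore(history: boolean[]): number {
  if (history.length < 2) return 0; // not enough data to observe a flip

  const failures = history.filter((passed) => !passed).length;
  const frequency = failures / history.length;

  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  const volatility = flips / (history.length - 1);

  return frequency * volatility;
}
```

Under these definitions a test that always fails scores 0, since it never flips, which keeps consistently broken tests from being treated as flaky.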

🔧 How to Test

1. FlakeGuard Problem Matchers & Annotations

# Run CI workflow and check PR Files tab for annotations
git push origin feature/flakeguard-mvp

# Verify problem matcher registration
grep -r "add-matcher" .github/workflows/

Expected: PR annotations appear on changed lines showing test failures and FlakeGuard analysis
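For reviewers unfamiliar with problem matchers, the sketch below shows the general shape of a matcher file and how its capture groups map to annotation fields. The log line format and the `flakeguard-sketch` owner id are assumptions for illustration; the patterns shipped in this PR may differ.

```typescript
// Hypothetical log format (an assumption, not taken from the PR):
//   [FLAKY] src/foo.test.ts:12:5 intermittent timeout
const flakyLine = /^\[(FLAKY|FAILED|QUARANTINED)\]\s+(.+?):(\d+):(\d+)\s+(.+)$/;

// Shape of a GitHub Actions problem matcher file.
// The numbers refer to capture groups in the regexp.
const matcher = {
  problemMatcher: [
    {
      owner: "flakeguard-sketch", // hypothetical owner id
      pattern: [
        { regexp: flakyLine.source, file: 2, line: 3, column: 4, message: 5 },
      ],
    },
  ],
};

interface Annotation {
  tag: string;
  file: string;
  line: number;
  column: number;
  message: string;
}

// Applies the same regexp locally, which is handy for unit-testing a pattern
// before registering it with ::add-matcher::.
function parseLine(text: string): Annotation | null {
  const m = flakyLine.exec(text);
  if (m === null) return null;
  return {
    tag: m[1],
    file: m[2],
    line: Number(m[3]),
    column: Number(m[4]),
    message: m[5],
  };
}

// JSON that would be written to the matcher file.
const matcherJson = JSON.stringify(matcher, null, 2);
```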

2. Job Summary Generation

# Test summary generation with sample data
pnpm --filter @flakeguard/cli exec tsx src/summary.ts generate \
  --report .flakeguard/out/flaky-report.json \
  --quarantine .flakeguard/quarantine.yml

Expected: Rich Markdown summary with metrics tables, top flaky tests, and governance status
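As a rough illustration of the summary step, the sketch below renders a "top flaky tests" table as Markdown. The `FlakyTest` fields and the function names are hypothetical; the real report schema in `flaky-report.json` may look different.

```typescript
// Hypothetical report entry; field names are assumptions.
interface FlakyTest {
  name: string;
  failures: number;
  runs: number;
}

function failureRate(t: FlakyTest): number {
  return t.runs === 0 ? 0 : t.failures / t.runs;
}

// Renders the topN tests by failure rate as a Markdown table.
function renderSummary(tests: FlakyTest[], topN = 5): string {
  const rows = [...tests]
    .sort((a, b) => failureRate(b) - failureRate(a))
    .slice(0, topN)
    .map(
      (t) =>
        `| ${t.name} | ${t.failures}/${t.runs} | ${(100 * failureRate(t)).toFixed(1)}% |`,
    );

  return [
    "## FlakeGuard summary",
    "",
    "| Test | Failures | Failure rate |",
    "| --- | --- | --- |",
    ...rows,
  ].join("\n") + "\n";
}
```

In a workflow step, the returned string would be appended to the file whose path is in `$GITHUB_STEP_SUMMARY`, for example with `fs.appendFileSync`.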

3. Auto-Rerun Logic

# Demo auto-rerun decision making
node demo-auto-rerun.js

# Test with different scenarios
pnpm --filter @flakeguard/cli exec tsx src/auto-rerun.ts analyze \
  --report .flakeguard/out/flaky-report.json \
  --quarantine .flakeguard/quarantine.yml \
  --sha "test-sha-123" \
  --run-id "99887766" \
  --dry-run

Expected:

  • Flaky-dominant failures (≥70%) trigger rerun
  • Lock files prevent multiple reruns per SHA
  • Clear logging shows decision reasoning
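The decision rule above can be sketched as follows. This is a simplified model, not the code in `auto-rerun.ts`: the in-memory `lockedShas` set stands in for the `.flakeguard/.rerun.lock.<sha>` files, and all type and function names are hypothetical.

```typescript
interface RerunInput {
  failedTests: string[];
  knownFlaky: Set<string>;
  lockedShas: Set<string>; // stands in for .flakeguard/.rerun.lock.<sha> files
  sha: string;
}

interface RerunDecision {
  rerun: boolean;
  reason: string;
  flakyShare: number;
}

const FLAKY_THRESHOLD = 0.7; // the 70% flaky-dominance threshold

function decideRerun(input: RerunInput): RerunDecision {
  const { failedTests, knownFlaky, lockedShas, sha } = input;

  if (failedTests.length === 0) {
    return { rerun: false, reason: "no failures", flakyShare: 0 };
  }

  const flakyCount = failedTests.filter((t) => knownFlaky.has(t)).length;
  const flakyShare = flakyCount / failedTests.length;

  // Loop prevention: at most one rerun per commit.
  if (lockedShas.has(sha)) {
    return { rerun: false, reason: "already rerun for this SHA", flakyShare };
  }
  if (flakyShare < FLAKY_THRESHOLD) {
    return { rerun: false, reason: "failures are not flaky-dominant", flakyShare };
  }

  lockedShas.add(sha); // take the lock before triggering the rerun
  return { rerun: true, reason: "flaky-dominant failures", flakyShare };
}
```

Taking the lock before the rerun is triggered means a crash mid-rerun errs on the side of no second attempt, which matches the "maximum 1 rerun per commit" rule.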

4. Quarantine Governance

# Test quarantine schema validation
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts validate .flakeguard/quarantine.yml

# Generate PR governance comment
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts report \
  .flakeguard/out/flaky-report.json .flakeguard/quarantine.yml

Expected:

  • Schema validation passes for quarantine config
  • PR comments show expiring quarantines and owner mentions
  • CODEOWNERS mapping works correctly
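To show what validation and TTL checking amount to, here is a dependency-free sketch. The PR uses Zod for the schema; plain TypeScript checks are substituted here so the example runs on its own. The ISO date format for `until`, the 7-day warning window, and the function names are assumptions.

```typescript
interface QuarantineEntry {
  testKey: string;
  owner: string;
  reason: string;
  until: string; // assumed to be an ISO 8601 date string
}

type TtlStatus = "active" | "expiring" | "expired";

const REQUIRED_FIELDS = ["testKey", "owner", "reason", "until"];

// Returns a list of human-readable problems; an empty list means valid.
function validateEntry(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["entry must be an object"];
  const rec = raw as Record<string, unknown>;
  const errors: string[] = [];

  for (const field of REQUIRED_FIELDS) {
    const value = rec[field];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }

  const until = rec["until"];
  if (typeof until === "string" && until.trim() !== "" && Number.isNaN(Date.parse(until))) {
    errors.push("until must be a parseable date");
  }
  return errors;
}

// Classifies an entry relative to `now`; warnDays is a hypothetical default.
function ttlStatus(entry: QuarantineEntry, now: Date, warnDays = 7): TtlStatus {
  const msLeft = Date.parse(entry.until) - now.getTime();
  if (msLeft <= 0) return "expired";
  return msLeft <= warnDays * 24 * 60 * 60 * 1000 ? "expiring" : "active";
}
```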

5. Artifact Retrieval with Retry Logic

# Test artifact listing and download
pnpm --filter @flakeguard/cli exec tsx src/fetch-artifact.ts list microsoft vscode

# Test bulk download with filtering
pnpm --filter @flakeguard/cli exec tsx src/bulk-artifact-download.ts download-filtered \
  microsoft vscode ./test-downloads "logs.*electron"

Expected:

  • Robust retry logic handles URL expiration (1-minute TTL)
  • Clear logging shows download status and retry attempts
  • Non-retryable errors correctly identified
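The retry behaviour can be sketched as below. The key design point is that each attempt asks for a fresh download URL instead of reusing one that may have expired. Everything here is illustrative: the function names, the delay constants, and the status classification are assumptions, in particular treating 403 on the signed URL as a sign of expiry.

```typescript
// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Assumed classification: 403 (possibly an expired signed URL), 429 and 5xx
// are worth retrying; other statuses such as 401 and 404 are not.
function isRetryable(status: number): boolean {
  return status === 403 || status === 429 || status >= 500;
}

interface Attempt<T> {
  ok: boolean;
  status: number;
  value?: T;
}

async function downloadWithRetry<T>(
  getFreshUrl: () => Promise<string>,
  fetchUrl: (url: string) => Promise<Attempt<T>>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastStatus = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Re-resolve the URL on every attempt so an expired one is never reused.
    const url = await getFreshUrl();
    const res = await fetchUrl(url);
    if (res.ok && res.value !== undefined) return res.value;

    lastStatus = res.status;
    if (!isRetryable(res.status)) break; // fail fast on permanent errors
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  throw new Error(`download failed, last status ${lastStatus}`);
}
```

Passing `sleep` as a parameter keeps the function testable without real delays.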

6. Branch Protection Auditing

# Analyze workflow configuration
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts analyze

# Generate protection configuration
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts config

Expected:

  • 14 required checks identified correctly
  • Clear separation of stable vs experimental checks
  • Ready-to-use branch protection configuration
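For orientation, the sketch below shows one way the auditor's output could be turned into a branch protection fragment. The `JobInfo` fields and the rule "required means neither experimental nor conditional" are assumptions; only the `required_status_checks` object with `strict` and `contexts` follows the GitHub REST API body. A full request needs additional fields that are omitted here.

```typescript
// Hypothetical per-job facts the auditor might extract from workflow files.
interface JobInfo {
  name: string;
  experimental: boolean;
  conditional: boolean; // e.g. guarded by an `if:` expression
}

// Assumed rule: only stable, unconditional jobs become required checks.
function requiredContexts(jobs: JobInfo[]): string[] {
  return jobs
    .filter((j) => !j.experimental && !j.conditional)
    .map((j) => j.name)
    .sort();
}

// Fragment of a branch protection request body (not a complete payload).
function protectionFragment(jobs: JobInfo[]) {
  return {
    required_status_checks: {
      strict: true,
      contexts: requiredContexts(jobs),
    },
  };
}
```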

🛡️ Security & Permissions

Explicit Least-Privilege Configuration

permissions:
  contents: read          # Repository checkout and config access
  pull-requests: write    # PR comments and annotations
  actions: read           # Artifact downloads and workflow logs
  checks: write           # FlakeGuard status check creation
  security-events: write  # SARIF upload for security scans

Secret Scanning Integration

  • TruffleHog and Gitleaks integration
  • No secrets exposed in logs or artifacts
  • Secure token handling for GitHub API calls

📊 Key Metrics & Acceptance Criteria

A1: Problem Matchers ✅

  • Regex patterns for Vitest, Jest, TypeScript, ESLint
  • FlakeGuard-specific patterns for [FLAKY], [FAILED], [QUARANTINED]
  • PR annotations appear on relevant lines
  • Workflow registration with ::add-matcher::

A2: Job Summaries ✅

  • Rich Markdown with tables, metrics, recommendations
  • Governance section linking to PR comments
  • Artifact URLs and environment analysis
  • Color-coded badges and urgency indicators

A3: Auto-Rerun ✅

  • 70% flaky-dominance threshold implementation
  • SHA-based lock files (.flakeguard/.rerun.lock.<sha>)
  • Dual execution (REST API + gh CLI)
  • Maximum 1 rerun per commit enforcement

A4: Quarantine Governor ✅

  • Zod schema validation for {testKey, owner, reason, until}
  • CODEOWNERS file parsing with precedence rules
  • Consolidated PR comments with owner mentions
  • TTL validation and expiration warnings
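The CODEOWNERS precedence rule (the last matching pattern wins) can be illustrated with a deliberately simplified matcher. It only understands `*`, `*.ext`, root-relative directory prefixes, and exact paths; real CODEOWNERS patterns are richer, and the parser in this PR may handle more cases. All names are hypothetical.

```typescript
interface OwnerRule {
  pattern: string;
  owners: string[];
}

// Parses CODEOWNERS text, dropping comments and blank lines.
function parseCodeowners(text: string): OwnerRule[] {
  const rules: OwnerRule[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.replace(/#.*$/, "").trim();
    if (line === "") continue;
    const parts = line.split(/\s+/);
    rules.push({ pattern: parts[0], owners: parts.slice(1) });
  }
  return rules;
}

// Simplified matching: "*", "*.ext", "dir/" prefix, or exact path.
// Directory patterns are treated as root-relative, which is a simplification.
function matchesPattern(pattern: string, path: string): boolean {
  if (pattern === "*") return true;
  if (pattern.startsWith("*.")) return path.endsWith(pattern.slice(1));
  const anchored = pattern.replace(/^\//, "");
  if (anchored.endsWith("/")) return path.startsWith(anchored);
  return path === anchored;
}

// Later rules override earlier ones, so keep the last match.
function ownersFor(rules: OwnerRule[], path: string): string[] {
  let owners: string[] = [];
  for (const rule of rules) {
    if (matchesPattern(rule.pattern, path)) owners = rule.owners;
  }
  return owners;
}
```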

A5: Artifact Retriever ✅

  • GitHub API integration with redirect handling
  • 1-minute URL TTL refresh logic
  • Exponential backoff retry strategy
  • Bulk download with concurrency control

B1: Branch Protection Auditor ✅

  • 10 workflows analyzed with 44 total jobs
  • 14 required vs 30 optional/conditional checks
  • Complete policy documentation generated
  • CLI tools for ongoing configuration management

🔄 Rollback Plan

Immediate Rollback (< 5 minutes)

# Disable FlakeGuard workflows
gh workflow disable flakeguard.yml
gh workflow disable pr-notifications.yml
gh workflow disable quarantine-pr.yml

# Remove from branch protection (if enabled)
# Manual: Settings → Branches → Edit protection rule → Remove FlakeGuard checks

Gradual Rollback (Recommended)

  1. Week 1: Mark FlakeGuard checks as non-required in branch protection
  2. Week 2: Disable auto-rerun functionality (add --dry-run flag)
  3. Week 3: Disable quarantine notifications (comment out PR steps)
  4. Week 4: Full workflow disabling if needed

Data Preservation

  • Quarantine configuration remains in .flakeguard/quarantine.yml
  • Auto-rerun lock files preserve rerun history
  • No data is lost: only analysis and notifications stop

📝 Testing Checklist

Core Functionality

  • Problem Matchers: PR annotations appear for test failures
  • Job Summaries: Rich Markdown reports in Actions UI
  • Auto-Rerun: Flaky-dominant detection triggers rerun exactly once
  • Quarantine: TTL validation and owner notifications work
  • Artifacts: Retry logic handles URL expiration correctly
  • Branch Protection: Auditor identifies correct required checks

Security & Permissions

  • Least Privilege: Workflows use minimal required permissions
  • Secret Scanning: No secrets exposed in logs or artifacts
  • Token Security: GitHub tokens handled securely
  • CODEOWNERS: Owner mapping works with precedence rules

Integration & Reliability

  • Workflow Integration: All FlakeGuard steps run without errors
  • Error Handling: Graceful degradation when components fail
  • Performance: No significant impact on CI runtime
  • Compatibility: Works across different runner environments

Documentation & Usability

  • CLI Help: All commands provide clear usage information
  • Error Messages: Helpful error messages with troubleshooting steps
  • Demo Scripts: All demo scripts run successfully
  • Policy Docs: Branch protection policy is clear and actionable

🎯 Demo Commands

# Run complete FlakeGuard demos
node demo-auto-rerun.js
node demo-quarantine-governor.js  
node demo-artifact-retrieval.js

# Test CLI tools
pnpm --filter @flakeguard/cli exec tsx src/branch-protection-auditor.ts analyze
pnpm --filter @flakeguard/cli exec tsx src/quarantine-governor.ts validate .flakeguard/quarantine.yml
pnpm --filter @flakeguard/cli exec tsx src/fetch-artifact.ts list microsoft vscode

🚀 Post-Merge Actions

  1. Enable Branch Protection: Add required checks from BRANCH_PROTECTION.md
  2. Monitor Auto-Reruns: Track rerun frequency and success rates
  3. Quarantine Review: Weekly review of expiring quarantines
  4. Performance Monitoring: Track CI runtime impact
  5. Team Training: Share FlakeGuard CLI usage with team

📚 Documentation Added

  • BRANCH_PROTECTION.md - Complete branch protection configuration guide
  • REQUIRED_CHECKS_POLICY.md - Check lifecycle management policy
  • AUTO-RERUN-DOCUMENTATION.md - Auto-rerun engine technical documentation
  • Updated README.md with Required Checks Policy section
  • Comprehensive CLI help for all tools

Ready for Review: This PR implements the complete FlakeGuard MVP with robust testing, security, and documentation. Please validate using the testing checklist above.