This document describes the system architecture and design patterns used in FlakeGuard.
FlakeGuard is a distributed system designed to detect, monitor, and manage flaky tests in CI/CD pipelines. It follows a microservices architecture with clear separation of concerns.
```mermaid
graph TB
    subgraph "External Systems"
        GH[GitHub App]
        GA[GitHub Actions]
        SLACK[Slack Bot]
    end
    subgraph "FlakeGuard Core"
        API[API Server<br/>Fastify + TypeScript]
        WORKER[Worker Service<br/>BullMQ + TypeScript]
        WEB[Web UI<br/>React + TypeScript]
    end
    subgraph "Data Layer"
        PG[(PostgreSQL<br/>Test Results & Analysis)]
        REDIS[(Redis<br/>Job Queue & Cache)]
    end
    subgraph "Infrastructure"
        DOCKER[Docker Containers]
        NGINX[Reverse Proxy]
        MONITOR[Monitoring<br/>Prometheus + Grafana]
    end

    GH --> API
    GA --> GH
    SLACK --> API
    API --> PG
    API --> REDIS
    WORKER --> PG
    WORKER --> REDIS
    WEB --> API
    DOCKER --> API
    DOCKER --> WORKER
    DOCKER --> WEB
    NGINX --> DOCKER
    MONITOR --> DOCKER
```
The API server is the central component that handles:
- GitHub Webhook Processing: Receives and processes GitHub events
- REST API Endpoints: Provides programmatic access to FlakeGuard features
- Test Result Ingestion: Parses and stores JUnit XML test results
- Flake Analysis: Runs flakiness scoring algorithms
- Check Run Management: Creates and updates GitHub check runs
Technology Stack:
- Fastify 4 (Web framework)
- Prisma 5 (Database ORM)
- Zod (Schema validation)
- Pino (Structured logging)
- TypeScript (Type safety)
Key Features:
- Webhook signature validation
- Rate limiting and security headers
- Automatic OpenAPI documentation
- Health checks and metrics
- Error handling and logging
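Webhook signature validation can be sketched with Node's built-in crypto module. This is a minimal illustration, not FlakeGuard's actual implementation; the header format follows GitHub's `x-hub-signature-256` convention.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify a GitHub-style HMAC-SHA256 webhook signature.
// `signatureHeader` is the value of the `x-hub-signature-256` header,
// e.g. "sha256=5d61605c...".
function verifyWebhookSignature(
  payload: string,
  signatureHeader: string,
  secret: string,
): boolean {
  const expected =
    'sha256=' + createHmac('sha256', secret).update(payload).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so guard first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Constant-time comparison matters here: a naive `===` check can leak timing information about how many leading bytes of the signature were correct.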
The worker service processes background jobs asynchronously:
- Test Analysis Jobs: Analyzes test results for flakiness patterns
- Report Generation: Creates quarantine reports and recommendations
- Notification Jobs: Sends alerts and notifications
- Cleanup Jobs: Maintains data retention policies
Technology Stack:
- BullMQ (Job queue)
- Redis (Job storage)
- TypeScript (Type safety)
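As a hedged sketch of how the worker might route the four job types above to handlers (BullMQ's actual `Worker` API is not shown here; the handler bodies are placeholders):

```typescript
// Job names mirroring the worker's job types.
type JobName =
  | 'analyze-test-results'
  | 'generate-quarantine-plan'
  | 'send-notification'
  | 'cleanup-old-data';

// Map each job type to an async handler. Real handlers would hit
// PostgreSQL and the GitHub API; these stubs just return a status string.
const handlers: Record<JobName, (payload: unknown) => Promise<string>> = {
  'analyze-test-results': async () => 'analysis complete',
  'generate-quarantine-plan': async () => 'plan generated',
  'send-notification': async () => 'notification sent',
  'cleanup-old-data': async () => 'cleanup done',
};

async function dispatch(name: JobName, payload: unknown): Promise<string> {
  return handlers[name](payload);
}
```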
Job Types:
```typescript
enum JobType {
  ANALYZE_TEST_RESULTS = 'analyze-test-results',
  GENERATE_QUARANTINE_PLAN = 'generate-quarantine-plan',
  SEND_NOTIFICATION = 'send-notification',
  CLEANUP_OLD_DATA = 'cleanup-old-data',
}
```

PostgreSQL Database:
- Primary data store for all persistent data
- Optimized indexes for query performance
- ACID compliance for data integrity
Key Models:
- Installation: GitHub App installations
- Repository: Repository metadata and settings
- TestResult: Individual test execution results
- TestSuite: Test suite metadata and statistics
- FlakeDetection: Flaky test analysis results
- CheckRun: GitHub check run tracking
Redis Cache:
- Job queue storage
- Session caching
- Rate limiting counters
- Temporary data storage
```mermaid
sequenceDiagram
    participant GH as GitHub
    participant API as API Server
    participant QUEUE as Redis Queue
    participant WORKER as Worker
    participant DB as PostgreSQL

    GH->>API: Webhook Event
    API->>API: Validate Signature
    API->>API: Parse Event
    API->>DB: Store Event Data
    API->>QUEUE: Enqueue Analysis Job
    API->>GH: 200 OK
    QUEUE->>WORKER: Process Job
    WORKER->>DB: Fetch Test Data
    WORKER->>WORKER: Run Analysis
    WORKER->>DB: Store Results
    WORKER->>API: Update Check Run
    API->>GH: Update Check Run
```
```mermaid
sequenceDiagram
    participant CLIENT as Client/CI
    participant API as API Server
    participant PARSER as JUnit Parser
    participant DB as PostgreSQL
    participant QUEUE as Redis Queue

    CLIENT->>API: POST /api/ingestion/junit
    API->>API: Validate Headers
    API->>PARSER: Parse JUnit XML
    PARSER->>PARSER: Extract Test Cases
    PARSER->>API: Return Parsed Data
    API->>DB: Store Test Results
    API->>QUEUE: Enqueue Analysis Job
    API->>CLIENT: 201 Created
```
```mermaid
sequenceDiagram
    participant WORKER as Worker
    participant DB as PostgreSQL
    participant SCORER as Flake Scorer
    participant API as API Server
    participant GH as GitHub

    WORKER->>DB: Fetch Test History
    WORKER->>SCORER: Analyze Flakiness
    SCORER->>SCORER: Calculate Score
    SCORER->>WORKER: Return Analysis
    WORKER->>DB: Store Detection
    WORKER->>API: Trigger Check Run Update
    API->>GH: Update Check Run
```
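The scoring algorithm itself is not specified in this document. As one illustrative heuristic (an assumption, not FlakeGuard's actual scorer), a flakiness score can blend the overall failure rate with how often a test flips between pass and fail across consecutive runs:

```typescript
// Illustrative flakiness heuristic: blend failure rate with the
// pass/fail transition rate over a test's recent run history.
// `outcomes` is ordered oldest-to-newest; true = passed.
function flakinessScore(outcomes: boolean[]): number {
  if (outcomes.length < 2) return 0;
  const failures = outcomes.filter((passed) => !passed).length;
  const failRate = failures / outcomes.length;
  let transitions = 0;
  for (let i = 1; i < outcomes.length; i++) {
    if (outcomes[i] !== outcomes[i - 1]) transitions++;
  }
  const transitionRate = transitions / (outcomes.length - 1);
  // Weight flips more heavily: a test that always fails is broken, not flaky.
  return 0.4 * failRate + 0.6 * transitionRate;
}
```

A consistently failing test scores 0.4 here, while a strictly alternating pass/fail history scores 0.8, which matches the intuition that intermittent failures are the stronger flakiness signal.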
```mermaid
graph TD
    subgraph "Authentication Flow"
        CLIENT[Client Request]
        VALIDATE[Validate Token]
        GITHUB[GitHub App Auth]
        ACCESS[Access Control]
    end

    CLIENT --> VALIDATE
    VALIDATE --> GITHUB
    GITHUB --> ACCESS
    ACCESS --> API[API Endpoint]
```
Security Layers:
- GitHub App Authentication: OAuth2 flow with private key signing
- Webhook Signature Validation: HMAC-SHA256 verification
- Rate Limiting: Per-IP and per-endpoint limits
- Input Validation: Zod schema validation
- CORS Protection: Configurable origin policies
- Security Headers: Helmet.js security headers
- Encryption at Rest: Database encryption for sensitive data
- Encryption in Transit: TLS 1.3 for all communications
- Secret Management: Environment variables with rotation
- Access Logging: Comprehensive audit trails
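The per-IP rate limiting above can be sketched as a fixed-window counter. This is a simplified in-memory stand-in; in production the counters would live in Redis so that all API replicas share state, and the limits shown are placeholders.

```typescript
// Fixed-window rate limiter keyed by client identifier (e.g. IP address).
// The clock is injectable to make the windowing logic testable.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private limit: number,
    private windowMs: number,
    private now: () => number = Date.now,
  ) {}

  // Returns true if the request is allowed within the current window.
  allow(key: string): boolean {
    const t = this.now();
    const entry = this.counts.get(key);
    if (!entry || t - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: t, count: 1 });
      return true;
    }
    entry.count++;
    return entry.count <= this.limit;
  }
}
```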
API Server Scaling:
- Stateless design enables horizontal scaling
- Load balancer distributes requests
- Auto-scaling based on CPU/memory usage
Worker Scaling:
- Multiple worker processes
- Job distribution via Redis queue
- Dynamic scaling based on queue depth
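Dynamic scaling on queue depth could use a simple target-drain-time heuristic. The formula and bounds below are illustrative assumptions, not FlakeGuard's configured defaults:

```typescript
// Decide how many workers to run so the current backlog drains within a
// target time. Throughput and drain-time values are assumed tuning knobs.
function desiredWorkerCount(
  queueDepth: number,
  jobsPerWorkerPerMinute: number,
  targetDrainMinutes: number,
  minWorkers = 1,
  maxWorkers = 16,
): number {
  const needed = Math.ceil(
    queueDepth / (jobsPerWorkerPerMinute * targetDrainMinutes),
  );
  // Clamp to the configured scaling bounds.
  return Math.min(maxWorkers, Math.max(minWorkers, needed));
}
```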
Database Optimization:
- Optimized indexes for common queries
- Connection pooling
- Query optimization and monitoring
- Read replicas for analytics queries
Caching Strategy:
```typescript
// Multi-layer caching
interface CacheStrategy {
  l1: 'memory';   // In-process cache
  l2: 'redis';    // Distributed cache
  l3: 'database'; // Persistent storage
}
```

Query Patterns:
```sql
-- Optimized for time-series analysis
CREATE INDEX CONCURRENTLY idx_test_results_time_series
ON test_results (repository_id, created_at DESC, status);

-- Optimized for flake detection
CREATE INDEX CONCURRENTLY idx_test_results_flake_analysis
ON test_results (repository_id, test_full_name, created_at DESC);
```

```typescript
// Prometheus metrics
interface Metrics {
  httpRequests: Counter;
  processingDuration: Histogram;
  activeJobs: Gauge;
  errorRate: Counter;
  flakeDetectionAccuracy: Histogram;
}
```

```typescript
// Structured logging with correlation IDs
interface LogEntry {
  timestamp: string;
  level: 'info' | 'warn' | 'error';
  message: string;
  correlationId: string;
  context: Record<string, any>;
}
```

API Health Endpoints:
- /health: Basic liveness check
- /health/ready: Readiness check (database connectivity)
- /health/deep: Deep health check (all dependencies)
Worker Health Monitoring:
- Job processing metrics
- Queue depth monitoring
- Error rate tracking
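A deep health check can be sketched as an aggregator over injected dependency probes. The probe names below are illustrative; the real checks would ping PostgreSQL, Redis, and the GitHub API.

```typescript
// A probe resolves if the dependency is healthy and rejects otherwise.
type Probe = () => Promise<void>;

// Run all probes concurrently and report per-dependency status.
async function deepHealthCheck(
  probes: Record<string, Probe>,
): Promise<{ healthy: boolean; checks: Record<string, 'ok' | 'failed'> }> {
  const checks: Record<string, 'ok' | 'failed'> = {};
  await Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        await probe();
        checks[name] = 'ok';
      } catch {
        checks[name] = 'failed';
      }
    }),
  );
  return {
    healthy: Object.values(checks).every((s) => s === 'ok'),
    checks,
  };
}
```

Probes run concurrently so one slow dependency does not serialize the whole check; a timeout wrapper per probe would be the natural next step.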
```dockerfile
# Multi-stage build for optimization
FROM node:20-alpine AS base
FROM base AS deps
FROM base AS builder
FROM base AS runner
```

```yaml
# Docker Compose for development
version: '3.8'
services:
  api:
    build: ./apps/api
    environment:
      - DATABASE_URL
      - REDIS_URL
    depends_on:
      - postgres
      - redis
  worker:
    build: ./apps/worker
    environment:
      - DATABASE_URL
      - REDIS_URL
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: flakeguard
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
```

```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  RATE_LIMIT_ERROR = 'rate_limit_error',
  GITHUB_API_ERROR = 'github_api_error',
  DATABASE_ERROR = 'database_error',
  PROCESSING_ERROR = 'processing_error',
}
```

```typescript
// Exponential backoff with jitter
class RetryStrategy {
  async execute<T>(
    operation: () => Promise<T>,
    maxRetries: number = 3,
    baseDelay: number = 1000,
  ): Promise<T> {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        if (attempt === maxRetries) throw error;
        const delay = baseDelay * Math.pow(2, attempt - 1);
        const jitter = Math.random() * 0.1 * delay;
        await this.sleep(delay + jitter);
      }
    }
    throw new Error('Max retries exceeded');
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
```

```typescript
// Circuit breaker for external APIs
class CircuitBreaker {
  private failures = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private threshold = 5;
  private timeout = 60000;
}
```

```typescript
interface Config {
  // Server configuration
  server: {
    port: number;
    host: string;
    environment: 'development' | 'production' | 'test';
  };
  // Database configuration
  database: {
    url: string;
    poolSize: number;
    ssl: boolean;
  };
  // GitHub App configuration
  github: {
    appId: string;
    privateKey: string;
    webhookSecret: string;
  };
  // Feature flags
  features: {
    slackIntegration: boolean;
    advancedAnalytics: boolean;
    quarantineRecommendations: boolean;
  };
}
```

```typescript
// Zod schema for configuration validation
const configSchema = z.object({
  server: z.object({
    port: z.number().min(1).max(65535),
    host: z.string().min(1),
    environment: z.enum(['development', 'production', 'test']),
  }),
  database: z.object({
    url: z.string().url(),
    poolSize: z.number().min(1).max(100),
    ssl: z.boolean(),
  }),
});
```

```
        E2E Tests (Few)
      ┌─────────────────┐
      │   Integration   │
      │  Tests (Some)   │
      └─────────────────┘
    ┌───────────────────┐
    │ Unit Tests (Many) │
    └───────────────────┘
```
Unit Tests:
- Business logic validation
- Algorithm correctness
- Utility function testing
Integration Tests:
- Database interactions
- API endpoint testing
- Service integration
End-to-End Tests:
- Complete workflow testing
- GitHub integration testing
- User journey validation
```mermaid
graph TB
    subgraph "Current Architecture"
        API1[API Server]
        WORKER1[Worker Service]
    end
    subgraph "Future Architecture"
        GATEWAY[API Gateway]
        AUTH[Auth Service]
        ANALYSIS[Analysis Service]
        NOTIFICATION[Notification Service]
        REPORTING[Reporting Service]
    end

    API1 --> GATEWAY
    WORKER1 --> ANALYSIS
    WORKER1 --> NOTIFICATION
```
```typescript
// Event sourcing for audit trail
interface Event {
  id: string;
  type: string;
  aggregateId: string;
  payload: Record<string, any>;
  timestamp: Date;
  version: number;
}
```

```typescript
// ML pipeline for flake prediction
interface MLPipeline {
  featureExtraction: (testData: TestResult[]) => Features;
  modelInference: (features: Features) => Prediction;
  modelRetraining: (feedback: Feedback[]) => Model;
}
```

This architecture provides a solid foundation for scalable, maintainable, and secure flaky test detection while allowing for future enhancements and optimizations.