This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
ERT (Errata Reliability Team) Release Tests - A Python-based automation framework for managing OpenShift z-stream releases. This project automates QE tasks throughout the release lifecycle, including advisory management, bug tracking, test execution, and release approval workflows.
Key Technologies:
- Python 3.11+
- Click framework (CLI)
- StateBox (GitHub-backed YAML for release state tracking)
- Google Sheets API (backward compatibility for old releases)
- Errata Tool, Jira, GitLab, Jenkins, Slack integrations
- Prow/Gangway (CI test orchestration)
- Kerberos authentication (for Red Hat internal services)
# Install the package in editable mode
pip3 install -e .
# Alternative: Use Makefile
make install
# Uninstall
make uninstall
# Clean reinstall
make clean-install
Python Version: Requires Python 3.11+
# Run all tests with pytest
pytest
# Run specific test file
pytest tests/test_advisory.py
# Run with verbose output
pytest -v
# Run tests matching pattern
pytest -k test_jenkins
Test files are located in the tests/ directory.
This project provides three main CLI entry points, plus an MCP server for AI agent integration:
Main CLI for z-stream release management:
# General syntax
oar -r <release-version> [OPTIONS] COMMAND [ARGS]
# Common commands (see AGENTS.md for full list)
oar -r 4.19.1 create-test-report
oar -r 4.19.1 take-ownership -e user@redhat.com
oar -r 4.19.1 update-bug-list
oar -r 4.19.1 check-greenwave-cvp-tests
oar -r 4.19.1 change-advisory-status
# Enable debug logging
oar -r 4.19.1 -v create-test-report
CLI for automated agents and controllers:
# Start release detector
oarctl start-release-detector -r 4.19
# Start Jira notificator
oarctl jira-notificator --dry-run
oarctl jira-notificator --from-date 2025-01-15
Entry point is prow/setup.py, which installs the job and jobctl commands:
# Run specific Prow job
job run <job_name> --payload $image_pullspec
# Start job controller (monitors builds and triggers tests)
jobctl start-controller -r 4.19 --nightly --arch amd64
# Start test result aggregator
jobctl start-aggregator --arch amd64
Exposes all OAR commands as MCP tools for AI agents:
# Start MCP server
cd mcp_server
python3 server.py
# Access via AI agents (Claude Code, etc.)
# Server runs at http://localhost:8000 by default
See the "MCP Server" section under Architecture for detailed information.
release-tests/
├── oar/ # Main OAR package
│ ├── cli/ # Click-based CLI commands
│ ├── core/ # Core modules (see below)
│ ├── controller/ # Release detector agent
│ └── notificator/ # Jira notificator agent
├── prow/ # Prow job controller system
│ └── job/ # Job orchestration, test aggregation
├── mcp_server/ # MCP server for AI agent integration
│ └── server.py # FastMCP server exposing OAR commands as tools
├── tools/ # Standalone tools (Slack bot, checkers)
├── tests/ # Unit tests
└── _releases/ # GitHub-based persistent state (build tracking)
The OAR system is built on a layered architecture where all modules depend on ConfigStore for configuration:
Foundation:
- configstore.py - Encrypted config management (JWE), release-specific settings
- exceptions.py - Custom exception types for all modules
- util.py - Shared utilities (version validation, URL builders, logging)
External Service Integrations:
- advisory.py - Errata Tool interactions (requires Kerberos)
- jira.py - Jira API client for issue tracking
- statebox.py - GitHub-backed YAML state management (primary for new releases)
- worksheet.py - Google Sheets test reports (backward compatibility for old releases)
- notification.py - Slack notifications
- shipment.py - GitLab MR management for Konflux workflow
- jenkins.py - Jenkins job triggering and monitoring
- ldap.py - LDAP lookups for manager hierarchy
- git.py - Git operations for shipment data
Orchestration Layer:
- operators.py - Composite operators that coordinate across multiple modules (ReleaseOwnershipOperator, BugOperator, ApprovalOperator, ReleaseShipmentOperator, etc.)
Key Design Pattern: All core modules follow Manager/Helper pattern with:
- Integration with ConfigStore
- Custom exception handling
- Logging for observability
The system supports two distinct release workflows:
- Errata Flow (traditional): Advisory-based using Errata Tool
- Konflux Flow (newer): GitLab MR-based shipment workflow
OAR commands automatically detect and handle both flows based on ConfigStore configuration.
Six automated agents work together for end-to-end release automation (see AGENTS.md for details):
- Release Detector (oar/controller/detector.py) - Detects new z-stream releases
- Job Controller (prow/job/controller.py) - Monitors builds, triggers Prow tests
- Test Result Aggregator (prow/job/controller.py) - Processes results, implements retry logic
- Jira Notificator (oar/notificator/jira_notificator.py) - Escalates unverified bugs
- Slack Bot (tools/slack_message_receiver.py) - Executes OAR commands via Slack
- Test Result Checker (tools/auto_release_test_result_checker.py) - Notifies about rejected builds
The MCP (Model Context Protocol) server (mcp_server/server.py) exposes OAR commands as structured tools for AI agents like Claude Code.
What it does:
- Wraps all OAR/oarctl/job/jobctl commands as MCP tools (27 total)
- Provides structured input/output for AI agent interaction
- Categorizes operations by safety (read-only, write, critical)
- Validates environment on startup
- Runs as HTTP server (standard MCP transport protocol) for remote access
- Uses direct Click invocation (70-90% faster than subprocess)
- Implements async concurrency via ThreadPoolExecutor for handling multiple AI agent requests simultaneously
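The concurrency point above can be illustrated with a minimal sketch of offloading blocking work to a thread pool from async code. It is not taken from server.py: `run_blocking_command`, `call_tool`, and `handle_requests` are hypothetical names, and the sleep stands in for a synchronous Click command invocation.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Pool size of 20 mirrors the "thread_pool" value in the health check example;
# the real server may size its pool differently.
EXECUTOR = ThreadPoolExecutor(max_workers=20)


def run_blocking_command(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a synchronous Click command invocation."""
    time.sleep(delay)  # simulates blocking I/O (Jira, Errata Tool, ...)
    return f"{name}: done"


async def call_tool(name: str) -> str:
    """Run the blocking command in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(EXECUTOR, run_blocking_command, name)


async def handle_requests(names: list[str]) -> list[str]:
    """Serve several tool calls concurrently; results keep the input order."""
    return await asyncio.gather(*(call_tool(n) for n in names))


results = asyncio.run(handle_requests(["update-bug-list", "image-signed-check"]))
```

Because each call sleeps in its own worker thread, the two requests overlap instead of running back to back, which is the benefit the bullet describes for simultaneous AI agent requests.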
Categories of tools:
- Read-only tools - Safe query operations (check-greenwave-cvp-tests, check-cve-tracker-bug, image-signed-check, is-release-shipped)
- Status check tools - Query job status (image-consistency-check -i, stage-testing -n)
- Write operations - Modify state (create-test-report, update-bug-list, take-ownership)
- Critical operations - Production impact (push-to-cdn-staging, change-advisory-status)
- Controller tools - Background agents (start-release-detector, jira-notificator)
- Job controller tools - CI orchestration (start-controller, trigger-jobs-for-build)
- Generic runners - Advanced usage (oar_run_command, jobctl_run_command)
- Configuration tools - Metadata access (oar_get_release_metadata, is-release-shipped)
- Cache management tools - Performance optimization (mcp_cache_stats, mcp_cache_invalidate, mcp_cache_warm)
Running the server:
# Navigate to MCP server directory
cd mcp_server
# Check environment setup
./check_env.sh
# Start server (default: localhost:8000)
python3 server.py
# For remote access (using run_server.sh)
./run_server.sh 0.0.0.0 8080
# Or edit server.py line 1936 directly:
# mcp.run(transport="http", host="0.0.0.0", port=8080)
Health check endpoint:
The server exposes an HTTP health check endpoint at /health for monitoring and load balancer compatibility:
# Check server health
curl http://localhost:8000/health
# Example response (200 OK if healthy, 503 if degraded):
{
"status": "healthy",
"server": "release-tests-mcp",
"version": "1.0.0",
"transport": "http",
"tools": {
"total": 28,
"cli": 17,
"api": 11
},
"environment": {
"valid": true,
"missing_required": [],
"missing_optional": []
},
"kerberos": {
"valid": true,
"status": "valid"
},
"cache": {
"enabled": true,
"size": 0,
"max_size": 50,
"hit_rate": "0.00%",
"ttl_days": 7
},
"thread_pool": {
"size": 20,
"cpu_count": 10
},
"timestamp": "2025-12-22T10:30:00Z"
}
Use cases:
- Load balancer liveness checks
- Container orchestration health probes (Kubernetes/OpenShift)
- Monitoring tools (Prometheus, Datadog, etc.)
- Manual health verification during debugging
Response codes:
- 200 OK - Server is healthy (all required environment variables configured AND valid Kerberos ticket)
- 503 Service Unavailable - Server is degraded (missing required environment variables OR no Kerberos ticket)
- 500 Internal Server Error - Health check itself failed (unexpected error)
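The response code rules above reduce to a small decision function. The sketch below is an illustration of that mapping only; `health_status_code` and `safe_health_status_code` are hypothetical names, not functions from the server.

```python
from typing import Callable


def health_status_code(missing_required: list[str], kerberos_valid: bool) -> int:
    """Map the two documented health inputs to an HTTP status code."""
    if not missing_required and kerberos_valid:
        return 200  # healthy: nothing missing AND a valid Kerberos ticket
    return 503  # degraded: something missing OR no valid ticket


def safe_health_status_code(check: Callable[[], tuple[list[str], bool]]) -> int:
    """Return 500 when the health check itself raises unexpectedly."""
    try:
        missing_required, kerberos_valid = check()
        return health_status_code(missing_required, kerberos_valid)
    except Exception:
        return 500
```

A probe that only looks at the status code can then treat 200 as ready and anything else as not ready, without parsing the JSON body.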
Environment requirements:
- All OAR CLI environment variables (OAR_JWK, JIRA_TOKEN, GCP_SA_FILE, etc.)
- Server validates the environment on startup and exits if critical variables are missing
Use cases:
- AI-assisted release management workflows
- Automated release operations via Claude Code
- Interactive debugging and troubleshooting
- Documentation and training with AI guidance
Safety features:
- Operations clearly marked with warning emoji (⚠️ WRITE, ⚠️ CRITICAL)
- Read-only operations for safe exploration
- Timeout handling (default 10 minutes)
- Structured error reporting
Client configuration:
For Claude Code CLI, configure the MCP server in ~/.claude.json:
{
"mcpServers": {
"release-tests": {
"transport": {
"type": "http",
"url": "http://localhost:8000/mcp"
}
}
}
}
For remote servers:
{
"mcpServers": {
"release-tests": {
"transport": {
"type": "http",
"url": "http://your-server-hostname:8000/mcp"
}
}
}
}
Development notes:
- Built with FastMCP framework
- Tool definitions include comprehensive docstrings for AI context
- All tools wrap existing CLI commands (no new business logic)
- See AGENTS.md for complete tool reference
Critical for OAR CLI:
- OAR_JWK - Encryption key for config_store.json (from Bitwarden: openshift-qe-trt-env-vars)
- JIRA_TOKEN - Jira personal access token
- GCP_SA_FILE - Google Cloud service account credentials file path (optional for new releases using StateBox; required for old releases with Google Sheets)
- SLACK_BOT_TOKEN - Slack bot token
- JENKINS_USER / JENKINS_TOKEN - Jenkins credentials
- GITLAB_TOKEN - GitLab personal access token
- Kerberos ticket required: kinit $kid@$domain
For Controllers/Agents:
- GITHUB_TOKEN - GitHub API access
- APITOKEN - Prow/Gangway API token
- GCS_CRED_FILE - Google Cloud Storage credentials (for test artifacts)
For Slack Bot:
- SLACK_APP_TOKEN - Slack app-level token (Socket Mode)
- SLACK_BOT_TOKEN - Slack bot token
See AGENTS.md for complete environment variable breakdown by component.
Centralized configuration management:
- Loads encrypted config_store.json using JWE (decrypted via OAR_JWK)
- Stores release-specific data: advisory IDs, Jira tickets, Google Sheet URLs, shipment MR URLs
- All OAR commands access configuration through ConfigStore
- Supports both Errata and Konflux workflow modes
The MCP server implements intelligent caching of ConfigStore instances for performance optimization:
Design:
- Scope: Per z-stream release (e.g., "4.19.1")
- TTL: 7 days (aligns with weekly release schedule)
- Max size: 50 entries with LRU eviction
- Thread-safe: Uses RLock for concurrent AI agent requests
- Implementation: Built on cachetools.TTLCache
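The design points above (per-release keys, TTL, bounded size with LRU eviction, an RLock) can be sketched with the standard library alone. This is a stand-in written for illustration, not the server's cachetools-based implementation; `TTLLRUCache`, `get_or_load`, and the injectable `clock` are hypothetical.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable


class TTLLRUCache:
    """Small thread-safe cache with per-entry TTL and LRU eviction."""

    def __init__(self, max_size: int = 50, ttl_seconds: float = 7 * 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._lock = threading.RLock()
        self._entries: OrderedDict[str, tuple[float, Any]] = OrderedDict()

    def get_or_load(self, release: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached value for a release, loading it on miss or expiry."""
        with self._lock:
            entry = self._entries.get(release)
            if entry is not None and self._clock() - entry[0] < self._ttl:
                self._entries.move_to_end(release)  # mark as most recently used
                return entry[1]
            value = loader(release)  # expensive path (decrypt, fetch, parse)
            self._entries[release] = (self._clock(), value)
            self._entries.move_to_end(release)
            while len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict least recently used
            return value

    def invalidate(self, release: str) -> None:
        """Drop one release so the next access reloads it."""
        with self._lock:
            self._entries.pop(release, None)
```

For simplicity the sketch holds the lock while the loader runs, so concurrent misses are serialized; a production cache might load outside the lock to avoid blocking hits for other releases.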
Performance Impact:
- Cache miss (first access): ~1000ms, including:
  - JWE decryption: ~5-10ms
  - GitHub HTTP request: ~300-800ms (major bottleneck)
  - YAML parsing: ~10-50ms
- Cache hit (subsequent access): <10ms (about 100x faster than a miss)
Why Caching is Needed: ConfigStore data is immutable after ART announces a release. Without caching, every MCP tool call pays the full initialization cost even for the same release. For typical AI agent workflows accessing the same release multiple times, this results in significant latency reduction.
Example Performance Gain:
Without cache (3 tool calls for same release):
- oar_get_release_metadata('4.19.1'): 1000ms
- oar_is_release_shipped('4.19.1'): 1000ms
- oar_get_release_status('4.19.1'): 1000ms
Total: ~3000ms
With cache (3 tool calls for same release):
- oar_get_release_metadata('4.19.1'): 1000ms (cache miss)
- oar_is_release_shipped('4.19.1'): <10ms (cache hit)
- oar_get_release_status('4.19.1'): <10ms (cache hit)
Total: ~1020ms (3x faster)
Cache Management Tools:
- mcp_cache_stats() - View cache hit rate, size, and entries
- mcp_cache_invalidate(release) - Manually refresh cache for specific release
- mcp_cache_warm(releases) - Pre-populate cache before operations
Manual Invalidation:
Rarely needed since ConfigStore data is immutable. Only required if ART updates build data after initial announcement (exceptional case). Use mcp_cache_invalidate("4.19.1") to refresh.
Note: Caching is only used in MCP server, not in CLI commands (which are short-lived processes where caching provides no benefit).
StateBox is the primary state management system for new releases:
- GitHub-backed YAML storage at _releases/{y-stream}/statebox/{release}.yaml
- Tracks task status: "Not Started" → "In Progress" → "Pass" / "Fail"
- Records task execution results and timestamps
- Manages blocking/non-blocking issues with resolution tracking
- Automatic updates via cli_result_callback parsing command output
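As an illustration of the storage layout above, the following sketch derives the StateBox file path from a release version. It assumes the y-stream is the release with its last component removed (4.19 for 4.19.1); `statebox_path` is a hypothetical helper, not a function from statebox.py.

```python
from pathlib import PurePosixPath


def statebox_path(release: str) -> PurePosixPath:
    """Build the StateBox file path for a z-stream release such as '4.19.1'."""
    y_stream, _, z = release.rpartition(".")
    # Require exactly major.minor before the final numeric component.
    if not y_stream or not z.isdigit() or y_stream.count(".") != 1:
        raise ValueError(f"expected a z-stream version like 4.19.1, got {release!r}")
    return PurePosixPath("_releases") / y_stream / "statebox" / f"{release}.yaml"
```

Calling `statebox_path("4.19.1")` gives `_releases/4.19/statebox/4.19.1.yaml`, while a y-stream value such as `4.19` raises `ValueError` instead of producing a misleading path.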
Backward Compatibility:
- Old releases (before StateBox migration) use Google Sheets test reports
- Commands automatically detect and use appropriate system
- Google Sheets still supported via WorksheetManager for legacy releases
Task Status Logging:
All CLI commands use util.log_task_status() to output status markers:
- Logs format: "task [{Display Name}] status is changed to [{Status}]"
- cli_result_callback parses last line to auto-update StateBox
- Ensures consistent status tracking without explicit StateBox calls
The system uses a GitHub repository (the _releases/ directory on the record branch) for persistent state:
- Current build tracking files
- Test result JSON files
- Aggregation status markers
The ApprovalOperator implements a background scheduler with:
- File-based locking (/tmp/oar_scheduler_*.lock)
- Periodic metadata URL checks (every 30 minutes)
- Timeout handling (2 days default)
- Logs in /tmp/oar_logs/metadata_checker_*.log
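File-based locking of the kind listed above can be sketched with an atomically created lock file. This shows the general technique and is not the ApprovalOperator's actual code; `scheduler_lock` and its `name` and `lock_dir` parameters are hypothetical, with the file name modelled on the /tmp/oar_scheduler_*.lock pattern.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scheduler_lock(name: str, lock_dir: str = tempfile.gettempdir()):
    """Yield True if this process acquired the lock, False if another holder exists."""
    path = os.path.join(lock_dir, f"oar_scheduler_{name}.lock")
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield True
    finally:
        os.close(fd)
        os.remove(path)  # release the lock so later runs are not blocked
```

With this approach a process that is killed before cleanup leaves its lock file behind, which is one way a stale lock can make a background scheduler look stuck.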
# 1. Initialize release tracking
oar -r 4.19.1 create-test-report
# 2. Assign ownership
oar -r 4.19.1 take-ownership -e owner@redhat.com
# 3. Sync bug status (run multiple times during release)
oar -r 4.19.1 update-bug-list
# 4. Verify payload images
oar -r 4.19.1 image-consistency-check
oar -r 4.19.1 image-consistency-check -i <job-id> # Check status
# 5. Validate CVP tests
oar -r 4.19.1 check-greenwave-cvp-tests
# 6. Check CVE coverage
oar -r 4.19.1 check-cve-tracker-bug
# 7. Push to staging
oar -r 4.19.1 push-to-cdn-staging
# 8. Run stage tests
oar -r 4.19.1 stage-testing
oar -r 4.19.1 stage-testing -n <build-number> # Check status
# 9. Verify signatures
oar -r 4.19.1 image-signed-check
# 10. Clean up unverified bugs
oar -r 4.19.1 drop-bugs
# 11. Finalize and approve
oar -r 4.19.1 change-advisory-status
Working with advisories: Start in oar/core/advisory.py → AdvisoryManager class
Understanding bug operations: Check oar/core/operators.py → BugOperator for cross-module orchestration
Modifying CLI commands: Look in oar/cli/ for Click command definitions
Job controller logic: prow/job/controller.py contains both JobController and TestResultAggregator
Notification logic: oar/core/notification.py → MessageHelper for message formatting templates
Shipment/GitLab workflow: oar/core/shipment.py → ShipmentData and GitLabMergeRequest
Currently supports OpenShift versions: 4.12.z through 4.20.z
When adding new version support, update:
- Jira query filters (oar/notificator/jira_notificator.py)
- Job registry configurations
- Test report templates
- Jenkins job parameters (stage-testing)
- Prow job configuration (image-consistency-check)
- ConfigStore config (test template doc ID, Slack group alias)
- Kerberos required for Errata Tool and LDAP access: kinit $kid@$domain
- Bugzilla credentials cached in ~/.config/python-bugzilla/bugzillarc
- GitHub token needs repo scope for private repositories
- All tokens should be kept in secure storage (Bitwarden, environment variables)
- Missing OAR_JWK: ConfigStore will fail to decrypt config
- Expired Kerberos ticket: Errata Tool and LDAP operations will fail
- Stale lock files: Background processes may appear stuck - check /tmp/oar_scheduler_*.lock
- Version format: Release version must be z-stream format (e.g., 4.19.1, not 4.19)
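Two of the pitfalls above can be caught before running a command. The sketch below is a hypothetical preflight helper, not part of the OAR CLI; it checks only the OAR_JWK variable and the version format, and the real validation in util.py may be stricter.

```python
import re
from collections.abc import Mapping

# Three dot-separated numeric components, e.g. 4.19.1 (assumed z-stream shape).
Z_STREAM = re.compile(r"\d+\.\d+\.\d+")


def preflight_problems(release: str, env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    if not env.get("OAR_JWK"):
        problems.append("OAR_JWK is not set: ConfigStore cannot decrypt its config")
    if not Z_STREAM.fullmatch(release):
        problems.append(f"{release!r} is not a z-stream version such as 4.19.1")
    return problems
```

In practice it could be called as `preflight_problems("4.19.1", os.environ)` before invoking `oar`.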
ART Tools (installed from git):
- artcommon - Common ART utilities
- pyartcd - ART CD tooling
- rh-elliott - CVE tracker bug checking
- rh-doozer - Build data management
These are installed automatically from openshift-eng/art-tools repository.
- AGENTS.md - Comprehensive documentation of all agents, CLI commands, and core modules
- README.md - Quick start and installation
- oar/README.md - Additional OAR command details
- docs/ - Additional documentation
When modifying OAR commands:
- Commands are defined in oar/cli/
- Business logic should be in oar/core/ modules
- Complex multi-module operations belong in oar/core/operators.py
- Always use util.log_task_status() for status tracking (auto-updates StateBox via cli_result_callback)
- Add proper exception handling using custom exceptions from oar/core/exceptions.py
- Use StateBox for explicit issue tracking when tasks detect blocking problems
When adding new integrations:
- Create new module in oar/core/
- Follow Manager/Helper pattern
- Integrate with ConfigStore
- Add custom exception types
- Add unit tests in tests/
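The checklist above might translate into a module skeleton like the following. Everything here is hypothetical: `ExampleServiceManager`, `ExampleServiceException`, and `FakeConfigStore` are illustrative names, and the real ConfigStore interface is not shown in this file, so the sketch assumes only a `release` attribute.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


class ExampleServiceException(Exception):
    """Hypothetical exception type; real ones would live in oar/core/exceptions.py."""


@dataclass
class FakeConfigStore:
    """Stand-in for ConfigStore exposing only the field this sketch needs."""
    release: str


class ExampleServiceManager:
    """Hypothetical manager for a new external service."""

    def __init__(self, config_store: FakeConfigStore) -> None:
        self._cs = config_store

    def fetch_status(self, payload: dict) -> str:
        """Return the status for the configured release, wrapping low-level errors."""
        try:
            status = payload[self._cs.release]
        except KeyError as exc:
            raise ExampleServiceException(
                f"no status found for release {self._cs.release}"
            ) from exc
        logger.info("release %s has status %s", self._cs.release, status)
        return status
```

Taking the config object in the constructor keeps the manager easy to unit test in tests/, since a small fake can replace the encrypted ConfigStore.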
For background processes:
- Use file-based locking to prevent duplicates
- Implement proper timeout handling
- Log to dedicated log files
- Clean up resources on exit