A multi-agent LLM orchestration system that reads structured requirements from project management platforms, generates validated structured outputs through a hybrid rule-engine + LLM pipeline, enforces coverage via automated feedback loops, and self-corrects using RAG-powered semantic matching.
Built with Clean Architecture, multi-provider LLM support (OpenAI, Gemini, Anthropic, Ollama), ChromaDB vector search, and a Model Context Protocol (MCP) server for IDE integration.
- AI Engineering Overview
- System Architecture
- Technical Highlights
- Quick Start (5 minutes)
- Docker Setup (Recommended)
- CLI Commands
- Adding Your Project
- Configuration Reference
- Output Files
- Bug Creation
- Board Story Report
- AC Coverage Validation
- ChromaDB Semantic Matching
- MCP Integration (GitHub Copilot)
- Architecture
- Troubleshooting
This system solves a real-world problem — generating comprehensive, validated test cases from natural-language requirements — using a multi-stage AI pipeline rather than a single LLM call.
Manually writing test cases from user stories is slow, inconsistent, and prone to coverage gaps. A single LLM prompt yields hallucinated steps and inconsistent wording, and misses edge cases.
Instead of relying on a single LLM call, the system orchestrates multiple specialized stages — each with its own responsibility — combining deterministic rule engines with LLM intelligence:
Structured Input (ADO/Jira)
│
▼
┌─────────────────────────────┐
│ 1. INGESTION & PARSING │ Platform adapters (ADO, Jira) → domain models
│ NLP analysis (spaCy) │ Story type classification, feature detection
└─────────────┬───────────────┘
▼
┌─────────────────────────────┐
│ 2. DETERMINISTIC GENERATION│ Rule engine: 70+ QA rules, scenario expansion,
│ (No LLM — pure logic) │ edge case generation, platform-specific tests
└─────────────┬───────────────┘
▼
┌─────────────────────────────┐
│ 3. RAG: SEMANTIC MATCHING │ ChromaDB vector search (all-MiniLM-L6-v2)
│ Reference step retrieval│ Retrieve similar steps → enforce consistency
└─────────────┬───────────────┘
▼
┌─────────────────────────────┐
│ 4. LLM CORRECTION │ Multi-provider (OpenAI/Gemini/Anthropic/Ollama)
│ Structured JSON output │ Dynamic prompt construction, JSON schema enforcement
└─────────────┬───────────────┘
▼
┌─────────────────────────────┐
│ 5. VALIDATION & FEEDBACK │ AC coverage gap detection → targeted re-generation
│ Self-correction loop │ Quality gates, forbidden language, structural fixes
└─────────────┬───────────────┘
▼
┌─────────────────────────────┐
│ 6. MULTI-FORMAT EXPORT │ CSV (ADO), Playwright scripts, JSON, QA summaries
│ Platform upload │ ADO test suites, TestRail, MCP server
└─────────────────────────────┘
| Decision | Rationale |
|---|---|
| Hybrid (rules + LLM) instead of pure LLM | Rule engine handles 70% deterministically — LLM refines the remaining 30%. Reduces hallucination, cuts token cost, ensures structural correctness |
| RAG for consistency instead of stateless prompts | ChromaDB stores previously generated steps. New generations retrieve similar steps as few-shot context, producing consistent wording across runs |
| Coverage validation loop instead of single-pass | After generation, the system extracts keywords from each acceptance criterion and checks coverage. Uncovered ACs trigger a targeted LLM call to fill gaps |
| Multi-provider factory instead of hardcoded provider | Factory pattern + YAML config = swap between OpenAI, Gemini, Anthropic, or local Ollama without code changes |
| Structured output enforcement instead of free-text | JSON schema in prompts, response_mime_type for Gemini, truncated JSON repair for robustness |
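The truncated-JSON repair mentioned in the last row can be sketched as follows. This is an illustrative helper, not the project's actual code: it trims back to the last parseable boundary and closes any unbalanced strings, brackets, and braces so a cut-off LLM response still loads.

```python
import json

def repair_truncated_json(text: str) -> dict:
    """Best-effort repair of a JSON object cut off mid-generation (sketch)."""
    text = text.strip()
    # Try progressively shorter prefixes until one can be closed into valid JSON.
    for trim in range(len(text), 0, -1):
        candidate = text[:trim].rstrip().rstrip(",")
        stack = []          # closing delimiters still owed, in order
        in_string = False
        escaped = False
        for ch in candidate:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
        if in_string:
            candidate += '"'          # close a dangling string
        candidate += "".join(reversed(stack))  # close dangling containers
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("unrepairable JSON")
```

For example, a response cut off mid-title such as `{"tests": [{"id": 1, "title": "Login` is closed into a valid object instead of failing the whole generation.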
┌──────────────────────────┐
│ CLI / MCP Server │ Entry points
│ (workflows.py) │ (Typer CLI + MCP)
└────────────┬─────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Generate │ │ Upload │ │ Bug Report │ Workflow
│ Workflow │ │ Workflow │ │ Workflow │ Layer
└──────┬───────┘ └──────┬───────┘ └──────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ CORE SERVICES │
│ │
│ ┌─────────────┐ ┌──────────────────────┐ │
│ │ Test │ │ LLM Orchestration │ │
│ │ Generator │ │ │ │
│ │ (rules + │ │ PromptBuilder │ │
│ │ NLP + │──│ LLMCorrector │ │
│ │ scenarios) │ │ Provider Factory │ │
│ └─────────────┘ │ ┌────┬────┬────┐ │ │
│ │ │GPT │Gem │Anth│ │ │
│ ┌─────────────┐ │ │ │ini │ropi│ │ │
│ │ Embeddings │ │ │ │ │c │ │ │
│ │ (ChromaDB │──│ └────┴────┴────┘ │ │
│ │ RAG) │ └──────────────────────┘ │
│ └─────────────┘ │
│ ┌──────────────────────┐ │
│ ┌─────────────┐ │ Quality Gates │ │
│ │ AC Coverage │──│ Validator │ │
│ │ Validator │ │ Linters │ │
│ └─────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ ADO │ │ Jira │ │ TestRail │ Infrastructure
│ Adapter │ │ Adapter │ │ Adapter │ Layer
└────────────┘ └────────────┘ └────────────┘
LLM Orchestration
- Multi-provider factory pattern: OpenAI, Google Gemini, Anthropic, Ollama — swappable via YAML config
- Dynamic prompt construction: context-aware prompts built from project config, feature type, and RAG results
- Structured JSON output with schema enforcement and truncated JSON repair
- Response caching (`MemoryCache`, `FileCache`) to minimize redundant API calls
- Cost tracking via `MetricsCollector` and `CostCalculator`
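The multi-provider factory from the first bullet can be sketched like this. The names (`EchoProvider`, `create_provider`, the registry keys' constructors) are illustrative stand-ins; the real providers wrap the vendor SDK clients.

```python
from dataclasses import dataclass
from typing import Protocol

class LLMProvider(Protocol):
    """Contract every provider implements (cf. ILLMProvider in core/interfaces/)."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class EchoProvider:
    """Stand-in provider for illustration; real ones call an LLM API."""
    name: str
    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

# Registry maps the YAML `llm_provider` value to a constructor.
_REGISTRY = {
    "openai": lambda cfg: EchoProvider("openai"),
    "gemini": lambda cfg: EchoProvider("gemini"),
    "anthropic": lambda cfg: EchoProvider("anthropic"),
    "ollama": lambda cfg: EchoProvider("ollama"),
}

def create_provider(config: dict) -> LLMProvider:
    """YAML config selects the provider; swapping requires no code changes."""
    name = config.get("llm_provider", "openai")
    try:
        return _REGISTRY[name](config)
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None
```

Because callers depend only on the `LLMProvider` protocol, switching from OpenAI to a local Ollama model is a one-line YAML change.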
RAG Pipeline (ChromaDB)
- Sentence embeddings via `all-MiniLM-L6-v2` (384-dim) for semantic step matching
- Persistent vector store with distance-based similarity threshold (< 1.5)
- Retrieved reference steps injected as few-shot context into LLM correction prompts
- Feedback loop: each generation embeds new steps → future queries return richer context
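The few-shot injection in the third bullet has roughly this shape. The prompt template below is a made-up sketch of the technique, not the project's actual `PromptBuilder` output.

```python
def build_correction_prompt(test_steps: list[str], reference_steps: list[str]) -> str:
    """Assemble an LLM correction prompt with retrieved steps as few-shot context."""
    refs = "\n".join(f"- {s}" for s in reference_steps)
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(test_steps, 1))
    return (
        "You are an expert QA engineer. Rewrite the steps below so their wording\n"
        "matches the reference steps retrieved from earlier generations.\n\n"
        f"Reference steps (canonical wording):\n{refs}\n\n"
        f"Steps to correct:\n{steps}\n\n"
        'Return JSON only: {"steps": ["..."]}'
    )
```

The retrieved references act as exemplars, so "Click the Save button" and "Press Save" converge on one canonical phrasing across runs.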
Self-Correction & Validation
- AC coverage validation: keyword extraction from acceptance criteria → coverage check → targeted gap-filling LLM call for uncovered ACs
- Quality gates: 70+ rules (forbidden language, structural integrity, ID sequencing, accessibility compliance)
- Iterative correction: rule-based pre-pass → LLM refinement → post-validation
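Two of the quality gates above (forbidden language, ID sequencing) can be sketched as a validator pass. The rule set shown is a tiny illustrative subset of the real 70+ rules.

```python
def validate_tests(tests: list[dict]) -> list[str]:
    """Return violations for two illustrative gates: forbidden language, ID sequencing."""
    forbidden = ("if available", "if supported")  # small subset of the real rule set
    issues = []
    for expected_id, test in enumerate(tests, 1):
        # Gate 1: test IDs must be a contiguous 1..N sequence.
        if test["id"] != expected_id:
            issues.append(f"ID sequencing: expected {expected_id}, got {test['id']}")
        # Gate 2: steps must not hedge with conditional language.
        for step in test["steps"]:
            for phrase in forbidden:
                if phrase in step.lower():
                    issues.append(f"Test {test['id']}: forbidden phrase '{phrase}'")
    return issues
```

A rule-based pre-pass like this catches deterministic defects cheaply, so the LLM refinement step only has to handle the harder semantic fixes.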
NLP & Feature Intelligence
- spaCy-based semantic parsing for acceptance criteria analysis
- Multi-label feature type classification (input, navigation, display, object manipulation, calculation)
- Story type classification (Tool, Dialog, Menu, File Operations) for context-aware generation
- Entry point auto-detection: maps features to correct UI locations
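The multi-label classification above can be sketched with keyword heuristics. This is a deliberately simplified stand-in; the real system classifies from spaCy parses, and the keyword sets here are invented for illustration.

```python
# Keyword heuristics per feature type (illustrative stand-in for spaCy-based parsing).
FEATURE_KEYWORDS = {
    "input": {"enter", "type", "field", "form", "validate"},
    "navigation": {"menu", "navigate", "tab", "open"},
    "display": {"show", "display", "render", "preview"},
    "calculation": {"calculate", "sum", "total", "compute"},
}

def classify_feature(text: str) -> list[str]:
    """Multi-label classification: return every type whose keywords appear."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return sorted(t for t, kws in FEATURE_KEYWORDS.items() if words & kws)
```

A story can legitimately carry several labels (a search form is both `input` and `navigation`), which is why the downstream generator treats the result as a set rather than a single category.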
Software Engineering
- Clean Architecture: interfaces (`core/interfaces/`), domain models, use cases, infrastructure adapters
- Repository pattern with platform-agnostic factories (ADO, Jira, TestRail)
- Dependency injection via project configuration (YAML → dataclasses)
- MCP server exposing all workflows to GitHub Copilot / Claude Code
- Docker support for reproducible environments
- Playwright test script generation (LLM-based with deterministic fallback)
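The YAML → dataclasses injection mentioned above can be sketched like this. The class and field names are illustrative, not the project's actual models, and `raw` stands in for the dict a YAML parser returns.

```python
from dataclasses import dataclass, field

@dataclass
class AdoConfig:
    organization: str = ""
    project: str = ""
    area_path: str = ""

@dataclass
class ProjectConfig:
    project_id: str
    llm_provider: str = "openai"
    ado: AdoConfig = field(default_factory=AdoConfig)
    unavailable_features: list = field(default_factory=list)

def load_project(raw: dict) -> ProjectConfig:
    """Build a typed config from a parsed-YAML dict; services receive this
    object instead of reading YAML themselves (dependency injection)."""
    return ProjectConfig(
        project_id=raw["project_id"],
        llm_provider=raw.get("llm_provider", "openai"),
        ado=AdoConfig(**raw.get("ado", {})),
        unavailable_features=list(raw.get("unavailable_features", [])),
    )
```

Typed configs surface missing or misspelled keys at load time rather than deep inside a workflow.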
The sections below describe the QA domain this system operates in — how it's used, configured, and integrated with Azure DevOps.
This framework automatically generates comprehensive test cases by:
- Reading user stories and acceptance criteria from Azure DevOps (or Jira)
- Understanding your application context from project configuration (YAML)
- Generating test cases using a hybrid rule-engine + LLM pipeline
- Matching against previously generated steps via ChromaDB for consistent wording
- Correcting test quality with LLM enhancement (structural fixes, forbidden language, accessibility)
- Validating AC coverage — auto-detects gaps and generates missing tests
- Exporting to ADO-compatible CSV format or uploading directly
- Project-agnostic: Works with any application (desktop, web, mobile, hybrid)
- Multi-provider AI: Supports OpenAI, Gemini, Anthropic, and Ollama for test generation
- Context-aware: Generates relevant tests based on feature type (no input tests for menus!)
- ChromaDB semantic matching: Reference steps from previous generations ensure consistent wording
- AC coverage validation: Automatically detects missing acceptance criteria coverage and generates gap-filling tests
- Multi-platform: Generates accessibility tests for all supported platforms (Windows 11, iPad, Android Tablet)
- ADO Integration: Direct upload to Azure DevOps test suites + bug creation + board reporting
# Clone the repository
git clone <repository-url>
cd test_gen
# Create virtual environment (Python 3.10 required)
python3.10 -m venv venv310
source venv310/bin/activate # On Windows: venv310\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy example environment file
cp .env.example .env
# Edit .env with your credentials
Required environment variables:
# Azure DevOps (Required)
ADO_PAT=your_personal_access_token_here
# LLM Provider (at least one required for LLM correction)
OPENAI_API_KEY=sk-your-api-key-here # OpenAI
GEMINI_API_KEY=your-gemini-key-here # Google Gemini (alternative)
ANTHROPIC_API_KEY=your-anthropic-key-here # Anthropic (alternative)
# List available projects
python workflows.py list-projects
# Generate tests for a story
python workflows.py generate --story-id 272889
Output: Test cases saved to the output/ folder as CSV, JSON, and objectives files.
Docker provides a consistent environment across all team members - no Python version conflicts or dependency issues.
- Install Docker Desktop
- Get the `.env` file (contains ADO/OpenAI credentials)
# 1. Build the image (first time only, ~3 min)
docker build -t test-gen:v1 .
# 2. Verify it works
docker run test-gen:v1 --help
# 3. Run with credentials and output volume
docker run --env-file .env -v $(pwd)/output:/app/output \
test-gen:v1 generate --story-id 272889
# Generate test cases
docker run --env-file .env -v $(pwd)/output:/app/output \
test-gen:v1 generate --story-id 272889
# Generate + Upload to ADO (dry run)
docker run --env-file .env -v $(pwd)/output:/app/output \
test-gen:v1 upload --story-id 272889 --dry-run
# Generate + Upload to ADO (live)
docker run --env-file .env -v $(pwd)/output:/app/output \
test-gen:v1 upload --story-id 272889
# List projects
docker run test-gen:v1 list-projects
# Update objectives
docker run --env-file .env -v $(pwd)/output:/app/output \
test-gen:v1 update-objectives --story-id 272889
Add to your ~/.bashrc or ~/.zshrc:
alias testgen='docker run --env-file .env -v $(pwd)/output:/app/output test-gen:v1'
# Then simply run:
# testgen generate --story-id 272889
# testgen upload --story-id 272889 --dry-run
| Aspect | Docker | Local Python |
|---|---|---|
| Setup time | ~3 min (one command) | ~10 min (venv, pip, spacy model) |
| Works on | Any machine with Docker | Requires Python 3.10 |
| Dependencies | Isolated in container | May conflict with other projects |
| Team consistency | Identical for everyone | "Works on my machine" issues |
# If you modify the code, rebuild the image
docker build -t test-gen:v1 .
# Docker caches layers, so rebuilds are fast if only code changed
All commands use: python workflows.py <command> [options]
For detailed CLI documentation, see docs/CLI_REFERENCE.md.
| Command | Description | Example |
|---|---|---|
| `list-projects` | Show all configured projects | `python workflows.py list-projects` |
| `generate` | Generate test cases locally | `python workflows.py generate --story-id 272889` |
| `upload` | Generate AND upload to ADO | `python workflows.py upload --story-id 272889` |
| `upload-existing` | Upload existing tests to ADO | `python workflows.py upload-existing --story-id 272889` |
| `init-project` | Create new project config | `python workflows.py init-project --name "MyApp"` |
| `discover` | Auto-discover project settings | `python workflows.py discover --story-ids 123 456` |
| `create-bug` | Create formatted ADO bug report | `python workflows.py create-bug --file bugs/my_bug.txt` |
# 1. List available projects
python workflows.py list-projects
# 2. Generate tests (saves to output/ folder)
python workflows.py generate --story-id 272889
# 3. Generate tests for specific project
python workflows.py generate --story-id 272889 --project mediapedia-us
# 4. Generate tests WITHOUT LLM correction (faster)
python workflows.py generate --story-id 272889 --skip-correction
# 5. Generate AND upload to Azure DevOps
python workflows.py upload --story-id 272889
# 6. Preview upload without actually uploading (dry run)
python workflows.py upload --story-id 272889 --dry-run
# 7. Upload EXISTING tests (skip generation, use files in output/)
python workflows.py upload-existing --story-id 272889
# 8. Initialize a new project
python workflows.py init-project --name "MyApp" --org myorg --ado-project MyProject
# 9. Create a bug report (local preview)
python workflows.py create-bug --file bugs/my_bug.txt
# 10. Create a bug report with dry-run (preview ADO fields without uploading)
python workflows.py create-bug --file bugs/my_bug.txt --upload --dry-run
# 11. Create a bug report and upload to ADO
python workflows.py create-bug --file bugs/my_bug.txt --upload
# 12. Create a bug and link to parent story
python workflows.py create-bug --file bugs/my_bug.txt --upload --story-id 272261
cp projects/configs/example-web-app.yaml projects/configs/my-project.yaml
Open projects/configs/my-project.yaml and customize:
project_id: my-project # Unique identifier
application:
name: My Application
description: Description of your app
type: web # Options: desktop, web, mobile, hybrid
# Test step templates (use {app_name} as placeholder)
prereq_template: "Pre-req: User is logged into {app_name}"
launch_step: "Navigate to {app_name} homepage."
launch_expected: "Homepage loads with navigation menu visible."
close_step: "Log out from {app_name}"
# UI areas in your application (used for test titles)
ui_surfaces:
- Dashboard
- Navigation Menu
- Settings Page
- User Profile
- Modal Dialog
# How users access features (keyword -> UI location)
entry_point_mappings:
search: Navigation Menu
settings: User Profile
export: Dashboard
import: Dashboard
# Platforms your app supports (generates accessibility tests)
platforms:
- Windows 11
- Chrome (macOS)
- Safari (iOS)
# IMPORTANT: Features your app does NOT support
# Prevents generating impossible test scenarios
unavailable_features:
- offline mode
- multi-select
- bulk delete
# Azure DevOps settings
ado:
organization: your-org
project: YourProject
area_path: "YourProject\\QA Team"
assigned_to: qa-engineer@company.com
default_state: Design
# Test generation rules
rules:
forbidden_words:
- "or / OR"
- "if available"
- "if supported"
allowed_areas:
- Dashboard
- Navigation Menu
- Settings Page
# LLM settings (provider options: openai, gemini, anthropic, ollama)
llm_enabled: true
llm_provider: gemini # or openai, anthropic, ollama
llm_model: gemini-2.0-flash # model name for chosen provider
In .env, set your project as default:
DEFAULT_PROJECT=my-project
# Verify project loads correctly
python workflows.py list-projects
# Test with a story
python workflows.py generate --story-id 123456 --project my-project
| Type | Description | Example |
|---|---|---|
| `desktop` | Native desktop apps (Windows, macOS) | CAD software, IDEs |
| `web` | Browser-based applications | SaaS platforms, dashboards |
| `mobile` | iOS/Android apps | Mobile banking, social apps |
| `hybrid` | Cross-platform/enterprise apps | CRM systems, enterprise tools |
The framework generates platform-specific accessibility tests:
| Platform | Accessibility Tool | Test Type |
|---|---|---|
| Windows 11 | Accessibility Insights | Keyboard navigation |
| macOS | VoiceOver | Keyboard + screen reader |
| iPad/iOS | VoiceOver | Swipe gestures |
| Android | Accessibility Scanner | Touch + TalkBack |
| Chrome/Web | Screen reader (NVDA/JAWS) | ARIA, keyboard |
The framework supports multiple LLM providers via a factory pattern. Set `llm_provider` in your YAML config or `.env`:
| Provider | Config Value | Env Variable | Models |
|---|---|---|---|
| OpenAI | `openai` | `OPENAI_API_KEY` | `gpt-4o-mini`, `gpt-4o` |
| Google Gemini | `gemini` | `GEMINI_API_KEY` | `gemini-2.0-flash`, `gemini-1.5-pro` |
| Anthropic | `anthropic` | `ANTHROPIC_API_KEY` | `claude-sonnet-4-5-20250929` |
| Ollama (local) | `ollama` | N/A | Any local model |
YAML config `llm_provider` takes precedence over `.env` defaults. API keys are always resolved from environment variables.
The framework automatically detects feature types and generates appropriate tests:
| Feature Type | Generates | Does NOT Generate |
|---|---|---|
| Navigation (menus) | Visibility, keyboard access | Input validation, boundaries |
| Input (forms) | Validation, boundaries, errors | N/A |
| Display (viewers) | Content display, formatting | Input tests |
| Object manipulation | Undo/redo, state changes | Multi-select (if unavailable) |
After running `generate`, files are saved to `output/`:
| File | Description | Use Case |
|---|---|---|
| `*_HYBRID_TESTS.csv` | ADO-compatible test cases | Import to Azure DevOps |
| `*_HYBRID_OBJECTIVES.txt` | Test objectives with HTML | Copy to test case objectives |
| `*_HYBRID_DEBUG.json` | Full generation data | Debugging, review |
The CSV follows Azure DevOps import format:
| Column | Description |
|---|---|
| ID | Leave empty (ADO assigns) |
| Work Item Type | Always "Test Case" |
| Title | {StoryID}-{TestID}: Feature / Area / Scenario |
| TestStep | Step number (1, 2, 3...) |
| Step Action | What to do |
| Step Expected | Expected result |
| Area Path | ADO area path |
| AssignedTo | QA engineer email |
| State | Default: "Design" |
Create formatted ADO Bug work items from structured .txt files following the ENV Drawing Bug Template.
Create a .txt file (see bugs/sample_bug.txt for a complete example):
TITLE: DRAW: Feature Name / Brief Description
SEVERITY: 2 - High
STORY_ID: 272261
ISSUE: One sentence describing what is wrong.
ADDITIONAL_INFO:
- Regression from build 3.2.1
- WCAG 2.1 AA 1.3.1 (for accessibility bugs)
ATTACHMENTS:
- screenshot.png
- video.mp4
STEPS:
1. Launch the ENV QuickDraw application.
2. Navigate to the affected area.
3. Perform the action.
a. Observation text << NOT EXPECTED (see attached screenshot.png)
i. Expected: What should happen instead.
ii. Expected: Additional expected behavior.
SYSTEM_INFO:
- OS: Windows 11 Pro 23H2
- App Version: ENV QuickDraw 3.2.4
| Type | Format |
|---|---|
| Normal bug | DRAW: Feature / Brief Description |
| WCAG/Accessibility | DRAW: WCAG Accessibility Errors / Feature / Error |
# Preview locally (saves HTML to output/)
python workflows.py create-bug --file bugs/my_bug.txt
# Dry run — preview what would be uploaded without creating in ADO
python workflows.py create-bug --file bugs/my_bug.txt --upload --dry-run
# Upload to ADO (creates Bug work item, returns URL)
python workflows.py create-bug --file bugs/my_bug.txt --upload
# Upload and link to parent story
python workflows.py create-bug --file bugs/my_bug.txt --upload --story-id 272261
The formatter produces ADO HTML matching the ENV Drawing Bug Template:
- ISSUE: — One sentence, same line as heading
- ADDITIONAL INFORMATION: — WCAG refs, regression notes
- SUPPORTING DOCUMENTATION PROVIDED: — Bulleted attachment filenames
- RECREATE STEPS: — Numbered steps with `<< NOT EXPECTED` marker (yellow highlight)
- TRIAGE/CAUSE INFORMATION: — Empty (for development)
- FIX SUMMARY: — Empty (for development)
Generate a CSV summary of all user stories from specific ADO board columns, with test case counts.
python scripts/fetch_board_stories.py
Output: output/board_stories_summary.csv with columns:
| Column | Description |
|---|---|
| User Story Title | Story ID and title |
| # Test Cases | Count of linked test cases (via TestedBy relations + test suites) |
| Tablet Testing Needed | Left empty for dev team to fill in |
The script queries stories from Most Wanted, Development, and Quality Assurance board columns, filtered by area path. Excludes [Out of Scope] stories.
The LLM correction pipeline automatically validates that every acceptance criterion (AC) has at least one test case covering it. If gaps are detected, it generates targeted gap-filling tests.
- Keyword extraction from each AC (strips stop words, punctuation)
- Keyword matching against test case text (title + objective + steps) at 40% threshold with minimum 2 keyword hits
- Gap detection — ACs with zero matching test cases are flagged
- Targeted LLM call — generates 1-2 tests per uncovered AC
- Structural fixes — ensures generated tests have PRE-REQ, launch, close steps
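The matching rule from the first three steps (40% keyword threshold, minimum 2 hits) can be sketched as follows. The stop-word list is an illustrative stub; the real extractor is richer.

```python
import re

STOP_WORDS = {"the", "a", "an", "to", "of", "is", "and", "for", "in", "on", "with"}

def keywords(text: str) -> set[str]:
    """Lowercase tokens with stop words stripped (illustrative extractor)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def uncovered_acs(acs: list[str], tests: list[str]) -> list[str]:
    """Flag ACs where no test matches >= 40% of the AC's keywords (min 2 hits)."""
    gaps = []
    for ac in acs:
        kws = keywords(ac)
        covered = any(
            len(kws & keywords(t)) >= max(2, 0.4 * len(kws)) for t in tests
        )
        if kws and not covered:
            gaps.append(ac)
    return gaps
```

Each flagged AC then seeds a targeted LLM call, so gap-filling tokens are spent only where coverage is actually missing.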
AC coverage: All 12 ACs covered by existing tests
Or when gaps are found:
Warning: 1 AC(s) have no test coverage:
AC 11: Undo/Redo applies to rename, visibility, lock, order, delete (limit 50)
→ Generating tests for 1 uncovered AC(s)...
→ Added 3 gap-filling test(s)
This runs automatically during generate and upload workflows. No extra flags needed.
Previously generated test steps are stored in ChromaDB (vector database) and used as reference during LLM correction. This ensures consistent wording across test generations for the same story.
- Auto-embeds using the `all-MiniLM-L6-v2` sentence transformer
- Distance metric: lower = more similar (0.2 very similar, 1.5+ less similar)
- Persistent storage in the `./db/` folder
- To regenerate cleanly, delete the story's steps from ChromaDB before re-running
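Applying the distance threshold to a retrieval result looks roughly like this. The helper name is made up; the input dict mirrors the nested-list shape ChromaDB returns from `collection.query` for a single query text.

```python
def filter_references(query_result: dict, max_distance: float = 1.5) -> list[str]:
    """Keep only retrieved steps whose distance falls under the threshold.

    For one query, ChromaDB returns parallel lists wrapped in an outer list,
    e.g. {"documents": [[...]], "distances": [[...]]}.
    """
    docs = query_result["documents"][0]
    dists = query_result["distances"][0]
    return [doc for doc, dist in zip(docs, dists) if dist < max_distance]
```

Steps past the threshold are dropped rather than injected as references, so weak matches never steer the LLM correction prompt.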
Use the framework directly from GitHub Copilot Chat.
- Open `.vscode/mcp.json` in VS Code
- Click the "Start" button to launch the MCP server
- In Copilot Chat, use Agent mode (`@workspace`)
"generate tests for story 272889"
"upload tests for story 272889"
"check story 272889"
"create bug from bugs/my_bug.txt"
"list projects"
Edit .vscode/mcp.json:
{
"servers": {
"test-gen": {
"type": "stdio",
"command": "/path/to/venv310/bin/python",
"args": ["/path/to/integrations/mcp_server.py"]
}
}
}
The project follows Clean Architecture principles with all business logic centralized in `core/`.
test_gen/
├── workflows.py # Main CLI entry point
├── requirements.txt # Python dependencies
├── .env # Environment configuration
│
├── projects/ # Multi-project support
│ ├── configs/ # YAML project configurations
│ │ ├── env-quickdraw.yaml
│ │ └── example-web-app.yaml
│ ├── project_config.py # Configuration loader
│ └── project_manager.py # Project management
│
├── bugs/ # Bug report input files
│ └── sample_bug.txt # Example bug template
│
├── core/ # ALL business logic (Clean Architecture)
│ ├── application/use_cases/ # Use case implementations
│ │ ├── bug_parser.py # Parse .txt bug files
│ │ └── bug_formatter.py # Format bugs to ADO HTML
│ ├── config/ # App configuration
│ │ └── environment.py # Environment variables
│ ├── domain/ # Domain models
│ │ ├── models.py # UserStory, TestCase, etc.
│ │ └── bug_report.py # BugReport, RecreateStep
│ ├── interfaces/ # Contracts (protocols)
│ │ ├── llm_provider.py # ILLMProvider interface
│ │ ├── repository.py # IStoryRepository, ITestSuiteRepository, etc.
│ │ └── vector_store.py # IVectorStore interface
│ └── services/ # ALL services centralized here
│ ├── test_generator.py # Main test generation
│ ├── objective_service.py # Objective generation
│ ├── summary_service.py # QA summary generation
│ ├── test_validator.py # QualityGate validation
│ ├── embeddings/ # Vector embeddings
│ │ └── test_step_embedder.py # ChromaDB step embedding
│ ├── nlp/ # NLP parsing (spaCy)
│ │ ├── spacy_parser.py
│ │ └── hybrid_parser.py
│ ├── quality/ # Quality analysis
│ │ ├── quality_analyzer.py
│ │ └── test_corrector.py
│ ├── linting/ # Evidence-based linting
│ │ ├── summary_linter.py
│ │ └── objective_linter.py
│ └── llm/ # LLM providers & prompts
│ ├── corrector.py # LLM correction + AC coverage validation
│ ├── prompt_builder.py # Dynamic prompt generation
│ ├── factory.py # LLM provider factory
│ ├── openai_provider.py
│ ├── gemini_provider.py
│ └── anthropic_provider.py
│
├── infrastructure/ # External services (adapters)
│ ├── ado/ # Azure DevOps client
│ │ ├── http_client.py # Low-level ADO HTTP client
│ │ ├── ado_repository.py # ADO API wrapper (stories, test cases, suites)
│ │ └── ado_bug_repository.py # ADO Bug creation
│ ├── vector_db/ # Vector database
│ │ └── chroma_repository.py # ChromaDB implementation
│ └── export/ # Export generators
│ ├── csv_generator.py
│ └── objective_generator.py
│
├── integrations/ # External tool integrations
│ └── mcp_server.py # GitHub Copilot MCP server
│
├── scripts/ # Utility scripts
│ └── fetch_board_stories.py # ADO board story report
│
├── tests/ # Unit & integration tests
│ ├── unit/
│ └── integration/
│
└── output/ # Generated files
└── *.csv, *.json, *.txt
- Verify story ID exists in Azure DevOps
- Check the `ADO_PAT` token has read permissions
- Ensure ADO organization/project in config matches story location
- Add `OPENAI_API_KEY=sk-...` to the `.env` file
- Ensure key is valid and has credits
- Run `python workflows.py list-projects` to see available projects
- Check the YAML file exists in `projects/configs/`
- Verify `project_id` in the YAML matches what you're using
- Use the `--skip-correction` flag for faster generation (lower quality)
- Consider using `gpt-4o-mini` instead of `gpt-4o` in config
- Gemini `gemini-2.0-flash` is a fast, cost-effective alternative
- Gemini free tier has strict daily rate limits
- Upgrade to a paid API key or switch to `openai` in your YAML config
- Bad reference steps from previous generations can affect new outputs
- Delete the `./db/` folder to clear all stored embeddings, or
- Re-generate the story to overwrite stale references
- Ensure VS Code version is 1.102+
- Open `.vscode/mcp.json` and click "Start"
- Use Agent mode in Copilot Chat
- Try: `@workspace generate tests for story 272889`
# Check Python version (3.10 required for spaCy)
python --version
# If wrong version, create venv with specific Python
python3.10 -m venv venv310
# Ensure venv is activated
source venv310/bin/activate
# Reinstall dependencies
pip install -r requirements.txt
"Cannot connect to Docker daemon"
- Ensure Docker Desktop is running (check system tray/menu bar)
"No such file: .env"
- Copy the `.env` file to the project root: `cp /path/to/.env .`
- Never commit `.env` to git
.envto git
"Permission denied" on output folder
sudo chown -R $(whoami) output/
Need to debug inside container?
docker run -it --entrypoint /bin/bash test-gen:v1
# Now you're inside the container
Container runs but can't connect to ADO
- Verify `.env` has the correct `ADO_PAT` value
- Check the `--env-file .env` flag is included in the command
For issues or questions:
- Check the Troubleshooting section
- Review example configurations in `projects/configs/`
- Check existing test output in `output/` for reference
v6.0 — Phase 2: Intelligent Coverage & Semantic Matching
- AC coverage validation — automatically detects missing acceptance criteria coverage and generates gap-filling tests via targeted LLM call
- ChromaDB semantic matching — stores previously generated test steps as reference embeddings for consistent wording across generations
- Gemini provider — Google Gemini (Flash + Pro) support via the `google-genai` SDK with JSON response mode
- LLM factory pattern — provider-agnostic architecture (OpenAI, Gemini, Anthropic, Ollama) with YAML config override
- Board story report (`scripts/fetch_board_stories.py`) — fetches user stories from ADO board columns with test case counts
- Enhanced LLM corrector — structural fixes (PRE-REQ, launch, close steps), forbidden language cleanup, accessibility test auto-generation
- Bug creation command (`create-bug`) — create ADO Bug work items from structured `.txt` files
- Multi-provider LLM — initial OpenAI and Anthropic support
- Anti-hallucination guardrails — LLM-generated tests are grounded in acceptance criteria
- `--dry-run` flag for bug creation — preview without uploading to ADO
- Docker support for consistent team environments
- MCP integration — use from GitHub Copilot Chat
- Project-agnostic framework with YAML configuration
- Enhanced LLM prompts (expert QA engineer persona)
- `update-objectives` workflow now fetches directly from ADO (no CSV required)