Complete command reference for running baseline and Atlas evaluations against the final clean dataset (1,200 conversations).
Before running any evaluation, ensure you have completed the setup:
- ✅ Followed `docs/SETUP_GUIDE.md` completely
- ✅ Virtual environment activated
- ✅ All dependencies installed (including Atlas SDK with modification)
- ✅ PostgreSQL databases running (`crm_sandbox` and `atlas`)
- ✅ `.env` file configured with all API keys and database credentials
- ✅ Smoke tests passed (see Phase 1 below)
Quick Setup Check:

```bash
# Verify environment
source venv/bin/activate
set -a; source .env; set +a

# Verify dataset exists
ls -lh artifacts/deterministic/final_conversations_final_clean.jsonl

# Run smoke test (5 conversations)
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --sample 5 \
  --output artifacts/evaluation/smoke_test.jsonl
```

Ensure `.env` contains all required credentials:
```bash
# LLM API Keys (REQUIRED)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...

# Postgres CRM Backend (REQUIRED)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=crm_sandbox
DB_USER=crm_user
DB_PASSWORD=crm_password

# Atlas Storage (REQUIRED for Atlas evaluations)
STORAGE__DATABASE_URL=postgresql://atlas:atlas@localhost:5433/atlas
```

Load the environment before every evaluation run:
```bash
set -a
source .env
set +a
```

Note: If you haven't completed setup, see `docs/SETUP_GUIDE.md` for complete instructions.
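A quick programmatic check can complement the shell commands above. This is a minimal sketch, not a script in the repository; the variable names are taken from the `.env` template:

```python
import os

# Required variables, per the .env template above.
REQUIRED_VARS = [
    "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY",
    "DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD",
    "STORAGE__DATABASE_URL",  # only needed for Atlas evaluations
]

def missing_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing required variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```

Run it after `source .env` to fail fast instead of discovering a missing key hours into a run.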
Purpose: Verify the LLM judge evaluates task completion (goal achievement), not process matching.

Command:

```bash
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --sample 5 \
  --output artifacts/evaluation/baseline_smoke_claude.jsonl \
  --temperature 0.0 \
  --max-output-tokens 800
```

Verification Steps:
- Check the output JSONL for `judge_used`, `judge_pass`, `judge_score`, and `judge_rationale` fields
- Verify the judge rationale focuses on goal achievement, not exact tool matching
- Confirm `overall_success` reflects task completion, not process adherence
- Review sample judge evaluations manually
- Verify progress logging shows every conversation with running success rate
Expected Output:
- 5 conversations executed
- Judge used when exact match fails but execution succeeds
- Judge scores reflect goal achievement
- Success rate calculated from task completion
- Progress logged for each conversation with running success rate
Progress Logging: Each conversation will be logged in the format:

```
[1/5] Conversation: SKEL-... | Success: ✓ | Running Success Rate: 100.0% (1/1) | ETA: 00:02:30
```
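The running success rate and ETA in these log lines are simple arithmetic over conversations completed so far. A sketch of how such a line can be derived (the helper and its exact field layout are illustrative, not the repository's implementation):

```python
import datetime

def progress_line(index, total, conv_id, success, successes, elapsed_s):
    """Format one progress-log line with running success rate and ETA."""
    rate = successes / index * 100          # success rate over completed conversations
    eta_s = int(elapsed_s / index * (total - index))  # assumes the average pace continues
    eta = str(datetime.timedelta(seconds=eta_s))
    mark = "✓" if success else "✗"
    return (f"[{index}/{total}] Conversation: {conv_id} | Success: {mark} | "
            f"Running Success Rate: {rate:.1f}% ({successes}/{index}) | ETA: {eta}")
```

The ETA is a straight-line extrapolation, so it stabilizes only after the first few conversations.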
Purpose: Verify Atlas learning loop is active, learning persists, and judge evaluates task completion.
Command:

```bash
python3 scripts/evaluate_atlas_learning_loop.py
```

Note: This script runs 5 scenarios. Verify:
- Learning state grows across scenarios
- Learning persists to Postgres database
- Learning re-injects into subsequent sessions
- Judge evaluates based on task completion
- Session rewards track correctly
- Progress logged for each scenario with running success rate
Expected Output:
- 5 scenarios executed sequentially
- Learning state increases across scenarios
- Database verification passes
- Learning re-injection verified
- Progress logged for each scenario with running success rate
Progress Logging: Each scenario will display:

```
SCENARIO 2/5: SKEL-...
Running Success Rate: 50.0% (1/1)
...
Running Success Rate: 50.0% (1/2)
```
Command:

```bash
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --output artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl \
  --temperature 0.0 \
  --max-output-tokens 800
```

Progress Tracking:
- Enhanced logging: every conversation logged with running success rate
- Real-time success rate calculation: `successes / total * 100`
- Monitor judge usage: count `judge_used: true` in results
- ETA calculated based on the current rate
Expected Duration: ~4-6 hours for 1,200 conversations
Output Format:

```
[1/1200] Conversation: SKEL-... | Success: ✓ | Running Success Rate: 100.0% (1/1) | ETA: 04:30:00
[2/1200] Conversation: SKEL-... | Success: ✗ | Running Success Rate: 50.0% (1/2) | ETA: 04:28:30
...
```
Command:

```bash
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent gpt4.1 \
  --model gpt-4.1 \
  --backend postgres \
  --output artifacts/evaluation/baseline_gpt4_1.jsonl \
  --temperature 0.0 \
  --max-output-tokens 800
```

Expected Duration: ~4-6 hours for 1,200 conversations
Command:

```bash
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent gpt4.1 \
  --model gpt-4.1-mini \
  --backend postgres \
  --output artifacts/evaluation/baseline_gpt4_1_mini.jsonl \
  --temperature 0.0 \
  --max-output-tokens 800
```

Expected Duration: ~4-6 hours for 1,200 conversations
Command:

```bash
python3 scripts/run_atlas_evaluation.py \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --config configs/atlas/crm_harness.yaml \
  --output-dir artifacts/evaluation/atlas_full
```

Note: The wrapper script `scripts/run_atlas_evaluation.py` provides a CLI interface for the `run_atlas_baseline` function.
Progress Tracking:
- Enhanced logging: Every conversation logged with running success rate
- Monitor learning state growth across sessions
- Track session rewards and success rates
- Real-time success rate calculation displayed
Expected Duration: ~6-8 hours for 1,200 conversations (with learning overhead)
Output Files:

- `artifacts/evaluation/atlas_full/sessions.jsonl` - All Atlas session results
- `artifacts/evaluation/atlas_full/metrics.json` - Aggregated metrics
- `artifacts/evaluation/atlas_full/tasks.jsonl` - Task payloads
Note: The current evaluation focuses on Baseline and Atlas Runtime phases. The "Atlas + GKD" phase mentioned in the case study (distillation with Atlas Core) is a future enhancement and not included in this evaluation run.
Command:

```bash
python3 scripts/analyze_evaluation_results.py \
  --baseline-claude artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl \
  --baseline-gpt4 artifacts/evaluation/baseline_gpt4_1.jsonl \
  --baseline-gpt4mini artifacts/evaluation/baseline_gpt4_1_mini.jsonl \
  --atlas-sessions artifacts/evaluation/atlas_full/sessions.jsonl \
  --output-report artifacts/evaluation/evaluation_report.md \
  --output-json artifacts/evaluation/evaluation_summary.json
```

Output Files:
- Console: formatted summary printed to stdout
- `artifacts/evaluation/evaluation_report.md`: detailed markdown report with tables and analysis
- `artifacts/evaluation/evaluation_summary.json`: JSON summary data for further processing
Metrics Calculated:
- Task success rate (conversation-level)
- Turn-level success rate
- Judge usage statistics
- Token usage and cost estimates
- Atlas-specific: learning growth, reward trends, cue hits, action adoptions, token usage
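As a sanity check on the generated report, the headline numbers can be recomputed directly from any results JSONL. A sketch, assuming each line carries boolean `overall_success` and `judge_used` fields as described in the smoke-test verification steps (field names in your output may differ):

```python
import json

def summarize(path):
    """Recompute conversation-level success rate and judge usage from a results JSONL."""
    total = successes = judged = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            successes += bool(rec.get("overall_success"))
            judged += bool(rec.get("judge_used"))
    return {
        "total": total,
        "success_rate": successes / total * 100 if total else 0.0,
        "judge_usage_rate": judged / total * 100 if total else 0.0,
    }
```

If these numbers disagree with `evaluation_summary.json`, inspect a few raw records before trusting either figure.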
Terminal 1: Run evaluation

```bash
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --output artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl
```

Terminal 2: Monitor progress (optional)

```bash
python3 scripts/monitor_evaluation_progress.py \
  --input artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl \
  --update-interval 5
```

- ✅ Judge evaluates goal achievement, not process matching
- ✅ Atlas learning loop active and persisting
- ✅ Learning re-injects into subsequent sessions
- ✅ Progress logging shows every conversation with running success rate
- ✅ All 1,200 conversations executed for each baseline
- ✅ Atlas completes full dataset with learning accumulation
- ✅ Results analysis generates comprehensive comparison report
- ✅ Task success rates calculated correctly
- ✅ Token usage and cost estimates provided
- ✅ Progress logged for every conversation with running success rate
- Smoke tests: ~15 minutes (5 conversations × 3 baselines + Atlas 5 scenarios)
- Full baseline evaluation: ~12-18 hours (1,200 conversations × 3 baselines)
- Atlas evaluation: ~6-8 hours (1,200 conversations with learning overhead)
- Results analysis: ~30 minutes
- Total: ~18-26 hours
Both baseline and Atlas evaluations support automatic resume functionality:
- Incremental Writing: Results are written immediately after each conversation completes
- Automatic Resume: If a run crashes or is interrupted, simply re-run the same command - it will automatically detect existing results and skip already-processed conversations
- Progress Preservation: Running success rates and ETAs account for previously completed conversations
How It Works:
- On startup, the evaluation checks if the output file already exists
- If it exists, loads existing results and identifies already-processed conversation IDs
- Filters out completed conversations from the remaining work
- Continues processing only the remaining conversations
- Appends new results to the existing file
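The resume logic above amounts to set subtraction on conversation IDs. A minimal sketch of the idea, assuming each result line records its ID under a `conversation_id` key (the actual key name in the repository's output may differ):

```python
import json
import os

def remaining_conversations(conversations, output_path, id_key="conversation_id"):
    """Filter out conversations whose IDs already appear in the output JSONL."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                done.add(json.loads(line)[id_key])
    return [conv for conv in conversations if conv[id_key] not in done]
```

Because only IDs are compared, re-running with a different `--model` against the same output file would silently skip work; keep one output file per configuration.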
Example Resume Scenario:

```bash
# First run processes 500 conversations, then crashes
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --output artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl

# Re-run the same command - it will automatically resume from conversation 501
# Logs will show: "Found 500 existing results, will resume from remaining conversations"
python3 -m src.evaluation.run_baseline \
  --conversations artifacts/deterministic/final_conversations_final_clean.jsonl \
  --agent claude \
  --model claude-sonnet-4-5-20250929 \
  --backend postgres \
  --output artifacts/evaluation/baseline_claude_sonnet_4_5.jsonl
```

Important Notes:
- Do NOT delete or modify the output file while a run is in progress
- If you want to start fresh, delete the output file before running
- Individual conversation failures are caught and logged, but don't stop the entire run
- Results are flushed to disk after each conversation for maximum crash safety
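Incremental writing is what makes crash recovery safe. A sketch of the append-and-flush pattern the notes above describe (illustrative, not the repository's writer):

```python
import json
import os

def append_result(output_path, record):
    """Append one result as a JSON line and force it to disk immediately,
    so a crash loses at most the conversation currently in flight."""
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # push the OS buffer to disk
```

Writing one complete JSON object per line is also what lets the resume logic re-read the file without any repair step.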
"ModuleNotFoundError: No module named 'atlas'"
- Solution: Install the Atlas SDK: `pip install -e external/atlas-sdk[dev]`
- Verify: `python3 -c "import atlas; print(atlas.__version__)"`
"Database connection failed"
- Solution: Verify PostgreSQL is running (`docker ps` or `pg_isready`)
- Check `.env` credentials match your database
- Verify the databases exist: `psql -l | grep -E "crm_sandbox|atlas"`
"STORAGE__DATABASE_URL not found"
- Solution: Verify `.env` has `STORAGE__DATABASE_URL` set
- Verify the Atlas SDK modification was applied (see `docs/SETUP_GUIDE.md` Step 4)
- Reload the environment: `set -a; source .env; set +a`
"Dataset file not found"
- Solution: Verify the dataset exists: `ls artifacts/deterministic/final_conversations_final_clean.jsonl`
- Check you're in the repository root directory
- Verify the branch has the dataset: `git log --oneline --all -- artifacts/deterministic/`
Judge Not Evaluating Correctly
- Check `OPENAI_API_KEY` is set and valid
- Verify the judge is enabled: `use_llm_judge=True` (default)
- Review judge rationale in the output JSONL
- Check API quota/credits available
Atlas Learning Not Persisting
- Verify `STORAGE__DATABASE_URL` is correct in `.env`
- Check the Atlas database connection (see `docs/SETUP_GUIDE.md` Step 8.2)
- Verify the Atlas SDK modification was applied (see `docs/SETUP_GUIDE.md` Step 4)
- Review learning state queries in logs
- Check the database schema is initialized: `psql -d atlas -c "\dt"`
Progress Logging Not Showing
- Ensure latest code with enhanced logging is used
- Check log level is INFO or DEBUG
- Verify output file is being written
- Check terminal supports Unicode (for ✓/✗ symbols)
Evaluation Running Slowly
- Check API rate limits (OpenAI, Anthropic, Gemini)
- Verify database connection pooling
- Monitor system resources (CPU, memory, disk I/O)
- Consider running evaluations in parallel on separate machines
UUID Serialization Errors
- Verify UUID serialization fix is in place (should be in latest code)
- Check `src/evaluation/llm_judge.py` has the `_serialize_for_json` method
- Ensure all UUIDs are converted to strings before JSON serialization
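The standard pattern for making UUIDs JSON-safe is a `default` hook that stringifies them. This sketch shows the general idea, not the repository's actual `_serialize_for_json` implementation:

```python
import json
import uuid

def json_default(obj):
    """Fallback serializer: stringify UUIDs that json.dumps cannot handle natively."""
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Usage: pass the hook via the `default` parameter of json.dumps
record = {"conversation_id": uuid.uuid4(), "success": True}
encoded = json.dumps(record, default=json_default)
```

Without such a hook, `json.dumps` raises `TypeError` the first time a raw `uuid.UUID` appears in a result record.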
Running Multiple Evaluations in Parallel
- Use separate output directories for each run
- Use separate PostgreSQL databases or ensure transaction isolation
- Monitor API rate limits across parallel runs
- Tag each run with unique identifier (timestamp, run number)
Organizing Multiple Evaluation Runs
```bash
# Create run-specific directories
mkdir -p artifacts/evaluation/run_$(date +%Y%m%d_%H%M%S)

# Use run-specific outputs
--output artifacts/evaluation/run_20251111_001/baseline_claude.jsonl
--output-dir artifacts/evaluation/run_20251111_001/atlas_full

# Document each run
echo "Run parameters: ..." > artifacts/evaluation/run_20251111_001/README.md
```

- Complete Setup Guide: `docs/SETUP_GUIDE.md` - step-by-step setup instructions
- Atlas Integration Details: `docs/atlas_integration.md` - Atlas-specific configuration
- Repository README: `README.md` - general repository information