AI Agent Evaluation Framework Integration

This document describes the changes made to enable this codebase to work with AI agent evaluation frameworks directly as an SDK instead of over API proxy.

Summary of Changes

1. All SQLAlchemy Models Have Pydantic Serialization

What Changed:

Added PydanticMixin to all base classes

2. Typed Operations Wrapper (Class-Based API)

What Changed:

Created BoxOperations class that encapsulates session management
AI agents don't need to pass session to every function call
Fixed parameter names to match actual operations.py signatures

Files Created:

src/services/box/database/typed_operations.py - Type-safe wrapper for Box operations

Usage:

from services.box.database.typed_operations import BoxOperations
from sqlalchemy.orm import Session

# Initialize once with session
ops = BoxOperations(session)

# AI agents call methods without passing session
user = ops.create_user(
    name="John Doe",
    login="john@example.com",
    job_title="Engineer"
)

folder = ops.create_folder(
    name="Reports",
    parent_id="0",
    user_id="user-123"
)

# All returned models have Pydantic serialization
folder_data = folder.model_dump()  # Returns dict for AI agent

For AI Agent SDKs:

# Anthropic Claude SDK
tools = [
    {
        "name": "create_folder",
        "description": "Create a new folder in Box",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Folder name"},
                "parent_id": {"type": "string", "description": "Parent folder ID"},
                "user_id": {"type": "string", "description": "Owner user ID"}
            },
            "required": ["name", "parent_id", "user_id"]
        }
    }
]

# Tool implementation
def execute_tool(tool_name, arguments):
    ops = BoxOperations(session)

    if tool_name == "create_folder":
        folder = ops.create_folder(**arguments)
        return folder.model_dump()  # Serialize for agent

3. State Management Utilities for Evals

What Changed:

Created unified interface for seeding, clearing, and diffing database state
Utilities work across all services

Files Created:

src/eval_platform/eval_utilities.py - State management for evals

Usage:

Taking Snapshots (Before/After Agent Actions)

from eval_platform.eval_utilities import create_snapshot, get_diff

# Before agent runs
before_snapshot = create_snapshot(session, "test_env", "before")

# Agent executes actions...
ops = BoxOperations(session)
folder = ops.create_folder(name="Reports", parent_id="0", user_id="user-123")

# After agent runs
after_snapshot = create_snapshot(session, "test_env", "after")

# Get diff
diff = get_diff(session_manager, "test_env", "before", "after")

# Analyze changes
print(f"Inserts: {len(diff.inserts)}")  # New rows
print(f"Updates: {len(diff.updates)}")  # Modified rows
print(f"Deletes: {len(diff.deletes)}")  # Deleted rows

# Each change includes __table__ to identify origin
for insert in diff.inserts:
    print(f"Created {insert['__table__']}: {insert['name']}")

Using EvalContext (Cleaner Pattern)

from eval_platform.eval_utilities import EvalContext

with EvalContext(session_manager, "test_env") as ctx:
    # Agent executes here
    ops = BoxOperations(session)
    folder = ops.create_folder(name="Reports", parent_id="0", user_id="user-123")

    # Get diff automatically
    assert len(ctx.inserts) == 1
    assert ctx.inserts[0]['name'] == 'Reports'
    assert ctx.inserts[0]['__table__'] == 'box_folders'

# Snapshots automatically cleaned up

Clearing State Between Tests

from eval_platform.eval_utilities import clear_environment

# Clear all data (keeps schema structure)
clear_environment(session, "test_env")

Service-Specific Integration Status

✅ Box Service

Status: Fully migrated
Typed Operations: BoxOperations with all CRUD operations
Tests: 5 passing SQLite integration tests
Files: src/services/box/database/typed_operations.py
Key Operations: create_user, create_folder, create_file, update_folder, delete_folder

✅ Calendar Service (Google Calendar)

Status: Fully migrated
Typed Operations: CalendarOperations with calendar, event, and user operations
Tests: 6 passing SQLite integration tests
Files: src/services/calendar/database/typed_operations.py
Key Operations: create_user, create_calendar, create_event, list_events, update_event
Notes: Requires user_id for all event operations (permissions checking)

✅ Slack Service

Status: Fully migrated
Typed Operations: SlackOperations with team, channel, message operations
Tests: 5 passing SQLite integration tests
Files: src/services/slack/database/typed_operations.py
Key Operations: create_team, create_user, create_channel, send_message, add_emoji_reaction
Notes: Message text field is message_text in database model

✅ Linear Service

Status: Comprehensive CRUD API complete with 80+ operations
Typed Operations: LinearOperations with full CRUD for 14 major entity types
Files:
- src/services/linear/database/typed_operations.py - 80+ operations (1,246 lines)
- src/services/linear/database/entity_defaults.py - Default factories (360 lines)
- examples/linear_crud_demo.py - Usage demonstration (280 lines)
- LINEAR_CRUD_SUMMARY.md - Detailed implementation notes
Key Operations (80+ methods total):
- Organization (2): create_organization, get_organization
- User (3): create_user, get_user, get_user_by_email
- Team (2): create_team, get_team
- Issue (4): Full CRUD with create, get, update, delete
- Comment (4): Full CRUD operations
- WorkflowState (2): Create and get workflow states
- Project (5): Full CRUD + list_projects
- ProjectMilestone (4): Full CRUD for milestones
- Cycle (4): Sprint/cycle management
- Initiative (4): High-level initiative tracking
- Document (4): Documentation management
- Attachment (3): File attachment handling
- IssueLabel (4): Label management with colors
- IssueRelation (4): Issue dependency tracking
Design Pattern: Entity defaults factory pattern
- Challenge: Linear's auto-generated GraphQL schema has extreme complexity
  - Organization: 102 required fields
  - Team: 50+ required fields
  - User: 30+ required fields
- Solution: entity_defaults.py provides factory functions
```
# Instead of manually setting 102 fields:
defaults = organization_defaults(name, url_key)
org = Organization(**defaults)
```
- All defaults support **kwargs for customization
AI Agent Tool Design:
- Clear type hints on all 80+ operations
- Comprehensive docstrings for each method
- Pydantic serialization: .model_dump(), .model_dump_json()
- Optional return types for proper None handling
Production Use:
- Pre-seed baseline entities (org, teams, states) in setup
- Use typed operations for standard CRUD workflows
- Complete entity defaults incrementally as needed
- See LINEAR_CRUD_SUMMARY.md for full details

Testing

All services include comprehensive SQLite integration tests demonstrating:

Basic CRUD operations
State management with snapshots
Environment clearing
Complete agent evaluation workflows
Multiple operations with diff tracking

Run all working tests:

cd backend
DATABASE_URL=sqlite:///dummy.db PYTHONPATH=src uv run pytest tests/test_sqlite_integration.py tests/test_calendar_sqlite_integration.py tests/test_slack_sqlite_integration.py -v

Test Results (as of migration):

✅ Box: 5/5 passing
✅ Calendar: 6/6 passing
✅ Slack: 5/5 passing
✅ Linear: Comprehensive CRUD operations implemented (80+ methods)

Total: 16/16 tests passing for Box, Calendar, and Slack services Linear: Full CRUD API implemented with entity defaults pattern

Database Compatibility

All services now support both PostgreSQL and SQLite:

Replaced PostgreSQL-specific JSONB with database-agnostic JSON across all schemas
Eval utilities auto-detect database dialect and use appropriate SQL
Integration tests use SQLite for fast, isolated testing
Production uses PostgreSQL with full schema isolation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AI Agent Evaluation Framework Integration

Summary of Changes

1. All SQLAlchemy Models Have Pydantic Serialization

2. Typed Operations Wrapper (Class-Based API)

3. State Management Utilities for Evals

Taking Snapshots (Before/After Agent Actions)

Using EvalContext (Cleaner Pattern)

Clearing State Between Tests

Service-Specific Integration Status

✅ Box Service

✅ Calendar Service (Google Calendar)

✅ Slack Service

✅ Linear Service

Testing

Database Compatibility

FilesExpand file tree

AI_AGENT_INTEGRATION.md

Latest commit

History

AI_AGENT_INTEGRATION.md

File metadata and controls

AI Agent Evaluation Framework Integration

Summary of Changes

1. All SQLAlchemy Models Have Pydantic Serialization

2. Typed Operations Wrapper (Class-Based API)

3. State Management Utilities for Evals

Taking Snapshots (Before/After Agent Actions)

Using EvalContext (Cleaner Pattern)

Clearing State Between Tests

Service-Specific Integration Status

✅ Box Service

✅ Calendar Service (Google Calendar)

✅ Slack Service

✅ Linear Service

Testing

Database Compatibility