Skip to content

Latest commit

 

History

History
240 lines (195 loc) · 8.36 KB

File metadata and controls

240 lines (195 loc) · 8.36 KB

AI Agent Evaluation Framework Integration

This document describes the changes made to enable this codebase to work with AI agent evaluation frameworks directly as an SDK instead of over API proxy.

Summary of Changes

1. All SQLAlchemy Models Have Pydantic Serialization

What Changed:

  • Added PydanticMixin to all base classes

2. Typed Operations Wrapper (Class-Based API)

What Changed:

  • Created BoxOperations class that encapsulates session management
  • AI agents don't need to pass session to every function call
  • Fixed parameter names to match actual operations.py signatures

Files Created:

  • src/services/box/database/typed_operations.py - Type-safe wrapper for Box operations

Usage:

from services.box.database.typed_operations import BoxOperations
from sqlalchemy.orm import Session

# Initialize once with session
ops = BoxOperations(session)

# AI agents call methods without passing session
user = ops.create_user(
    name="John Doe",
    login="john@example.com",
    job_title="Engineer"
)

folder = ops.create_folder(
    name="Reports",
    parent_id="0",
    user_id="user-123"
)

# All returned models have Pydantic serialization
folder_data = folder.model_dump()  # Returns dict for AI agent

For AI Agent SDKs:

# Anthropic Claude SDK
tools = [
    {
        "name": "create_folder",
        "description": "Create a new folder in Box",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Folder name"},
                "parent_id": {"type": "string", "description": "Parent folder ID"},
                "user_id": {"type": "string", "description": "Owner user ID"}
            },
            "required": ["name", "parent_id", "user_id"]
        }
    }
]

# Tool implementation
def execute_tool(tool_name, arguments):
    ops = BoxOperations(session)

    if tool_name == "create_folder":
        folder = ops.create_folder(**arguments)
        return folder.model_dump()  # Serialize for agent

3. State Management Utilities for Evals

What Changed:

  • Created unified interface for seeding, clearing, and diffing database state
  • Utilities work across all services

Files Created:

  • src/eval_platform/eval_utilities.py - State management for evals

Usage:

Taking Snapshots (Before/After Agent Actions)

from eval_platform.eval_utilities import create_snapshot, get_diff

# Before agent runs
before_snapshot = create_snapshot(session, "test_env", "before")

# Agent executes actions...
ops = BoxOperations(session)
folder = ops.create_folder(name="Reports", parent_id="0", user_id="user-123")

# After agent runs
after_snapshot = create_snapshot(session, "test_env", "after")

# Get diff
diff = get_diff(session_manager, "test_env", "before", "after")

# Analyze changes
print(f"Inserts: {len(diff.inserts)}")  # New rows
print(f"Updates: {len(diff.updates)}")  # Modified rows
print(f"Deletes: {len(diff.deletes)}")  # Deleted rows

# Each change includes __table__ to identify origin
for insert in diff.inserts:
    print(f"Created {insert['__table__']}: {insert['name']}")

Using EvalContext (Cleaner Pattern)

from eval_platform.eval_utilities import EvalContext

with EvalContext(session_manager, "test_env") as ctx:
    # Agent executes here
    ops = BoxOperations(session)
    folder = ops.create_folder(name="Reports", parent_id="0", user_id="user-123")

    # Get diff automatically
    assert len(ctx.inserts) == 1
    assert ctx.inserts[0]['name'] == 'Reports'
    assert ctx.inserts[0]['__table__'] == 'box_folders'

# Snapshots automatically cleaned up

Clearing State Between Tests

from eval_platform.eval_utilities import clear_environment

# Clear all data (keeps schema structure)
clear_environment(session, "test_env")

Service-Specific Integration Status

✅ Box Service

  • Status: Fully migrated
  • Typed Operations: BoxOperations with all CRUD operations
  • Tests: 5 passing SQLite integration tests
  • Files: src/services/box/database/typed_operations.py
  • Key Operations: create_user, create_folder, create_file, update_folder, delete_folder

✅ Calendar Service (Google Calendar)

  • Status: Fully migrated
  • Typed Operations: CalendarOperations with calendar, event, and user operations
  • Tests: 6 passing SQLite integration tests
  • Files: src/services/calendar/database/typed_operations.py
  • Key Operations: create_user, create_calendar, create_event, list_events, update_event
  • Notes: Requires user_id for all event operations (permissions checking)

✅ Slack Service

  • Status: Fully migrated
  • Typed Operations: SlackOperations with team, channel, message operations
  • Tests: 5 passing SQLite integration tests
  • Files: src/services/slack/database/typed_operations.py
  • Key Operations: create_team, create_user, create_channel, send_message, add_emoji_reaction
  • Notes: Message text field is message_text in database model

✅ Linear Service

  • Status: Comprehensive CRUD API complete with 80+ operations
  • Typed Operations: LinearOperations with full CRUD for 14 major entity types
  • Files:
    • src/services/linear/database/typed_operations.py - 80+ operations (1,246 lines)
    • src/services/linear/database/entity_defaults.py - Default factories (360 lines)
    • examples/linear_crud_demo.py - Usage demonstration (280 lines)
    • LINEAR_CRUD_SUMMARY.md - Detailed implementation notes
  • Key Operations (80+ methods total):
    • Organization (2): create_organization, get_organization
    • User (3): create_user, get_user, get_user_by_email
    • Team (2): create_team, get_team
    • Issue (4): Full CRUD with create, get, update, delete
    • Comment (4): Full CRUD operations
    • WorkflowState (2): Create and get workflow states
    • Project (5): Full CRUD + list_projects
    • ProjectMilestone (4): Full CRUD for milestones
    • Cycle (4): Sprint/cycle management
    • Initiative (4): High-level initiative tracking
    • Document (4): Documentation management
    • Attachment (3): File attachment handling
    • IssueLabel (4): Label management with colors
    • IssueRelation (4): Issue dependency tracking
  • Design Pattern: Entity defaults factory pattern
    • Challenge: Linear's auto-generated GraphQL schema has extreme complexity
      • Organization: 102 required fields
      • Team: 50+ required fields
      • User: 30+ required fields
    • Solution: entity_defaults.py provides factory functions
      # Instead of manually setting 102 fields:
      defaults = organization_defaults(name, url_key)
      org = Organization(**defaults)
    • All defaults support **kwargs for customization
  • AI Agent Tool Design:
    • Clear type hints on all 80+ operations
    • Comprehensive docstrings for each method
    • Pydantic serialization: .model_dump(), .model_dump_json()
    • Optional return types for proper None handling
  • Production Use:
    • Pre-seed baseline entities (org, teams, states) in setup
    • Use typed operations for standard CRUD workflows
    • Complete entity defaults incrementally as needed
    • See LINEAR_CRUD_SUMMARY.md for full details

Testing

All services include comprehensive SQLite integration tests demonstrating:

  • Basic CRUD operations
  • State management with snapshots
  • Environment clearing
  • Complete agent evaluation workflows
  • Multiple operations with diff tracking

Run all working tests:

cd backend
DATABASE_URL=sqlite:///dummy.db PYTHONPATH=src uv run pytest tests/test_sqlite_integration.py tests/test_calendar_sqlite_integration.py tests/test_slack_sqlite_integration.py -v

Test Results (as of migration):

  • ✅ Box: 5/5 passing
  • ✅ Calendar: 6/6 passing
  • ✅ Slack: 5/5 passing
  • ✅ Linear: Comprehensive CRUD operations implemented (80+ methods)

Total: 16/16 tests passing for Box, Calendar, and Slack services Linear: Full CRUD API implemented with entity defaults pattern

Database Compatibility

All services now support both PostgreSQL and SQLite:

  • Replaced PostgreSQL-specific JSONB with database-agnostic JSON across all schemas
  • Eval utilities auto-detect database dialect and use appropriate SQL
  • Integration tests use SQLite for fast, isolated testing
  • Production uses PostgreSQL with full schema isolation