🔴 Priority: HIGH | Type: Testing
1. SUMMARY
- Current integration tests use mocked models and databases, missing true end-to-end validation of the full pipeline (real model inference → real database execution → result verification).
- Impact: Cannot verify actual accuracy, performance regressions, or model behavior changes in CI. Production bugs may slip through despite 695 unit tests passing.
2. SYSTEM CONTEXT
```
tests/
├── integration/
│   ├── test_api.py          # API tests with mocked inference
│   └── test_database.py     # DB tests with SQLite
├── unit/                    # 40+ unit test files (all mocked)
└── conftest.py              # Fixtures with mocked models
app/
├── agent/engine.py          # Agent orchestration (untested E2E)
├── text2sql_engine.py       # SQL generation (untested E2E)
└── streaming.py             # SSE streaming (untested E2E)
models/
├── loader.py                # Model loading (tested in isolation)
└── inference.py             # Inference (tested with mocks)
```
Current state: Unit tests mock InferenceEngine, ModelLoader, and database connections. No tests run actual model inference against real databases.
3. CURRENT STATE (with code)
📍 File: tests/conftest.py:45-60 (approximate)
```python
@pytest.fixture
def mock_inference_engine():
    """Mock inference engine for testing."""
    engine = MagicMock(spec=InferenceEngine)
    engine.generate.return_value = InferenceResult(
        generated_text="SELECT * FROM users",
        sql="SELECT * FROM users",
        confidence=0.95,
        # ... other fields elided
    )
    return engine
```
All tests use mocked inference - no real model calls.
📍 File: tests/integration/test_api.py:49-80
```python
class TestQueryEndpoint:
    """Tests for query generation endpoint."""
    # Uses test_client fixture which mocks the model
    # Never runs actual inference
```
Integration tests don't test the actual inference pipeline.
4. PROPOSED SOLUTION
Create a dedicated E2E test suite that:
- Loads the actual model (or a smaller test model)
- Connects to a real test database with known schema
- Runs natural language queries through the full pipeline
- Verifies SQL correctness and result accuracy
📍 File: tests/e2e/test_full_pipeline.py (NEW)
```python
# Requires pytest-asyncio (e.g. asyncio_mode = "auto") for async fixtures/tests.
import pytest

from app.agent import get_agent_engine
from db.connection import get_database
from tests.e2e.seed_data import (
    EXPECTED_USER_COUNT,
    cleanup_test_data,
    seed_test_data,
)


@pytest.fixture(scope="module")
async def real_agent():
    """Load real agent with actual model."""
    engine = await get_agent_engine()
    yield engine


@pytest.fixture(scope="module")
async def test_db():
    """Real test database with known data."""
    db = await get_database()
    # Seed with known test data
    await seed_test_data(db)
    yield db
    await cleanup_test_data(db)


class TestFullPipeline:
    @pytest.mark.e2e
    @pytest.mark.slow
    async def test_simple_select(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=True,
        )
        assert result.sql is not None
        assert "SELECT" in result.sql.upper()
        assert "users" in result.sql.lower()
        assert len(result.results) == EXPECTED_USER_COUNT

    @pytest.mark.e2e
    async def test_aggregation(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Count orders by status",
            database_id="test",
            execute=True,
        )
        assert "GROUP BY" in result.sql.upper()
        assert result.confidence >= 0.7
```
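The fixtures above call seed_test_data and cleanup_test_data, which the files table places in tests/e2e/seed_data.py. A minimal sketch of those helpers, using a synchronous in-memory sqlite3 stand-in (the real helpers would go through the project's async db.connection API, and the users schema here is an assumption for illustration):

```python
# tests/e2e/seed_data.py -- sketch; the sqlite3 connection and users schema
# are stand-ins for the project's real async database and test schema.
import sqlite3

TEST_USERS = [(1, "alice"), (2, "bob"), (3, "carol")]
EXPECTED_USER_COUNT = len(TEST_USERS)


def seed_test_data(conn: sqlite3.Connection) -> None:
    """Create the known test schema and insert deterministic rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", TEST_USERS)
    conn.commit()


def cleanup_test_data(conn: sqlite3.Connection) -> None:
    """Drop seeded tables so runs never pollute each other."""
    conn.execute("DROP TABLE IF EXISTS users")
    conn.commit()
```

Keeping the seed rows and EXPECTED_USER_COUNT in one module lets assertions like the one in test_simple_select stay in sync with the data they check.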
5. IMPLEMENTATION CHECKLIST
Phase 1: Infrastructure
- tests/e2e/ directory structure
Phase 2: Core E2E Tests
Phase 3: Advanced E2E Tests
Phase 4: Performance E2E Tests
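The Phase 1 pyproject.toml change could register the markers used above; a sketch (section names follow pytest's documented ini options, the marker descriptions are assumptions):

```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
markers = [
    "e2e: end-to-end tests requiring a real model and database",
    "slow: tests that load the full model",
]
```

With markers registered, the fast CI job can run `pytest -m "not e2e"` while the dedicated E2E job runs `pytest -m e2e tests/e2e/`.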
6. FILES TO MODIFY TABLE
| File | Lines | Action | Description |
|------|-------|--------|-------------|
| tests/e2e/__init__.py | NEW | Create | E2E test package |
| tests/e2e/conftest.py | NEW | Create | Real fixtures (model, database) |
| tests/e2e/test_full_pipeline.py | NEW | Create | Core pipeline tests |
| tests/e2e/test_streaming_e2e.py | NEW | Create | Streaming endpoint tests |
| tests/e2e/test_multidb_e2e.py | NEW | Create | Multi-database tests |
| tests/e2e/seed_data.py | NEW | Create | Test data seeding utilities |
| pyproject.toml | TBD | Modify | Add E2E markers and config |
| .github/workflows/ci.yml | TBD | Modify | Add E2E test job |
| docker-compose.test.yml | NEW | Create | Test environment with real DB |
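docker-compose.test.yml could be sketched along these lines (the Postgres image, service name, and port mapping are assumptions; substitute the project's actual database engine):

```yaml
services:
  test-db:
    image: postgres:16
    environment:
      POSTGRES_DB: test
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
    ports:
      - "5433:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 2s
      retries: 15
```

The healthcheck lets the CI job wait for a ready database before seeding, instead of sleeping a fixed duration.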
7. RISK ASSESSMENT
| Risk | Impact | Mitigation |
|------|--------|------------|
| E2E tests are slow (model loading) | 🟡 | Run separately from unit tests; use module-scoped fixtures |
| GPU required for real inference | 🟡 | Support CPU-only mode with smaller model; GitHub GPU runners |
| Flaky tests due to model non-determinism | 🟡 | Set random seeds; use temperature=0; allow confidence ranges |
| Test data pollution | 🟢 | Use isolated test database; cleanup fixtures |
8. RELATED CONTEXT
- Current test suite: 695 unit tests, all mocked
- CI pipeline: .github/workflows/ci.yml
- Model loading: models/loader.py
- Agent engine: app/agent/engine.py
- Benchmark datasets (Spider, WikiSQL) could be used for accuracy validation in Phase 4