🔴 Priority: HIGH | Type: Testing
1. SUMMARY
- Current integration tests use mocked models and databases, missing true end-to-end validation of the full pipeline (real model inference → real database execution → result verification).
- Impact: Cannot verify actual accuracy, performance regressions, or model behavior changes in CI. Production bugs may slip through despite 695 unit tests passing.
2. SYSTEM CONTEXT
```
tests/
├── integration/
│   ├── test_api.py          # API tests with mocked inference
│   └── test_database.py     # DB tests with SQLite
├── unit/                    # 40+ unit test files (all mocked)
└── conftest.py              # Fixtures with mocked models
app/
├── agent/engine.py          # Agent orchestration (untested E2E)
├── text2sql_engine.py       # SQL generation (untested E2E)
└── streaming.py             # SSE streaming (untested E2E)
models/
├── loader.py                # Model loading (tested in isolation)
└── inference.py             # Inference (tested with mocks)
```
Current state: Unit tests mock InferenceEngine, ModelLoader, and database connections. No tests run actual model inference against real databases.
3. CURRENT STATE (with code)
📍 File: tests/conftest.py:45-60 (approximate)
```python
@pytest.fixture
def mock_inference_engine():
    """Mock inference engine for testing."""
    engine = MagicMock(spec=InferenceEngine)
    engine.generate.return_value = InferenceResult(
        generated_text="SELECT * FROM users",
        sql="SELECT * FROM users",
        confidence=0.95,
        # ... other fields elided
    )
    return engine
```
All tests use mocked inference - no real model calls.
📍 File: tests/integration/test_api.py:49-80
```python
class TestQueryEndpoint:
    """Tests for query generation endpoint."""
    # Uses test_client fixture which mocks the model
    # Never runs actual inference
```
Integration tests don't test the actual inference pipeline.
4. PROPOSED SOLUTION
Create a dedicated E2E test suite that:
- Loads the actual model (or a smaller test model)
- Connects to a real test database with known schema
- Runs natural language queries through the full pipeline
- Verifies SQL correctness and result accuracy
📍 File: tests/e2e/test_full_pipeline.py (NEW)
```python
# Requires pytest-asyncio (e.g. asyncio_mode = "auto") for async fixtures/tests.
import pytest

from app.agent import get_agent_engine
from db.connection import get_database
from tests.e2e.seed_data import (
    EXPECTED_USER_COUNT,
    cleanup_test_data,
    seed_test_data,
)


@pytest.fixture(scope="module")
async def real_agent():
    """Load real agent with actual model."""
    engine = await get_agent_engine()
    yield engine


@pytest.fixture(scope="module")
async def test_db():
    """Real test database with known data."""
    db = await get_database()
    # Seed with known test data
    await seed_test_data(db)
    yield db
    await cleanup_test_data(db)


class TestFullPipeline:
    @pytest.mark.e2e
    @pytest.mark.slow
    async def test_simple_select(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Show all users",
            database_id="test",
            execute=True,
        )
        assert result.sql is not None
        assert "SELECT" in result.sql.upper()
        assert "users" in result.sql.lower()
        assert len(result.results) == EXPECTED_USER_COUNT

    @pytest.mark.e2e
    async def test_aggregation(self, real_agent, test_db):
        result = await real_agent.generate_sql(
            natural_query="Count orders by status",
            database_id="test",
            execute=True,
        )
        assert "GROUP BY" in result.sql.upper()
        assert result.confidence >= 0.7
```
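The fixtures above call seed_test_data and cleanup_test_data, which the files table places in tests/e2e/seed_data.py. A minimal sketch of those helpers, using a synchronous in-memory sqlite3 stand-in (the real helpers would go through the project's async db.connection API, and the users schema here is an assumption for illustration):

```python
# tests/e2e/seed_data.py -- sketch; the sqlite3 connection and users schema
# are stand-ins for the project's real async database and test schema.
import sqlite3

TEST_USERS = [(1, "alice"), (2, "bob"), (3, "carol")]
EXPECTED_USER_COUNT = len(TEST_USERS)


def seed_test_data(conn: sqlite3.Connection) -> None:
    """Create the known test schema and insert deterministic rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", TEST_USERS)
    conn.commit()


def cleanup_test_data(conn: sqlite3.Connection) -> None:
    """Drop seeded tables so runs never pollute each other."""
    conn.execute("DROP TABLE IF EXISTS users")
    conn.commit()
```

Keeping the seed rows and EXPECTED_USER_COUNT in one module lets assertions like the one in test_simple_select stay in sync with the data they check.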
5. IMPLEMENTATION CHECKLIST
Phase 1: Infrastructure
- tests/e2e/ directory structure
Phase 2: Core E2E Tests
Phase 3: Advanced E2E Tests
Phase 4: Performance E2E Tests
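The Phase 1 pyproject.toml change could register the markers used above; a sketch (section names follow pytest's documented ini options, the marker descriptions are assumptions):

```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
markers = [
    "e2e: end-to-end tests requiring a real model and database",
    "slow: tests that load the full model",
]
```

With markers registered, the fast CI job can run `pytest -m "not e2e"` while the dedicated E2E job runs `pytest -m e2e tests/e2e/`.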
6. FILES TO MODIFY TABLE
| File | Lines | Action | Description |
|------|-------|--------|-------------|
| tests/e2e/__init__.py | NEW | Create | E2E test package |
| tests/e2e/conftest.py | NEW | Create | Real fixtures (model, database) |
| tests/e2e/test_full_pipeline.py | NEW | Create | Core pipeline tests |
| tests/e2e/test_streaming_e2e.py | NEW | Create | Streaming endpoint tests |
| tests/e2e/test_multidb_e2e.py | NEW | Create | Multi-database tests |
| tests/e2e/seed_data.py | NEW | Create | Test data seeding utilities |
| pyproject.toml | TBD | Modify | Add E2E markers and config |
| .github/workflows/ci.yml | TBD | Modify | Add E2E test job |
| docker-compose.test.yml | NEW | Create | Test environment with real DB |
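docker-compose.test.yml could be sketched along these lines (the Postgres image, service name, and port mapping are assumptions; substitute the project's actual database engine):

```yaml
services:
  test-db:
    image: postgres:16
    environment:
      POSTGRES_DB: test
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
    ports:
      - "5433:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 2s
      retries: 15
```

The healthcheck lets the CI job wait for a ready database before seeding, instead of sleeping a fixed duration.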
7. RISK ASSESSMENT
| Risk | Impact | Mitigation |
|------|--------|------------|
| E2E tests are slow (model loading) | 🟡 | Run separately from unit tests; use module-scoped fixtures |
| GPU required for real inference | 🟡 | Support CPU-only mode with smaller model; GitHub GPU runners |
| Flaky tests due to model non-determinism | 🟡 | Set random seeds; use temperature=0; allow confidence ranges |
| Test data pollution | 🟢 | Use isolated test database; cleanup fixtures |
8. RELATED CONTEXT
- Current test suite: 695 unit tests, all mocked
- CI pipeline: .github/workflows/ci.yml
- Model loading: models/loader.py
- Agent engine: app/agent/engine.py
- Benchmark datasets (Spider, WikiSQL) could be used for accuracy validation in Phase 4