Skip to content

[IMPROVE] AI Evals #490

Description

@sahilds1

Title

Background

Set of representative questions:

#345 (comment)

#411 (comment)

https://github.com/CodeForPhilly/balancer-main/tree/develop/evaluation

Current State

Acceptance Criteria

  • []

Approach

Start with error analysis, not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one domain expert who understands your users as your quality decision maker (a “benevolent dictator”).

OpenAI API Dashboard has total duration and cost metrics

References

Risks and Rollback

Risks

Risk Severity Mitigation
Assistant endpoint behavior changes from the refactor (request path now flows through run_assistant). Medium The `(response_output_text,
final_response_id)` response contract is unchanged; service-level unit tests pass. Residual: OpenAI and the DB are mocked in tests, so the live retrieval
path is not exercised in CI — spot-check a real query after deploy.
Missing input validation: an omitted/blank message is not rejected — it becomes the literal string "None" (str(None)) in the model input, producing
confusing output. Low Known and deferred; tracked by a TODO in views.py. No crash or data impact. Fix is to return 400 on omitted/blank message (+
add 400 to the schema).
Eval tooling (eval_assistant.py, review.ipynb) has first-run fragility — sys.path insert depth and the pandas dependency. Low Offline-only: not
in the production request path, so zero runtime impact. The pandas import is deferred into main() so it cannot affect the app or test collection; the
sys.path caveat is documented inline.
Reduced test count — low-signal glue/framework tests were removed. Low Real-logic coverage (tool loop, retrieval dispatch, orchestration decisions) is
retained. View HTTP-contract coverage is deferred to a future DB-backed integration test (TODO).
Full-suite pytest surfaces pre-existing pgvector test-DB errors. Low Not introduced by this branch; the assistant tests are DB-free and pass when
scoped (pytest api/views/assistant/).

Rollback

  • No database migrations are introduced in this branch — the changes are code, tests, and offline tooling only. Rollback is therefore code-only: no
    schema or data reversal required.
  • Revert the merge commit: git revert -m 1 <merge_sha>.
  • The eval script and notebook are independent of the endpoint and can be removed on their own without touching the request path.
  • The pre-refactor monolithic view remains in git history if a targeted hotfix is preferred over a full revert.

Two things to confirm before you publish so the doc stays honest:

  • "No migrations" — quick check: git diff develop --name-only | grep migrations should be empty. I'm confident it is, but verify since the claim anchors the
    rollback story.
  • The input-validation risk is the one a reviewer is most likely to push on. If you'd rather not ship a known-gap, that's the cue to implement the 400 now
    instead of deferring — say the word and I'll apply the guard + schema update we discussed.

Screenshots / Recordings

Related PR

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status
Priority

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions