[IMPROVE] AI Evals

## Title


## Background


Set of representative questions:

https://github.com/CodeForPhilly/balancer-main/issues/345#issuecomment-3433329904

https://github.com/CodeForPhilly/balancer-main/issues/411#issuecomment-3712677508

https://github.com/CodeForPhilly/balancer-main/tree/develop/evaluation


## Current State





## Acceptance Criteria
- [ ] 

## Approach


Start with [error analysis](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed), not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one [domain expert](https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs) who understands your users as your quality decision maker (a “[benevolent dictator](https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs)”).
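
The manual review step above can be made repeatable with a small sampling helper. This is a minimal sketch, not part of the existing `evaluation` tooling: the function names and the `question` / `answer` record fields are assumptions for illustration, and the `verdict` / `notes` columns are placeholders for the domain expert to fill in.

```python
import csv
import random
import sys


def sample_for_review(records, k=30, seed=0):
    """Pick up to k records at random for manual review.

    A fixed seed makes the sample reproducible, so two reviewers
    (or two runs) look at the same outputs.
    """
    rng = random.Random(seed)
    k = min(k, len(records))  # never ask for more than we have
    return rng.sample(records, k)


def write_review_sheet(records, stream):
    """Write a CSV review sheet with empty verdict/notes columns."""
    writer = csv.DictWriter(
        stream, fieldnames=["question", "answer", "verdict", "notes"]
    )
    writer.writeheader()
    for record in records:
        writer.writerow(
            {
                "question": record["question"],  # hypothetical field name
                "answer": record["answer"],      # hypothetical field name
                "verdict": "",                   # filled in by the reviewer
                "notes": "",                     # free-text failure notes
            }
        )


if __name__ == "__main__":
    # Placeholder data; in practice these would be logged LLM outputs.
    outputs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
    write_review_sheet(sample_for_review(outputs, k=30), sys.stdout)
```

Keeping the sheet as plain CSV means the reviewer can annotate it in any spreadsheet tool, and the free-text notes can later be grouped into failure categories.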

The OpenAI API dashboard already reports total duration and cost metrics, so those do not need custom tooling.



## References


## Risks and Rollback


## Risks

| Risk | Severity | Mitigation |
|------|----------|------------|
| Assistant endpoint behavior changes from the refactor (request path now flows through `run_assistant`). | Medium | The `(response_output_text, final_response_id)` response contract is unchanged; service-level unit tests pass. Residual: OpenAI and the DB are mocked in tests, so the live retrieval path is not exercised in CI. Spot-check a real query after deploy. |
| Missing input validation: an omitted/blank `message` is not rejected; it becomes the literal string `"None"` (`str(None)`) in the model input, producing confusing output. | Low | Known and deferred; tracked by a TODO in `views.py`. No crash or data impact. Fix is to return 400 on omitted/blank `message` (and add 400 to the schema). |
| Eval tooling (`eval_assistant.py`, `review.ipynb`) has first-run fragility: `sys.path` insert depth and the `pandas` dependency. | Low | Offline-only: not in the production request path, so zero runtime impact. The `pandas` import is deferred into `main()` so it cannot affect the app or test collection; the `sys.path` caveat is documented inline. |
| Reduced test count: low-signal glue/framework tests were removed. | Low | Real-logic coverage (tool loop, retrieval dispatch, orchestration decisions) is retained. View HTTP-contract coverage is deferred to a future DB-backed integration test (TODO). |
| Full-suite `pytest` surfaces pre-existing pgvector test-DB errors. | Low | Not introduced by this branch; the assistant tests are DB-free and pass when scoped (`pytest api/views/assistant/`). |
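
The deferred input-validation fix from the table above could look roughly like the guard below. This is a framework-free sketch under stated assumptions: `validate_message` is a hypothetical helper, not the code in `views.py`, and the real view would translate the returned status into its own HTTP response type.

```python
from typing import Any, Optional


def validate_message(payload: dict[str, Any]) -> tuple[int, Optional[str]]:
    """Return (status_code, cleaned_message) for a request payload.

    An omitted, non-string, or blank `message` yields 400 instead of
    being passed through str(), which would turn None into "None".
    """
    raw = payload.get("message")
    if not isinstance(raw, str) or not raw.strip():
        return 400, None
    return 200, raw.strip()


if __name__ == "__main__":
    print(validate_message({}))                    # omitted message
    print(validate_message({"message": "   "}))    # blank message
    print(validate_message({"message": " hi "}))   # valid message
```

Rejecting at the boundary keeps the `"None"` string out of the model input entirely, rather than trying to detect it downstream.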
  
  ## Rollback

  - **No database migrations** are introduced in this branch — the changes are code, tests, and offline tooling only. Rollback is therefore **code-only**: no
  schema or data reversal required.
  - Revert the merge commit: `git revert -m 1 <merge_sha>`.
  - The eval script and notebook are independent of the endpoint and can be removed on their own without touching the request path.
  - The pre-refactor monolithic view remains in git history if a targeted hotfix is preferred over a full revert.

  Before publishing, confirm two things:
  - **No migrations:** `git diff develop --name-only | grep migrations` should print nothing. Verify this, since the claim anchors the rollback story.
  - **Input validation:** this is the risk a reviewer is most likely to push on. If shipping a known gap is not acceptable, implement the 400 guard and the schema update now instead of deferring.



## Screenshots / Recordings


## Related PR
