add eval capes to sdk #460

Merged
luke-e-schaefer merged 14 commits into master from add-eval-capabilities
Jun 10, 2026

Conversation

@luke-e-schaefer luke-e-schaefer commented May 12, 2026

Contributor

resolves https://linear.app/scale-epd/issue/DE-7460

tests won't pass until https://github.com/scaleapi/scaleapi/pull/142963 is merged

Greptile Summary

This PR adds a new Evaluations V2 feature to the SDK, exposing COCO-style mAP/confusion-matrix/PR-curve metrics for model runs via a new EvaluationV2 resource and supporting DTOs.

  • NucleusClient gains create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2; the EvaluationV2 object exposes wait_for_completion, charts, examples, delete, and refresh, all following existing SDK conventions for raw-response deletes, is not None payload guards, and DictCompatibleModel DTOs.
  • All previously nullable server fields (iou, prediction_metadata, item_metadata) are correctly typed as Optional in EvaluationV2MatchExample, preventing ValidationError on FN rows.
  • The _camelize_filter_value helper intentionally skips recursion into MetadataPredicate.value payloads, which is verified by a dedicated unit test.
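
The "skip recursion into the predicate value" behavior described in the last bullet can be illustrated with a minimal sketch. This is not the SDK's `_camelize_filter_value`; the function names, the `"value"` key, and the sample filter shape are assumptions made for illustration.

```python
from typing import Any


def snake_to_camel(name: str) -> str:
    """Convert a snake_case key such as 'field_name' to 'fieldName'."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def camelize_filter_value(obj: Any) -> Any:
    """Recursively camelize dict keys, leaving any 'value' payload untouched."""
    if isinstance(obj, list):
        return [camelize_filter_value(item) for item in obj]
    if isinstance(obj, dict):
        out = {}
        for key, val in obj.items():
            # Assumption: the user-supplied predicate payload lives under "value",
            # so it is copied as-is instead of being recursed into.
            out[snake_to_camel(key)] = (
                val if key == "value" else camelize_filter_value(val)
            )
        return out
    return obj


# Hypothetical filter shape, for illustration only.
filters = {
    "item_metadata": [
        {"field_name": "weather", "op": "eq", "value": {"sky_state": "clear"}}
    ]
}
camelized = camelize_filter_value(filters)
```

Here the outer keys become `itemMetadata` and `fieldName`, while the user-supplied `sky_state` key inside `value` is left as written, so a metadata key chosen by the user is not silently renamed.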

Confidence Score: 5/5

Safe to merge; the new EvaluationV2 surface is additive, follows existing SDK patterns, and all previously flagged nullable-field bugs are resolved.

The implementation correctly handles Optional fields on FN/FP rows, uses is not None guards for empty-list payloads, delegates deletes through the established raw-response pathway (which still surfaces HTTP errors via handle_bad_response), and includes thorough unit-test coverage for every new method. The only finding is a wrong tag URL in the CHANGELOG heading.
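
A short sketch of why the `is not None` guard matters for empty-list payloads. The `build_payload` helper and the camelCase payload keys are hypothetical; only the `allowed_label_matches` parameter name comes from this PR.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    model_run_id: str,
    allowed_label_matches: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build a request body, keeping an explicitly passed empty list."""
    payload: Dict[str, Any] = {"modelRunId": model_run_id}  # hypothetical key
    # `is not None` keeps an explicit empty list in the payload;
    # a plain truthiness check (`if allowed_label_matches:`) would drop it.
    if allowed_label_matches is not None:
        payload["allowedLabelMatches"] = allowed_label_matches  # hypothetical key
    return payload


omitted = build_payload("run_1")
explicit_empty = build_payload("run_1", [])
```

With this guard, "argument not provided" and "caller explicitly passed an empty list" produce different request bodies, which a truthiness check would conflate.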

CHANGELOG.md — the 0.18.5 heading links to the v0.18.4 tag.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nucleus/evaluation_v2.py` | New EvaluationV2 resource with wait/charts/examples/delete/refresh; correct null checks, proper status enum comparisons, and raw-response delete pattern matching the rest of the SDK. |
| `nucleus/data_transfer_object/evaluation_v2.py` | New DTOs for filters, charts, and match examples; all nullable server-side fields correctly typed as Optional, _camelize_filter_value helper intentionally preserves predicate value payloads. |
| `nucleus/__init__.py` | Adds create_evaluation_v2, get_evaluation_v2, list_evaluations_v2 to NucleusClient; uses is not None for allowed_label_matches guard (correct), exports new types in `__all__`. |
| `tests/test_evaluation_v2.py` | Comprehensive unit tests for all new paths: filter serialization, from_json parsing, wait_for_completion polling, delete positional-arg passthrough, and charts/examples HTTP body construction. |
| `CHANGELOG.md` | Adds 0.18.5 entry; the release link in the heading incorrectly points to v0.18.4 instead of v0.18.5. |
| `pyproject.toml` | Version bumped from 0.18.4 to 0.18.5, no other changes. |
| `docs/index.rst` | Adds Evaluations V2 section to Sphinx docs with a correct end-to-end usage example. |
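
To make the "nullable fields typed as Optional" point concrete, here is a minimal stdlib sketch. `MatchExample` is a hypothetical stand-in built on `dataclasses`, not the SDK's `EvaluationV2MatchExample`, and the camelCase JSON keys are assumed. The idea is that a false-negative row presumably has no matched prediction, so `iou` and `prediction_metadata` can arrive as null.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MatchExample:
    """Hypothetical stand-in for a match-example row; not the SDK's class."""

    match_type: str
    iou: Optional[float] = None
    prediction_metadata: Optional[Dict[str, Any]] = None
    item_metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "MatchExample":
        # .get() yields None for keys that are missing or explicitly null.
        return cls(
            match_type=payload["matchType"],
            iou=payload.get("iou"),
            prediction_metadata=payload.get("predictionMetadata"),
            item_metadata=payload.get("itemMetadata"),
        )


# Illustrative payloads only.
fn_row = MatchExample.from_json({"matchType": "FN", "iou": None})
tp_row = MatchExample.from_json(
    {"matchType": "TP", "iou": 0.83, "predictionMetadata": {"score": 0.9}}
)
```

Callers then need to handle `None` before doing arithmetic on `iou`, for example when averaging IoU over a page of rows that mixes TP and FN examples.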

Sequence Diagram

sequenceDiagram
    participant U as User
    participant C as NucleusClient
    participant A as Nucleus API

    U->>C: create_evaluation_v2(model_run_id, ...)
    C->>A: "POST modelRun/{id}/evaluationsV2"
    A-->>C: "{evaluation_id: "evalv2_*"}"
    C->>A: "GET evaluationsV2/{evalv2_*}"
    A-->>C: EvaluationV2 payload
    C-->>U: EvaluationV2

    loop wait_for_completion
        U->>C: refresh()
        C->>A: "GET evaluationsV2/{id}"
        A-->>C: status
        C-->>U: updated EvaluationV2
    end

    U->>C: "charts(iou_threshold=0.5, filters=...)"
    C->>A: "GET evaluationsV2/{id}/charts?iouThreshold=0.5"
    A-->>C: EvaluationV2Charts JSON
    C-->>U: EvaluationV2Charts

    U->>C: "examples(match_type="TP", ...)"
    C->>A: "POST evaluationsV2/{id}/examples"
    A-->>C: "{rows: [...], total: N}"
    C-->>U: EvaluationV2ExamplesPage

    U->>C: delete()
    C->>A: "DELETE evaluationsV2/{id}"
    A-->>C: 204 No Content
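
The `wait_for_completion` loop in the diagram can be sketched as a plain polling helper. This is an illustration under assumptions, not the SDK's implementation: the status strings, the parameter names, and the `refresh` callable returning a status string are all hypothetical.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # hypothetical status names


def wait_for_completion(
    refresh: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call refresh() until it reports a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = refresh()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation still {status!r} after {timeout_s}s")
        sleep(poll_interval_s)


# Simulated server responses; sleep is stubbed out so the demo returns at once.
statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
final = wait_for_completion(lambda: next(statuses), sleep=lambda _: None)
```

Injecting `sleep` keeps the helper testable without real delays, which is one way a unit test could exercise polling behavior.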

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
CHANGELOG.md:8
The 0.18.5 release heading links to the `v0.18.4` tag instead of `v0.18.5`.

```suggestion
## [0.18.5](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.5) - 2026-05-28
```
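
A mismatch like this could be caught mechanically. The following is a hedged sketch of such a check, assuming changelog headings follow the `## [X.Y.Z](.../releases/tag/vX.Y.Z)` shape shown in the suggestion; the function name and the sample text are illustrative and not part of the repository.

```python
import re
from typing import List, Tuple

# Matches headings of the form: ## [0.18.5](https://.../releases/tag/v0.18.5)
HEADING = re.compile(
    r"^## \[(?P<version>\d+\.\d+\.\d+)\]"
    r"\(.*?/releases/tag/v(?P<tag>\d+\.\d+\.\d+)\)",
    re.MULTILINE,
)


def mismatched_headings(changelog: str) -> List[Tuple[str, str]]:
    """Return (version, tag) pairs where the heading version and linked tag differ."""
    return [
        (m["version"], m["tag"])
        for m in HEADING.finditer(changelog)
        if m["version"] != m["tag"]
    ]


# Illustrative input reproducing the issue flagged in this review.
sample = (
    "## [0.18.5](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.4)\n"
    "## [0.18.4](https://github.com/scaleapi/nucleus-python-client/releases/tag/v0.18.4)\n"
)
problems = mismatched_headings(sample)
```

Run against the sample, `problems` contains the single pair for the 0.18.5 heading that links to the v0.18.4 tag.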

Reviews (9): Last reviewed commit: "fix pyproject"

@luke-e-schaefer luke-e-schaefer self-assigned this May 12, 2026
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/__init__.py Outdated
luke-e-schaefer and others added 2 commits May 12, 2026 13:49
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
luke-e-schaefer and others added 3 commits May 12, 2026 14:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py

@edwinpav edwinpav left a comment

Contributor

Overall nice work!

Two main things:

  1. I'd make sure that the user-facing docs/descriptions are not overly complex. Not everyone will know or even care about how the function works behind the scenes; they just care what the params are, what the returns are, and what feature the method provides.
  2. If you want to deploy a new sdk version with these changes, two more files need to be changed and added to this pr:
    1. CHANGELOG.md should be updated. The tag link that the CHANGELOG references will be created after this pr is merged into master. You'd add a new release with a new tag here: https://github.com/scaleapi/nucleus-python-client/releases. Feel free to ping for any questions! The process isn't super clear lol

    2. The sdk version under tool.poetry should be updated in pyproject.toml
      (see #457 as a reference pr)

Comment thread nucleus/__init__.py Outdated
Comment thread nucleus/__init__.py Outdated
Comment thread nucleus/evaluation_v2.py
self.__dict__.update(updated.__dict__)
return self

def wait_for_completion(

Contributor

Is this needed because this is not integrated with NucleusJobs? I thought this type of functionality comes built in for the other async functions (dedup async also uses temporal)

Contributor Author

correct yeah I don't have any ties back to the nuc jobs currently (since this stuff isn't "technically" in nucleus)...I could set that up tho that would be simple

Contributor

oh i see, ig if it's in the nucleus sdk might be worth doing that if it's simple. if it shows up on the nucleus jobs page ui that's probably fine but that's probably a call you have more context on to make

Contributor Author

yeah i think that's fine too. I'll run that in its own PR set tho after this one (i'll have to update scaleapi too)

Comment thread docs/index.rst Outdated
Comment thread nucleus/evaluation_v2.py Outdated
Comment thread nucleus/data_transfer_object/evaluation_v2.py
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/evaluation_v2.py Outdated
Comment thread tests/test_evaluation_v2.py
Comment thread nucleus/data_transfer_object/evaluation_v2.py
@luke-e-schaefer luke-e-schaefer requested a review from edwinpav May 28, 2026 22:55

@edwinpav edwinpav left a comment

Contributor

Everything looks good, just one typo I saw and pyproject.toml still needs an update. After that should be good to go!

Comment thread CHANGELOG.md Outdated
@luke-e-schaefer luke-e-schaefer requested a review from edwinpav June 1, 2026 20:07
@luke-e-schaefer luke-e-schaefer merged commit 9bda4ae into master Jun 10, 2026
9 checks passed
@luke-e-schaefer luke-e-schaefer deleted the add-eval-capabilities branch June 10, 2026 15:27

2 participants