Skip to content

Add COCO annotation import/export support#679

Open
deanp70 wants to merge 4 commits intomainfrom
coco_converter
Open

Add COCO annotation import/export support#679
deanp70 wants to merge 4 commits intomainfrom
coco_converter

Conversation

@deanp70
Copy link
Copy Markdown
Member

@deanp70 deanp70 commented Mar 25, 2026

Based on changes to dagshub-annotation-converter to the same effect: DagsHub/dagshub-annotation-converter#26

Implementation Details

The PR extends annotation support through a multi-layered approach:

Importer Module Enhancements (importer.py):

  • Extended AnnotationType literal to include "coco",

Metadata Annotations Support (metadata.py):

  • Introduced add_coco_annotation method to load COCO-format JSON and integrate annotations into metadata

Query Result Export API (query_result.py):

  • One new export methods:
    • export_as_coco() - generates COCO JSON with optional class mapping and custom output filename
  • Introduced _resolve_annotation_field to auto-select available annotation field with alphabetical fallback
  • Refactored YOLO export to use the new annotation field resolution

Test Coverage:

  • test_coco.py (218 lines): Import/export flows, filename rewriting, field resolution, class mapping, output customization

Code Quality and Testing Infrastructure:

  • Enhanced ruff.toml linting configuration with comprehensive rule set including:
    • Security checks (flake8-bandit)
    • Bug detection (flake8-bugbear)
    • Try/except pattern validation (tryceratops)
    • Unused argument detection and other design pattern checks
    • Per-file exceptions for __init__.py and conftest.py
  • Added DatasourceType.REPOSITORY assignment in test fixtures (tests/data_engine/conftest.py)
  • Added id property to MockRepoAPI test utility
  • Minor refactor in token_auth.py (no behavioral change - logical equivalence in boolean expression)

Documentation Note: Per PR comments, the supported formats list at dagshub.com/docs/use_cases/annotation/#supported-formats-for-annotation_imports requires manual update to reflect COCO BBOX, and COCO Segmentation support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@deanp70 deanp70 requested a review from kbolashev March 25, 2026 11:41
@deanp70 deanp70 self-assigned this Mar 25, 2026
@deanp70 deanp70 added the enhancement New feature or request label Mar 25, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d3bc5b58-e6d3-4ff9-bb87-9545e4d77772

📥 Commits

Reviewing files that changed from the base of the PR and between e68048a and 4f830e4.

📒 Files selected for processing (4)
  • dagshub/data_engine/annotation/importer.py
  • dagshub/data_engine/model/query_result.py
  • tests/data_engine/annotation_import/test_coco.py
  • tests/data_engine/conftest.py
✅ Files skipped from review due to trivial changes (1)
  • tests/data_engine/annotation_import/test_coco.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dagshub/data_engine/model/query_result.py
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.13)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (6)
tests/data_engine/conftest.py (2)

2-3: Good fixture import additions for new datasource stubbing.

These imports are purposeful and correctly support the new repository/source-prefix mocking behavior in this fixture.

Also applies to: 10-10


31-31: Repository-mode fixture setup is consistent with downstream path logic.

Setting source_type to DatasourceType.REPOSITORY and stubbing source_prefix to PurePosixPath() matches the expected non-bucket branch and keeps test path composition deterministic.

Also applies to: 39-39

dagshub/data_engine/annotation/importer.py (4)

6-6: LGTM!

Import follows the established pattern for other annotation converters.


20-20: LGTM!

Type extension is consistent with the existing annotation types.


110-112: Potential path mismatch when annotations file is in a subdirectory.

With keep_source_prefix=True, the file preserves its directory structure in the download destination (e.g., tmp_dir/subdir/annotations.json). However, line 78 uses annotations_file.name which only takes the filename:

annotations_file = tmp_dir_path / annotations_file.name  # tmp_dir/annotations.json

This mismatch would cause a FileNotFoundError when the annotations file is not at the repository root. The same issue exists in the CVAT branch (line 99).

Consider using the full path:

🐛 Proposed fix
 # In import_annotations(), around line 78:
 if self.load_from == "repo":
     self.download_annotations(tmp_dir_path)
-    annotations_file = tmp_dir_path / annotations_file.name
+    annotations_file = tmp_dir_path / self.annotations_file

[raise_major_issue, request_verification]

#!/bin/bash
# Check if there are tests covering subdirectory annotation paths
rg -n "annotations_file.*/" --type py -g '*test*'
# Check how CVAT/YOLO handle this in existing tests
rg -n -B2 -A2 "load_from.*repo" --type py -g '*test*'

89-90: No issues found. The tuple unpacking pattern annotation_dict, _ = load_coco_from_file(annotations_file) is correct and consistent with the YOLO branch. Tests confirm the function returns a tuple as expected, with the second value containing metadata (similar to YOLO's pattern).


📝 Walkthrough

Walkthrough

The pull request adds COCO annotation import and export support to the DagsHub data engine. Changes include updating the annotation importer to handle COCO format, adding a new export_as_coco() method to QueryResult, refactoring annotation field resolution logic, updating test fixtures and mocks, adding comprehensive test coverage, and switching the dagshub-annotation-converter dependency to a VCS-based build from the coco_converter branch.

Changes

Cohort / File(s) Summary
COCO Import Support
dagshub/data_engine/annotation/importer.py
Extended AnnotationType to include "coco". Added COCO import branch via load_coco_from_file(). Introduced repository download branch for COCO annotation files. Updated filename remapping logic to conditionally handle cases where filename may be None.
COCO Export Support
dagshub/data_engine/model/query_result.py
Introduced export_as_coco(...) method to export annotations in COCO JSON format with image downloading and optional category mapping. Added _resolve_annotation_field(...) private helper to centralize annotation field selection. Refactored export_as_yolo(...) to use the new resolution helper.
COCO Feature Tests
tests/data_engine/annotation_import/test_coco.py
New test module with end-to-end coverage for COCO import/export: validates annotation loading from disk, _resolve_annotation_field() behavior across multiple scenarios, bbox coordinate conversion, category mapping, and aggregation of images/annotations across datapoints.
Test Infrastructure Updates
tests/data_engine/conftest.py, tests/mocks/repo_api.py
Updated datasource mock to explicitly set source_type to DatasourceType.REPOSITORY and stub source_prefix property. Added id property to MockRepoAPI returning constant value 1.
Dependency Management
setup.py
Switched dagshub-annotation-converter from pinned minimum version (>=0.1.12) to VCS-based installation referencing the coco_converter branch of the DagsHub annotation converter repository.

Sequence Diagram

sequenceDiagram
    participant User
    participant QueryResult
    participant FieldResolver
    participant AnnotationCollector
    participant ImageDownloader
    participant COCOExporter
    participant FileSystem

    User->>QueryResult: export_as_coco(download_dir, annotation_field, ...)
    QueryResult->>FieldResolver: _resolve_annotation_field(annotation_field)
    FieldResolver-->>QueryResult: resolved_field_name
    
    QueryResult->>AnnotationCollector: Collect all annotations from resolved field
    AnnotationCollector-->>QueryResult: annotations list
    
    alt annotations exist
        QueryResult->>ImageDownloader: Download images to download_dir/data
        ImageDownloader-->>QueryResult: Downloaded image paths
        
        QueryResult->>COCOExporter: export_to_coco_file(annotations, images, categories)
        COCOExporter->>FileSystem: Write annotations.json
        FileSystem-->>COCOExporter: Write confirmation
        COCOExporter-->>QueryResult: Output path
        QueryResult-->>User: Return Path to COCO annotations
    else no annotations
        QueryResult-->>User: Raise RuntimeError
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hop, hop, the COCO format's here!
Import and export, crystal clear,
Bboxes dancing, images aligned,
A resolver makes annotation fields refined,
Tests in place, the warren's glad! 🥕✨

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding COCO annotation import/export support, which is the primary focus of the changeset.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, providing implementation details, test coverage, and tooling enhancements that align with the changes shown in the raw summary.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch coco_converter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dagshub/data_engine/annotation/importer.py (1)

83-92: ⚠️ Potential issue | 🟡 Minor

Add else clause to handle unsupported annotation types.

If annotations_type is not one of the supported values, annotation_dict remains unbound, causing an UnboundLocalError on line 92.

Proposed fix
             elif self.annotations_type == "coco":
                 annotation_dict, _ = load_coco_from_file(annotations_file)
+            else:
+                raise ValueError(f"Unsupported annotation type: {self.annotations_type}")

             return annotation_dict
♻️ Duplicate comments (1)
dagshub/__init__.py (1)

1-1: ⚠️ Potential issue | 🔴 Critical

Version changes should not be included in PRs.

The version bump has been flagged previously as a blocker issue. Version changes should be handled separately from feature PRs, typically during the release process.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13ca3252-8076-4a28-a78a-8200f0235dc6

📥 Commits

Reviewing files that changed from the base of the PR and between af5fe5d and e68048a.

📒 Files selected for processing (9)
  • dagshub/__init__.py
  • dagshub/auth/token_auth.py
  • dagshub/data_engine/annotation/importer.py
  • dagshub/data_engine/annotation/metadata.py
  • dagshub/data_engine/model/query_result.py
  • setup.py
  • tests/data_engine/annotation_import/test_coco.py
  • tests/data_engine/conftest.py
  • tests/mocks/repo_api.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.13)
  • GitHub Check: build (3.12)
🔇 Additional comments (11)
dagshub/auth/token_auth.py (1)

40-40: Readability improvement with no behavior change.

Line 40 is an equivalent condition and reads more idiomatically in Python. No concerns here.

dagshub/data_engine/annotation/metadata.py (2)

284-284: Move import to module level.

The import should be at the top of the file with other imports, not inside the function.

Proposed fix

Add to imports at top of file:

from dagshub_annotation_converter.converters.coco import load_coco_from_json_string

Then remove line 284.


274-294: LGTM on method implementation.

The method correctly:

  1. Parses COCO JSON via the converter
  2. Rewrites all annotation filenames to the datapoint's path (expected behavior when adding annotations to a specific datapoint)
  3. Extends the existing annotations list
  4. Persists via _update_datapoint()

The filename rewriting is intentional here since you're associating external annotations with a specific datapoint.

tests/data_engine/annotation_import/test_coco.py (2)

20-23: Consider moving mock to shared fixture.

This mock_source_prefix fixture should be integrated into _create_mock_datasource() in conftest.py so it works for all datasource fixture users, not just this test module.


1-218: Good test coverage for COCO functionality.

The tests comprehensively cover:

  • Import from disk and error handling for missing files
  • Label Studio task conversion
  • add_coco_annotation filename rewriting
  • _resolve_annotation_field edge cases (explicit, auto, no fields, alphabetical ordering)
  • Export with various configurations (bbox coordinates, empty annotations error, custom classes, custom filename, multiple datapoints)

The helper functions (_make_coco_json, _write_coco, _make_image_bbox, _make_qr) are well-structured and reusable.

tests/mocks/repo_api.py (1)

116-118: LGTM!

The mock id property correctly mirrors the real RepoAPI.id signature with a constant return value appropriate for test scenarios.

tests/data_engine/conftest.py (1)

8-8: LGTM!

Explicitly setting source_type = DatasourceType.REPOSITORY ensures the mocked datasource state correctly exercises REPOSITORY-specific code paths (e.g., path_parts() revision extraction), which is necessary for the new COCO annotation tests.

Also applies to: 29-29

dagshub/data_engine/model/query_result.py (3)

783-791: LGTM!

Good refactoring to centralize annotation field resolution. The alphabetical sorting provides deterministic selection when multiple annotation fields exist, and the error message is clear when no fields are found.


871-917: LGTM on overall export_as_coco structure.

The method follows the established pattern from export_as_yolo: resolves annotation field, collects annotations, prefixes filenames with source prefix, downloads images, and exports to the target format. Error handling for empty annotations is appropriate.


901-903: No changes required. The code correctly assigns dict(classes) to context.categories, which matches the documented COCO format specification of {id: name}. The imported Categories class is from the YOLO module and should not be used with CocoContext, as it is format-specific to YOLO, not COCO.

			> Likely an incorrect or invalid review comment.
dagshub/data_engine/annotation/importer.py (1)

160-163: Defensive handling for None filenames.

This change prevents TypeError when remap_func is called with None. The fallback to new_filename is reasonable since the annotation belongs to the image keyed by that filename in the dictionary.

However, a None filename from the converter could indicate a bug upstream. Consider logging when this fallback is triggered to aid debugging.

Comment on lines +45 to +49
# FIXME: roll back to main after merging
# "dagshub-annotation-converter>=0.1.12",
"dagshub-annotation-converter @ "
+ "git+https://github.com/DagsHub/"
+ "dagshub-annotation-converter@coco_converter#egg=dagshub-annotation-converter",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

VCS branch dependency is not suitable for release.

Pinning to a Git branch (coco_converter) in install_requires will break reproducibility and may fail for end users if the branch is rebased, force-pushed, or deleted. This dependency pattern is acceptable for local development but should not be merged to main.

Ensure the upstream dagshub-annotation-converter PR is merged and a versioned release is published before merging this PR, then revert to a version-pinned dependency (e.g., dagshub-annotation-converter>=0.1.13).

@deanp70 deanp70 requested a review from kbolashev March 29, 2026 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants