feat: expose layout confidence metadata #4356

Open
RitwijParmar wants to merge 2 commits into Unstructured-IO:main from RitwijParmar:codex/unstructured-layout-confidence-metadata

Conversation

@RitwijParmar RitwijParmar commented May 26, 2026

Summary

  • Add ElementMetadata.confidence_score and ElementMetadata.extraction_method as serialized, downstream-facing layout metadata fields.
  • Populate confidence_score from layout prob while keeping the existing detection_class_prob field for compatibility.
  • Preserve prob=0.0 instead of dropping it through a truthiness check, which keeps the most important low-confidence signal available to RAG/review pipelines.
  • Include the new fields in metadata consolidation/dtype handling and round confidence_score in staging output like detection_class_prob.
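
The difference between a truthiness check and an explicit `None` check can be sketched as follows. This is a minimal illustration, not the PR's actual code; both function names are hypothetical.

```python
from typing import Optional


def confidence_truthy(prob: Optional[float]) -> Optional[float]:
    # Hypothetical "before" pattern: 0.0 is falsy, so a zero
    # probability is silently dropped and looks like "no score".
    return float(prob) if prob else None


def confidence_explicit(prob: Optional[float]) -> Optional[float]:
    # Hypothetical "after" pattern: only a missing value maps to None,
    # so 0.0 survives as a real (very low) confidence value.
    return float(prob) if prob is not None else None
```

With the explicit check, a caller can tell "the model was maximally unsure" apart from "no probability was reported".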

Why

Issue #4320 asks for element-level confidence metadata so callers can filter uncertain extractions, route low-confidence elements to review, and weight downstream retrieval. Unstructured already carries model probabilities internally as detection_class_prob; this PR exposes the same signal under a clearer confidence_score field and adds the extraction source/method alongside it.

This is a scoped first step: it surfaces confidence for layout elements that already provide prob; it does not attempt to calibrate confidence across every partitioner/backend.
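
A downstream consumer might use the field roughly like this. The sketch assumes elements serialized as plain dicts with a `metadata` mapping; the function name, the three-way split, and the `0.5` threshold are illustrative assumptions, not part of this PR.

```python
from typing import Any, Dict, List, Tuple

Element = Dict[str, Any]


def route_by_confidence(
    elements: List[Element], threshold: float = 0.5
) -> Tuple[List[Element], List[Element], List[Element]]:
    """Split elements into (accepted, review, unscored) lists."""
    accepted: List[Element] = []
    review: List[Element] = []
    unscored: List[Element] = []
    for element in elements:
        score = element.get("metadata", {}).get("confidence_score")
        if score is None:
            # Not every partitioner/backend reports a probability.
            unscored.append(element)
        elif score < threshold:
            # Includes 0.0, which must not be mistaken for "missing".
            review.append(element)
        else:
            accepted.append(element)
    return accepted, review, unscored
```

Keeping unscored elements in their own bucket avoids treating a missing score as either high or low confidence.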

Testing

  • .venv/bin/python -m pytest test_unstructured/partition/common/test_common.py::test_normalize_layout_element_preserves_layout_confidence_metadata test_unstructured/partition/common/test_common.py::test_normalize_layout_element_layout_element_text_source_metadata test_unstructured/staging/test_base.py::test_default_pandas_dtypes test_unstructured/documents/test_elements.py::DescribeElementMetadata::it_can_find_the_consolidation_strategy_for_each_of_its_known_fields
  • python3 -m ruff check unstructured/documents/elements.py unstructured/partition/common/common.py unstructured/staging/base.py test_unstructured/partition/common/test_common.py test_unstructured/staging/test_base.py

Summary by cubic

Expose element-level confidence metadata by adding confidence_score and extraction_method to ElementMetadata. These come from layout prob and source, preserve 0.0, ignore empty strings, and flow through serialization, staging, and pandas exports.

  • New Features
    • Populate confidence_score from layout prob; keep detection_class_prob for compatibility.
    • Preserve prob=0.0; ignore empty string values.
    • Serialize extraction_method from layout source (handles enums/strings) and mirror it to detection_origin.
    • Round confidence_score in staging like detection_class_prob.
    • Add both fields to consolidation strategies and default pandas dtypes.
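
Handling a source that may be either an enum member or a plain string could look like the sketch below. The `Source` enum and the helper name are hypothetical stand-ins, not the library's own types.

```python
from enum import Enum
from typing import Optional, Union


class Source(Enum):
    # Hypothetical stand-in for a layout source enum.
    YOLOX = "yolox"
    OCR = "ocr"


def extraction_method_from_source(
    source: Union[Enum, str, None],
) -> Optional[str]:
    if isinstance(source, Enum):
        # Serialize the enum by its value so the output is JSON-friendly.
        return str(source.value)
    if isinstance(source, str) and source:
        # Non-empty strings pass through unchanged.
        return source
    # None and empty strings carry no usable method name.
    return None
```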

Written for commit 1bb2ee3. Summary will update on new commits.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 5 files

Shadow auto-approve: would not auto-approve because issues were found.

Comment thread unstructured/partition/common/common.py Outdated
@RitwijParmar

Author

Addressed the issue identified by cubic in 1bb2ee3: empty string layout confidence values are ignored before float conversion, while 0.0 is still preserved. Added regression coverage for empty prob metadata.
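
The behavior described in this comment can be sketched as a small parsing helper. The function name is hypothetical and the real change in `common.py` may be structured differently.

```python
from typing import Optional, Union


def parse_layout_prob(prob: Union[float, str, None]) -> Optional[float]:
    if prob is None:
        return None
    if isinstance(prob, str) and not prob.strip():
        # float("") raises ValueError, so blank strings are treated
        # as "no confidence reported" before conversion.
        return None
    # 0.0 and "0.0" both reach this line and are preserved.
    return float(prob)
```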

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. This PR modifies core metadata fields and serialization logic, altering how confidence scores are populated and potentially affecting downstream consumers; such changes to central data structures require human review.
