feat: expose layout confidence metadata by RitwijParmar · Pull Request #4356 · Unstructured-IO/unstructured

RitwijParmar · 2026-05-26T18:32:31Z

Summary

- Add ElementMetadata.confidence_score and ElementMetadata.extraction_method as serialized, downstream-facing layout metadata fields.
- Populate confidence_score from layout prob while keeping the existing detection_class_prob field for compatibility.
- Preserve prob=0.0 instead of dropping it through a truthiness check, which keeps the most important low-confidence signal available to RAG/review pipelines.
- Include the new fields in metadata consolidation/dtype handling and round confidence_score in staging output like detection_class_prob.
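
To illustrate why the truthiness check matters, here is a minimal sketch. The function names are hypothetical and are not taken from the PR diff; they only contrast the two patterns described in the summary.

```python
from typing import Optional


def confidence_via_truthiness(prob: Optional[float]) -> Optional[float]:
    # Problematic pattern: 0.0 is falsy in Python, so a valid
    # zero-confidence score is silently turned into None.
    return float(prob) if prob else None


def confidence_via_none_check(prob: Optional[float]) -> Optional[float]:
    # Explicit None check: only a missing value becomes None,
    # while 0.0 survives as a real (very low) confidence score.
    return float(prob) if prob is not None else None


print(confidence_via_truthiness(0.0))   # None
print(confidence_via_none_check(0.0))   # 0.0
print(confidence_via_none_check(None))  # None
```

The explicit check keeps "the model reported zero confidence" distinct from "no confidence was reported", which is the distinction a review pipeline needs.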

Why

Issue #4320 asks for element-level confidence metadata so callers can filter uncertain extractions, route low-confidence elements to review, and weight downstream retrieval. Unstructured already carries model probabilities internally as detection_class_prob; this PR exposes the same signal under a clearer confidence_score field and adds the extraction source/method alongside it.
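
As a rough sketch of the downstream use case described above, the snippet below routes serialized elements by confidence_score. It works on plain dictionaries shaped like serialized elements; the threshold value and the function name are assumptions made for illustration, not part of this PR.

```python
from typing import Any

# Hypothetical cutoff; a real pipeline would tune this per model and document type.
REVIEW_THRESHOLD = 0.5


def route_by_confidence(
    elements: list[dict[str, Any]], threshold: float = REVIEW_THRESHOLD
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split elements into (accepted, needs_review) lists."""
    accepted: list[dict[str, Any]] = []
    needs_review: list[dict[str, Any]] = []
    for element in elements:
        score = element.get("metadata", {}).get("confidence_score")
        # Elements without a score are accepted here, since not every
        # partitioner provides one. A score of 0.0 is a real value and
        # is routed to review like any other low score.
        if score is not None and score < threshold:
            needs_review.append(element)
        else:
            accepted.append(element)
    return accepted, needs_review


elements = [
    {"text": "Title", "metadata": {"confidence_score": 0.93}},
    {"text": "Blurry cell", "metadata": {"confidence_score": 0.0}},
    {"text": "Plain text", "metadata": {}},
]
accepted, needs_review = route_by_confidence(elements)
print([e["text"] for e in accepted])      # ['Title', 'Plain text']
print([e["text"] for e in needs_review])  # ['Blurry cell']
```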

This is a scoped first step: it surfaces confidence for layout elements that already provide prob; it does not attempt to calibrate confidence across every partitioner/backend.

Testing

.venv/bin/python -m pytest test_unstructured/partition/common/test_common.py::test_normalize_layout_element_preserves_layout_confidence_metadata test_unstructured/partition/common/test_common.py::test_normalize_layout_element_layout_element_text_source_metadata test_unstructured/staging/test_base.py::test_default_pandas_dtypes test_unstructured/documents/test_elements.py::DescribeElementMetadata::it_can_find_the_consolidation_strategy_for_each_of_its_known_fields
python3 -m ruff check unstructured/documents/elements.py unstructured/partition/common/common.py unstructured/staging/base.py test_unstructured/partition/common/test_common.py test_unstructured/staging/test_base.py

Summary by cubic

Expose element-level confidence metadata by adding confidence_score and extraction_method to ElementMetadata. These come from layout prob and source, preserve 0.0, ignore empty strings, and flow through serialization, staging, and pandas exports.

New Features
- Populate confidence_score from layout prob; keep detection_class_prob for compatibility.
- Preserve prob=0.0; ignore empty string values.
- Serialize extraction_method from layout source (handles enums/strings) and mirror it to detection_origin.
- Round confidence_score in staging like detection_class_prob.
- Add both fields to consolidation strategies and default pandas dtypes.
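
The enum/string handling and the staging-time rounding listed above could look roughly like the following. This is a sketch under stated assumptions: the Source enum is a hypothetical stand-in for the layout source type, treating an empty source string as missing is an assumption, and the five-digit rounding precision is also an assumption rather than a value confirmed by this PR.

```python
from enum import Enum
from typing import Optional, Union


class Source(Enum):
    # Hypothetical stand-in for the layout source enum.
    YOLOX = "yolox"


def normalize_extraction_method(source: Union[Enum, str, None]) -> Optional[str]:
    # Accept either an enum member or a plain string and return a
    # serializable string. Missing or empty sources become None (assumption).
    if source is None:
        return None
    value = source.value if isinstance(source, Enum) else source
    text = str(value)
    return text or None


def round_confidence(score: Optional[float], digits: int = 5) -> Optional[float]:
    # Round for staging output; `digits=5` is an assumed precision.
    # None stays None, and 0.0 stays 0.0.
    return None if score is None else round(score, digits)


print(normalize_extraction_method(Source.YOLOX))  # yolox
print(normalize_extraction_method("yolox"))       # yolox
print(round_confidence(0.123456789))              # 0.12346
```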

Written for commit 1bb2ee3. Summary will update on new commits.

cubic-dev-ai

1 issue found across 5 files

Shadow auto-approve: would not auto-approve because issues were found.

RitwijParmar · 2026-05-26T20:47:00Z

Addressed the issue identified by cubic in 1bb2ee3: empty string layout confidence values are ignored before float conversion, while 0.0 is still preserved. Added regression coverage for empty prob metadata.
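
A minimal sketch of the behavior described in this comment, assuming prob may arrive as a float, a numeric string, an empty string, or None. The function name is hypothetical and the actual change in common.py may be structured differently.

```python
from typing import Optional, Union


def parse_layout_confidence(prob: Union[float, int, str, None]) -> Optional[float]:
    # Missing values carry no confidence information.
    if prob is None:
        return None
    # An empty (or whitespace-only) string must be filtered out before
    # conversion, because float("") raises ValueError.
    if isinstance(prob, str) and not prob.strip():
        return None
    # Everything else is converted, so 0.0 is preserved as 0.0.
    return float(prob)


print(parse_layout_confidence(""))      # None
print(parse_layout_confidence(0.0))     # 0.0
print(parse_layout_confidence("0.25"))  # 0.25
```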

cubic-dev-ai

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. This PR modifies core metadata fields and serialization logic, altering how confidence scores are populated and potentially affecting downstream consumers; such changes to central data structures require human review.

feat: expose layout confidence metadata

9b7226d

cubic-dev-ai Bot reviewed May 26, 2026

Comment thread unstructured/partition/common/common.py Outdated

Ignore empty layout confidence values

1bb2ee3

cubic-dev-ai Bot reviewed May 26, 2026

RitwijParmar wants to merge 2 commits into Unstructured-IO:main from RitwijParmar:codex/unstructured-layout-confidence-metadata
