feat: expose layout confidence metadata #4356
1 issue found across 5 files
Shadow auto-approve: would not auto-approve because issues were found.
Addressed the issue identified by cubic in 1bb2ee3: empty-string layout confidence values are ignored before float conversion, while `0.0` is still preserved. Added regression coverage for empty `prob` metadata.
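For readers following the fix, the behavior described above can be sketched roughly as below. This is a minimal illustration, not the code from the commit: the helper name `coerce_confidence` is hypothetical, and it assumes the layout `prob` arrives as a float, a numeric string, an empty string, or `None`.

```python
from typing import Any, Optional


def coerce_confidence(prob: Any) -> Optional[float]:
    """Return prob as a float, or None when no usable value is present.

    Hypothetical helper for illustration only.
    """
    # Check for "missing" explicitly instead of using `if prob:`,
    # because a truthiness check would also discard a legitimate 0.0.
    if prob is None:
        return None
    # An empty string carries no score, and float("") would raise ValueError.
    if isinstance(prob, str) and prob == "":
        return None
    return float(prob)
```

The point of the explicit checks is that `0.0` means "the model is maximally unsure", which is different from "no score was reported".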
0 issues found across 2 files (changes from recent commits).
Shadow auto-approve: would require human review. This PR modifies core metadata fields and serialization logic, altering how confidence scores are populated and potentially affecting downstream consumers; such changes to central data structures require human review.
Summary
- Add `ElementMetadata.confidence_score` and `ElementMetadata.extraction_method` as serialized, downstream-facing layout metadata fields.
- Populate `confidence_score` from layout `prob` while keeping the existing `detection_class_prob` field for compatibility.
- Preserve `prob=0.0` instead of dropping it through a truthiness check, which keeps the most important low-confidence signal available to RAG/review pipelines.
- Treat `confidence_score` in staging output like `detection_class_prob`.

Why
Issue #4320 asks for element-level confidence metadata so callers can filter uncertain extractions, route low-confidence elements to review, and weight downstream retrieval. Unstructured already carries model probabilities internally as `detection_class_prob`; this PR exposes the same signal under a clearer `confidence_score` field and adds the extraction source/method alongside it.

This is a scoped first step: it surfaces confidence for layout elements that already provide `prob`; it does not attempt to calibrate confidence across every partitioner/backend.

Testing
- `.venv/bin/python -m pytest test_unstructured/partition/common/test_common.py::test_normalize_layout_element_preserves_layout_confidence_metadata test_unstructured/partition/common/test_common.py::test_normalize_layout_element_layout_element_text_source_metadata test_unstructured/staging/test_base.py::test_default_pandas_dtypes test_unstructured/documents/test_elements.py::DescribeElementMetadata::it_can_find_the_consolidation_strategy_for_each_of_its_known_fields`
- `python3 -m ruff check unstructured/documents/elements.py unstructured/partition/common/common.py unstructured/staging/base.py test_unstructured/partition/common/test_common.py test_unstructured/staging/test_base.py`

Summary by cubic
Expose element-level confidence metadata by adding `confidence_score` and `extraction_method` to `ElementMetadata`. These come from layout `prob` and `source`, preserve 0.0, ignore empty strings, and flow through serialization, staging, and pandas exports.
- Populate `confidence_score` from layout `prob`; keep `detection_class_prob` for compatibility.
- Preserve `prob=0.0`; ignore empty string values.
- Set `extraction_method` from layout `source` (handles enums/strings) and mirror it to `detection_origin`.
- Treat `confidence_score` in staging like `detection_class_prob`.

Written for commit 1bb2ee3. Summary will update on new commits.
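To illustrate how a caller might use the new fields, here is a rough sketch of routing low-confidence elements to review. It works on plain dicts and assumes serialized elements carry `confidence_score` and `extraction_method` under a `metadata` key; that shape, the function name `split_by_confidence`, the `0.5` threshold, and the sample data are all assumptions made for illustration, not part of this PR.

```python
from typing import Any, Dict, List, Tuple

Element = Dict[str, Any]


def split_by_confidence(
    elements: List[Element], threshold: float = 0.5
) -> Tuple[List[Element], List[Element]]:
    """Split elements into (accepted, needs_review) by confidence_score."""
    accepted: List[Element] = []
    needs_review: List[Element] = []
    for element in elements:
        score = element.get("metadata", {}).get("confidence_score")
        # None means no score was reported; this sketch accepts those.
        # A score of 0.0 is a real value and falls below the threshold.
        if score is not None and score < threshold:
            needs_review.append(element)
        else:
            accepted.append(element)
    return accepted, needs_review


# Made-up sample data in the assumed serialized shape.
sample = [
    {"text": "Quarterly results", "metadata": {"confidence_score": 0.93, "extraction_method": "yolox"}},
    {"text": "Blurry caption", "metadata": {"confidence_score": 0.0, "extraction_method": "yolox"}},
    {"text": "Plain paragraph", "metadata": {}},
]
accepted, needs_review = split_by_confidence(sample)
```

Whether elements without a score should be accepted or reviewed is a policy choice for the caller. The sketch only shows why keeping `0.0` distinct from a missing value matters.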