fix: make Document.id deterministic regardless of meta key order#11446
fix: make Document.id deterministic regardless of meta key order#11446Aarkin7 wants to merge 3 commits into
Conversation
The hash was built from dict's repr, which reflects insertion order, so two Documents with equal meta could get different IDs. Serialize meta with sort_keys=True before hashing. Empty-meta IDs are unchanged.
|
@Aarkin7 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
Two BDD scenarios pinned IDs that were computed from documents with non-empty meta, so the deterministic-id fix changes them. Recompute and update the expected values; no behavior change.
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
biswajeetdev
left a comment
There was a problem hiding this comment.
The core fix is right — str(dict) includes Python insertion order in its output, so two Document objects with identical content and logically identical meta dicts could get different IDs just because the dicts were constructed differently. json.dumps(..., sort_keys=True) produces a canonical string that is order-independent.
One thing worth calling out in the release note: the default=str argument also affects documents with non-JSON-serialisable meta values (e.g. datetime objects, custom classes). Previously str({"date": datetime(2024,1,1)}) produced "datetime.datetime(2024, 1, 1, 0, 0)" in the hash; now json.dumps(..., default=str) produces "2024-01-01 00:00:00". These are different strings, so any documents with non-serialisable meta fields will also get new IDs — not just documents where key order varied. The release note currently says only non-empty-meta documents are affected, which is true but understates the impact for users with datetime or custom-type meta.
The if self.meta else "{}" guard correctly matches str({}) == "{}" for empty-dict meta, so the "empty meta is unaffected" claim holds.
|
Thanks for catching this. You're right that the impact wording was too narrow. Updated the upgrade note to spell out both cases explicitly:
|
Related Issues
Proposed Changes:
Fixes a bug where Document.id depended on the insertion order of keys in meta. The hash was built from dict's repr, which reflects insertion order, so two Documents with the same content and the same metadata could end up with different IDs, silently breaking DuplicatePolicy.SKIP/FAIL and any cache keyed on the document ID.
Serializing meta with json.dumps(..., sort_keys=True) before hashing makes the ID order-independent. Empty-meta IDs are
preserved by keeping the legacy "{}" string, so only documents with non-empty meta get new IDs.
How did you test it?
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.