Skip to content

fix: make Document.id deterministic regardless of meta key order#11446

Open
Aarkin7 wants to merge 3 commits into
deepset-ai:mainfrom
Aarkin7:fix/document-id-deterministic-hashing
Open

fix: make Document.id deterministic regardless of meta key order#11446
Aarkin7 wants to merge 3 commits into
deepset-ai:mainfrom
Aarkin7:fix/document-id-deterministic-hashing

Conversation

@Aarkin7
Copy link
Copy Markdown
Contributor

@Aarkin7 Aarkin7 commented May 31, 2026

Related Issues

Proposed Changes:

Fixes a bug where Document.id depended on the insertion order of keys in meta. The hash was built from dict's repr, which reflects insertion order, so two Documents with the same content and the same metadata could end up with different IDs, silently breaking DuplicatePolicy.SKIP/FAIL and any cache keyed on the document ID.

Serializing meta with json.dumps(..., sort_keys=True) before hashing makes the ID order-independent. Empty-meta IDs are
preserved by keeping the legacy "{}" string, so only documents with non-empty meta get new IDs.

How did you test it?

  • Added two regression tests in test/dataclasses/test_document.py covering flat and nested meta key orderings.
  • Updated two stale hardcoded IDs in existing tests (test_init_with_parameters, test_init_with_legacy_field) since the hash changes for non-empty meta.
  • hatch run test:unit → 5135 passed, 0 failed.
  • hatch run test:types → clean.
  • hatch run fmt → clean.

Notes for the reviewer

  • Breaking-ish: any auto-generated Document.id for documents with non-empty meta changes. The release note's upgrade section calls this out and tells users how to handle it (re-ingest, or pass id explicitly). Empty-meta IDs are intentionally preserved to minimize churn.
  • Scope was kept tight: the bug report mentions embedding and sparse_embedding hashing as related concerns. SparseEmbedding.to_dict() is order-stable today via dataclass field order, and float-embedding nondeterminism isn't fixable at this layer, so this PR sticks to the actual reported root cause.
  • default=str in json.dumps is a safety net for the (unsupported but tolerated) case of non-JSON-serializable values in meta, matching the prior behavior of not crashing on such inputs.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

The hash was built from dict's repr, which reflects insertion order, so two Documents with equal meta could get different IDs. Serialize meta with sort_keys=True before hashing. Empty-meta IDs are unchanged.
@Aarkin7 Aarkin7 requested a review from a team as a code owner May 31, 2026 18:02
@Aarkin7 Aarkin7 requested review from julian-risch and removed request for a team May 31, 2026 18:02
@vercel
Copy link
Copy Markdown

vercel Bot commented May 31, 2026

@Aarkin7 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

Two BDD scenarios pinned IDs that were computed from documents with non-empty meta, so the deterministic-id fix changes them. Recompute and update the expected values; no behavior change.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 31, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  haystack/dataclasses
  document.py
Project Total  

This report was generated by python-coverage-comment-action

Copy link
Copy Markdown

@biswajeetdev biswajeetdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core fix is right — str(dict) includes Python insertion order in its output, so two Document objects with identical content and logically identical meta dicts could get different IDs just because the dicts were constructed differently. json.dumps(..., sort_keys=True) produces a canonical string that is order-independent.

One thing worth calling out in the release note: the default=str argument also affects documents with non-JSON-serialisable meta values (e.g. datetime objects, custom classes). Previously str({"date": datetime(2024,1,1)}) produced "datetime.datetime(2024, 1, 1, 0, 0)" in the hash; now json.dumps(..., default=str) produces "2024-01-01 00:00:00". These are different strings, so any documents with non-serialisable meta fields will also get new IDs — not just documents where key order varied. The release note currently says only non-empty-meta documents are affected, which is true but understates the impact for users with datetime or custom-type meta.

The if self.meta else "{}" guard correctly matches str({}) == "{}" for empty-dict meta, so the "empty meta is unaffected" claim holds.

@github-actions github-actions Bot added the type:documentation Improvements on the docs label Jun 2, 2026
@Aarkin7
Copy link
Copy Markdown
Contributor Author

Aarkin7 commented Jun 2, 2026

Thanks for catching this. You're right that the impact wording was too narrow. Updated the upgrade note to spell out both cases explicitly:

  • documents with non-empty meta (repr -> JSON serialization change)
  • documents whose meta contains non-JSON-serializable values like datetime or custom classes (now serialized via str(...) instead of repr(...) thanks to default=str)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: make Document.id deterministic regardless of meta key order

2 participants