[Query]: Adds ability to choose global vs local/focused statistics for FullTextScore #45686

Open
aayush3011 wants to merge 7 commits into Azure:main from aayush3011:users/akataria/fullTextImprovements

@aayush3011
Member

Description

Why?

Cosmos DB's implementation of FullTextScore computes BM25 statistics (term frequency, inverse document frequency, and document length) across all documents in the container, including all physical and logical partitions.

While this provides a valid and comprehensive representation of statistics for the entire dataset, it introduces challenges for several common use cases:

  • Multi-tenant scenarios: Tenants often operate in very different domains, which can significantly change the distribution and importance of keywords. Using global statistics leads to distorted relevance rankings for individual tenants.
  • Large containers with many partitions: Computing statistics across hundreds or thousands of physical partitions can be time-consuming and expensive. Customers may prefer statistics derived from only a subset of partitions to improve performance and reduce RU consumption.
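To illustrate the multi-tenant distortion described above, here is a minimal sketch of how inverse document frequency shifts with the corpus it is computed over. It assumes the common Lucene-style BM25 IDF formula; the exact formula Cosmos DB uses is not stated in this PR, and the helper name bm25_idf and the toy tenant corpora are hypothetical.

```python
import math

def bm25_idf(term, docs):
    """Lucene-style BM25 IDF over a tokenized corpus (assumed formula, not Cosmos DB's)."""
    n = len(docs)                                # total number of documents
    df = sum(1 for doc in docs if term in doc)   # documents containing the term
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

# Hypothetical toy data: tenant A works with invoices, tenant B with system logs.
tenant_a = [["invoice", "overdue"], ["invoice", "paid"], ["invoice", "refund"]]
tenant_b = [["kernel", "panic"], ["kernel", "patch"], ["driver", "crash"],
            ["disk", "full"], ["network", "timeout"], ["driver", "update"]]

# "Local" statistics: only tenant A's documents are counted.
local_idf = bm25_idf("invoice", tenant_a)
# "Global" statistics: every document in the container is counted.
global_idf = bm25_idf("invoice", tenant_a + tenant_b)

print(f"local IDF:  {local_idf:.3f}")
print(f"global IDF: {global_idf:.3f}")
```

Within tenant A, "invoice" appears in every document and carries little ranking signal, so its local IDF is low. Against the whole container it looks rare and receives a higher weight, which is the kind of skew the Local scope is meant to avoid.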

This is the Python SDK port of the .NET SDK PR: Azure/azure-cosmos-dotnet-v3#5582

What?

This PR extends the flexibility of BM25 scoring so that developers can choose between:

  • Global (default): FullTextScore computes BM25 statistics across all documents in the container, regardless of any partition key filters. This is the existing behavior.
  • Local: When a query includes a partition key filter, BM25 statistics are computed only over the subset of documents within the specified partition key values. Scores and ranking reflect relevance within that partition-specific slice of data.

How?

A new full_text_score_scope keyword argument is added to query_items():

   items = container.query_items(
       query="SELECT TOP 10 * FROM c WHERE c.tenantId = @tenantId ORDER BY RANK FullTextScore(c.text, 'keywords')",
       parameters=[{"name": "@tenantId", "value": tenant_id}],
       partition_key=tenant_id,
       full_text_score_scope="Local"  # or "Global" (default)
   )

When full_text_score_scope="Local", the hybrid search aggregator executes the global statistics query against only the query's target partition ranges instead of all ranges. This is a client-side-only change; no new HTTP headers are sent to the backend.
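The range selection described above can be sketched roughly as follows. This is not the SDK's actual code: the function name select_statistics_ranges and the dictionary shape of a partition range are hypothetical, and only the fullTextScoreScope option name is taken from the review summary of this PR.

```python
from typing import Any, Dict, List, Mapping, Sequence

def select_statistics_ranges(
    all_ranges: Sequence[Dict[str, Any]],
    target_ranges: Sequence[Dict[str, Any]],
    options: Mapping[str, Any],
) -> List[Dict[str, Any]]:
    """Pick the partition ranges the statistics query should run against.

    Hypothetical helper: "Local" restricts statistics to the ranges the
    query itself targets; anything else falls back to every range.
    """
    scope = options.get("fullTextScoreScope", "Global")
    if scope == "Local":
        return list(target_ranges)
    return list(all_ranges)

# Hypothetical ranges: the container has three, the query targets one.
all_ranges = [{"id": "0"}, {"id": "1"}, {"id": "2"}]
target_ranges = [{"id": "1"}]

local_selection = select_statistics_ranges(
    all_ranges, target_ranges, {"fullTextScoreScope": "Local"}
)
default_selection = select_statistics_ranges(all_ranges, target_ranges, {})
```

Because the decision is made entirely in the client before the statistics query is dispatched, a design along these lines needs no new request headers, consistent with the description above.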

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@aayush3011 aayush3011 requested a review from a team as a code owner March 13, 2026 17:20
Copilot AI review requested due to automatic review settings March 13, 2026 17:20
Contributor

Copilot AI left a comment

Pull request overview

This PR adds an opt-in way to scope BM25 statistics used by FullTextScore in Cosmos DB hybrid search, allowing callers to choose between global container-wide statistics and “local” statistics limited to the query’s target partition ranges.

Changes:

  • Added full_text_score_scope kwarg to query_items() (sync + async) with validation and docs.
  • Updated hybrid search aggregators to scope global statistics queries to either all ranges (Global/default) or target ranges (Local).
  • Added sync/async test coverage and updated the package changelog.
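As a rough illustration of the kwarg validation mentioned in the first change above, a check could look like the sketch below. The helper name validate_full_text_score_scope, the case-sensitive comparison, and the choice of ValueError are assumptions; the PR text only establishes that "Global" (default) and "Local" are the accepted values.

```python
from typing import Optional

# Accepted values as named in the PR description.
_VALID_SCOPES = ("Global", "Local")

def validate_full_text_score_scope(value: Optional[str]) -> str:
    """Return a usable scope, defaulting to "Global" when none is given.

    Hypothetical helper: the real SDK may validate differently.
    """
    if value is None:
        return "Global"
    if value not in _VALID_SCOPES:
        raise ValueError(
            f"full_text_score_scope must be one of {_VALID_SCOPES}, got {value!r}"
        )
    return value

default_scope = validate_full_text_score_scope(None)
local_scope = validate_full_text_score_scope("Local")
```

Rejecting unknown values up front means a typo such as "local" fails loudly in this sketch rather than silently falling back to global statistics.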

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

File | Description
--- | ---
sdk/cosmos/azure-cosmos/azure/cosmos/container.py | Adds full_text_score_scope kwarg, validates values, documents behavior, and passes the option into query feed options.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py | Async equivalent of full_text_score_scope kwarg support (validation + docs + feed option).
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/hybrid_search_aggregator.py | Uses fullTextScoreScope option to decide whether global statistics queries target all partition ranges or only query target ranges.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/hybrid_search_aggregator.py | Async equivalent of local vs global partition-range selection for statistics queries.
sdk/cosmos/azure-cosmos/tests/test_query_hybrid_search.py | Adds sync tests for Global vs Local scope behavior.
sdk/cosmos/azure-cosmos/tests/test_query_hybrid_search_async.py | Adds async tests for Global vs Local scope behavior.
sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the new full_text_score_scope parameter in the unreleased section.

Member

@simorenoh simorenoh left a comment

Copilot had some comments that may be worthwhile on the tests added - LGTM otherwise

@aayush3011
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).
