
feed range query fix #46692

Open

dibahlfi wants to merge 19 commits into hotfix/azure-cosmos_4.14.6 from users/dibahl/cosmos-4.14.7-backport

Conversation

@dibahlfi
Member

@dibahlfi dibahlfi commented May 3, 2026

Summary
This PR makes the Python Cosmos SDK's query continuation tokens safe to round-trip across SDK versions during a rolling deployment, while fixing the class of silent split bugs for feed_range and prefix partition_key queries.
The behavior is governed by a single rule, applied per request based on how many physical partitions the caller's input scope currently maps to in the PKRANGE cache:
• Currently 1 partition → emit the legacy single-string continuation (read and write). Old SDKs can read it; new SDKs can resume it.
• Currently >1 partition → emit the structured v=1 envelope (read and write). Required to safely represent one resume position per partition.
• Inbound legacy token, but the input scope now covers >1 partition (split happened since the token was minted) → restart that input range once, then emit v=1 from that page onward.
Full-PK queries are structurally always single-partition, so they always emit the legacy format regardless of the inbound token; this is the strongest cross-version guarantee in the SDK.
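The per-request rule above can be sketched as follows. The enum and function names here are illustrative only, not the SDK's actual API:

```python
from enum import Enum

class TokenFormat(Enum):
    LEGACY = "legacy"  # opaque single-string continuation
    V1 = "v1"          # structured multi-partition envelope

def choose_token_format(partition_count: int) -> TokenFormat:
    # Currently 1 partition -> legacy (read and write);
    # currently >1 partition -> structured v=1 envelope.
    return TokenFormat.LEGACY if partition_count == 1 else TokenFormat.V1

def needs_restart(inbound_is_legacy: bool, partition_count: int) -> bool:
    # A legacy token against a scope that now covers >1 partition means a
    # split happened since the token was minted: restart that input range
    # once, then emit v=1 from that page onward.
    return inbound_is_legacy and partition_count > 1
```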
Token formats
• Legacy — opaque single-string continuation the SDK has emitted for years. One position slot. Readable by any SDK version.
• v=1 envelope — base64-encoded JSON carrying one resume position per partition plus identity hashes. Only the new SDK can read it; old SDKs forward it verbatim and the service rejects it with 400 BadRequest.
Decoded v=1 shape (conceptual):

```json
{
  "v": 1,
  "qh": "<hash of query text + parameters>",
  "cr": "/dbs/.../colls/...",
  "frh": "",
  "c": [
    { "min": "...", "max": "...", "bc": "" },
    ...
  ]
}
```
qh / cr / frh let the SDK refuse a token replayed against a different query, container, or feed_range. On a mismatch the SDK raises a ValueError that propagates unwrapped to the caller.
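A minimal encode/validate sketch of this envelope. The field names (v, qh, cr, frh, c) follow the conceptual JSON above, but the hashing scheme and error text are assumptions, not the SDK's real implementation:

```python
import base64
import hashlib
import json

def _hash(text: str) -> str:
    # Assumed hash; the SDK's actual identity hash may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def encode_v1(query_text, container_rid, feed_range, positions) -> str:
    envelope = {
        "v": 1,
        "qh": _hash(query_text),
        "cr": container_rid,
        "frh": _hash(feed_range),
        "c": positions,  # one {"min", "max", "bc"} slot per partition
    }
    return base64.b64encode(json.dumps(envelope).encode("utf-8")).decode("ascii")

def decode_v1(token, query_text, container_rid, feed_range) -> dict:
    envelope = json.loads(base64.b64decode(token))
    # Refuse a token replayed against a different query, container, or feed_range.
    if (envelope.get("qh") != _hash(query_text)
            or envelope.get("cr") != container_rid
            or envelope.get("frh") != _hash(feed_range)):
        raise ValueError("continuation token does not match this query/container/feed_range")
    return envelope
```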

Cross-SDK compatibility matrix

| Direction | Full-PK | Feed-range / prefix-PK (single-partition input range) | Feed-range / prefix-PK (multi-partition input range) |
|---|---|---|---|
| Old → Old | Restart on split when the service no longer honors a parent-era continuation (today's behavior). | Original silent split bug if a split happens before replay. | Original silent split bug if a split happens before replay. |
| New → New | Always legacy format; resume with the legacy token; restart only if the service rejects a parent-era continuation after a split (fallback restart). | Legacy format; resume from the saved position; restart only if a split crossed the input range and it is now multi-partition. | v=1; resume per partition; split-safe. |
| New → Old | Zero impact (always legacy format). | Zero impact (the new SDK emitted legacy format). | 400 on the old reader. |

Mid-page failure handling:
When the pagination loop is mid-page and a backend POST raises (any HTTP error, network failure, etc.), the SDK stamps a resumable checkpoint into last_response_headers[Continuation] before re-raising the original exception. The checkpoint shape follows the same per-request rule as above: legacy if the input range currently maps to one partition (always, for full-PK), v=1 if it currently maps to more than one. The caller's by_page() loop sees the exception and can resume from the saved checkpoint.
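The checkpoint-then-reraise behaviour can be modeled as below. The function and parameter names are stand-ins for SDK internals, and the literal header key is an assumption (the SDK uses its Continuation header constant):

```python
def fetch_page_with_checkpoint(do_post, serialize_checkpoint, last_response_headers):
    """Run one backend POST; on any failure, stamp a resumable checkpoint
    into the response headers before re-raising the original exception."""
    try:
        return do_post()
    except Exception:
        # The checkpoint format follows the same per-request rule:
        # legacy for a single-partition scope, v=1 for multi-partition.
        last_response_headers["x-ms-continuation"] = serialize_checkpoint()
        raise
```

The caller's by_page() loop then sees the original exception and can pick the saved value out of the headers to resume.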
Rollout safety
• Single-partition input ranges (the common case): zero impact across versions in either direction. Bookmarks stay legacy both ways.
• Full-PK: structurally single-partition forever. Strongest guarantee — persist a bookmark, replay from any SDK version, no 400.
• Multi-partition input ranges: bookmarks are v=1 and only readable by the new SDK. During a rolling upgrade, an old pod that picks up a new pod's v=1 token loses that page with a 400, clears the token, and restarts.

dibahlfi added 2 commits May 2, 2026 22:15
* fix - fixing global endpoint issue

* fix - updating CHANGELOG

(cherry picked from commit 9eff742)
@dibahlfi
Member Author

dibahlfi commented May 3, 2026

@sdkReviewAgent-2

@dibahlfi
Member Author

dibahlfi commented May 3, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi changed the title "Users/dibahl/cosmos 4.14.7 backport - conda error" to "feed range query fix(new)" on May 3, 2026
@dibahlfi
Member Author

dibahlfi commented May 3, 2026

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_base.py
Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md
return None


def _strip_sql_block_comments(query_text: str) -> str:
Member


🟢 Suggestion — Correctness: SQL line comments (--) not stripped

_strip_sql_block_comments handles /* ... */ block comments but not -- line comments. The docstring honestly documents this ("removes only block comments"), but there's a concrete edge case:

SELECT VALUE COUNT(1) -- FROM other_table
FROM c

After whitespace normalization (" ".join(split())), this becomes:

SELECT VALUE COUNT(1) -- FROM OTHER_TABLE FROM C

_extract_outer_select_value_projection scans for FROM at depth 0 and would match FROM OTHER_TABLE inside the -- comment instead of the real FROM C, truncating the projection incorrectly.

Impact: Low — SDK-constructed queries rarely use -- comments. But adding a line-comment strip pass (scan for -- outside quotes, skip to next \n) would close this gap.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py Outdated
# ``BAD_REQUEST``).
pagination_state = None
else:
pagination_state = _FeedRangePaginationState.from_derived_feedranges(
Member


🟢 Suggestion — Maintainability: Sync/async formatting divergence signals independent editing

Minor inconsistency: the closing parenthesis here uses extra indentation compared to the sync counterpart in _cosmos_client_connection.py:

# Async (this file):
                    pagination_state = _FeedRangePaginationState.from_derived_feedranges(
                        all_feedranges,
                        page_size_hint,
                        )  # <-- indented further

# Sync:
                    pagination_state = _FeedRangePaginationState.from_derived_feedranges(
                        all_feedranges,
                        page_size_hint,
                    )  # <-- aligned with call

This is cosmetic, but the ~215 lines of near-identical pagination logic in each file (with only await/return differences) are a real maintenance risk. This formatting drift is a canary — it suggests these files are being edited independently rather than mechanically synchronized.

Consider extracting a few more pure sub-blocks into feed_range_continuation.py to reduce the duplicated surface:

  • The _checkpoint_and_reraise body (~15 lines identical)
  • The merge-with-fallback block (~12 lines identical)
  • The legacy-bridge header restoration (~8 lines identical)

This would reduce the sync/async clone from ~215 to ~140 lines.
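The extraction pattern being proposed can be illustrated like this: keep the pure, transport-free logic in one shared module so the sync and async files differ only in how they call the transport. All names below are hypothetical, not the SDK's actual helpers:

```python
import asyncio

def _next_request_payload(state: dict) -> dict:
    # Pure core shared by both files: no I/O, no await.
    return {"range": state["range"], "continuation": state.get("token")}

def fetch_page_sync(state: dict, post) -> dict:
    # Sync wrapper: only the transport call differs.
    return post(_next_request_payload(state))

async def fetch_page_async(state: dict, post) -> dict:
    # Async wrapper over the same core.
    return await post(_next_request_payload(state))
```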

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Member

Review complete (51:24)

Posted 7 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@scbedd
Member

scbedd commented May 4, 2026

/azp run python - pullrequest

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@scbedd
Member

scbedd commented May 4, 2026

/azp run python - cosmos - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Member Author

dibahlfi commented May 4, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Member Author

dibahlfi commented May 5, 2026

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
Comment thread sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit.py
@xinlian12
Member

Review complete (57:42)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 and others added 3 commits May 13, 2026 10:49
Service-side error messages changed wording (e.g., 'Full Text Policy' ->
'full-text policy', 'abstract' -> "'abstract'"). Replace 3 brittle
string-match assertions with the existing _assert_message_contains
helper, which normalizes case, single quotes, and hyphens. The async
test file already uses this helper consistently; this brings the sync
file in line.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Remove remaining imports of pkg_resources
* Restore comment explaining opentelemetry_version + cleanup
tvaron3 and others added 2 commits May 13, 2026 16:52
Some packages (azure-ai-ml, azure-ai-evaluation) have setup.py files
that import pkg_resources at module level. When the build venv has
setuptools>=80 (where pkg_resources is gone), exec'ing those setup.py
files raises ModuleNotFoundError, which discover_targeted_packages did
not catch. This aborted the entire 'Ensure service coverage' regression-
matrix step for cosmos hotfix builds.

Broaden the exception handler from RuntimeError to Exception so a single
bad setup.py omits just that package from discovery instead of aborting
the run. The error reason is logged for traceability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
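The broadened handler described in this commit message can be modeled as follows; the function and call names are illustrative stand-ins for the real discovery code:

```python
import logging

def discover_targeted_packages(setup_files, exec_setup):
    """Exec each setup.py; one bad file omits just that package
    from discovery instead of aborting the whole run."""
    discovered = []
    for path in setup_files:
        try:
            discovered.append(exec_setup(path))
        except Exception as exc:  # was: except RuntimeError
            # e.g. a setup.py importing pkg_resources under setuptools>=80
            # raises ModuleNotFoundError here; log the reason and skip.
            logging.warning("Skipping %s: %s", path, exc)
    return discovered
```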
tvaron3 and others added 4 commits May 13, 2026 17:01
The conda assembly's create_combined_sdist function looks for the
sdist file produced by 'python setup.py sdist' inside the assembled
folder. With setuptools >= 69 the sdist filename is normalized per
PEP 625 (e.g. 'azure_core-1.35.0.tar.gz'), so the substring check
'"azure-core" in a' no longer matches and the next() call raises
StopIteration, aborting the entire 'Assemble Conda Packages' step.

Match the canonicalized form by replacing '-' with '_' before the
substring check, identical to the fix already in main (PR #44009).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
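The filename-matching fix in this commit can be sketched as below. With setuptools >= 69 the sdist name is normalized per PEP 625 (e.g. azure_core-1.35.0.tar.gz), so canonicalize both sides before the substring check; the function name is illustrative:

```python
def find_sdist(package_name: str, artifacts: list) -> str:
    """Locate a package's sdist whether the filename uses the legacy
    hyphenated form or the PEP 625 underscore-normalized form."""
    canonical = package_name.replace("-", "_")
    # Canonicalize each candidate too, so both old and new names match.
    return next(a for a in artifacts if canonical in a.replace("-", "_"))
```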
The conda 'env create' step now reaches conda.anaconda.org, which is
not in the allowlist for the network-isolated 1ES build agents
(requests get sinkholed to 192.0.2.11). Switch the conda env channel
to https://prefix.dev/conda-forge, matching the workaround applied
upstream on the conda_release_patch branch (commit 1b19167).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
azure-ai-ml is the only conda recipe in the conda-sdk-client pipeline
with an explicit additional channel (conda-forge). With network
isolation blocking conda.anaconda.org, conda-build fetches against
that channel fail with 'HTTP 000 CONNECTION FAILED'. Switch to the
prefix.dev mirror so the package's transitive dependencies resolve
through the same allowlisted host the build env already uses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
setuptools 81+ removed pkg_resources, which both azure-ai-ml/setup.py
and azure-ai-evaluation/setup.py still import unconditionally. The conda
build env now ships setuptools 82.0.1 from prefix.dev/conda-forge, so
`pip install .` fails with ModuleNotFoundError: No module named
'pkg_resources'.

Pin setuptools<70 in host requirements and use --no-build-isolation so
pip uses the host env's older setuptools (which still bundles
pkg_resources) instead of pulling latest into an isolated build env.

This is a conda-recipe-only change so it does not trigger pull request
CI for the azure-ai-ml or azure-ai-evaluation packages themselves.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Member

tvaron3 commented May 14, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).



8 participants