
feed range query fix #46692

Open

dibahlfi wants to merge 19 commits into hotfix/azure-cosmos_4.14.6 from users/dibahl/cosmos-4.14.7-backport

Conversation

@dibahlfi
Member

@dibahlfi dibahlfi commented May 3, 2026

Summary
This PR makes the Python Cosmos SDK's query continuation tokens safe to round-trip across SDK versions during a rolling deployment, while fixing the class of silent split bugs for feed_range and prefix partition_key queries.
The behavior is governed by a single rule, applied per request based on how many physical partitions the caller's input scope currently maps to in the PKRANGE cache:
• Currently 1 partition → emit the legacy single-string continuation (read and write). Old SDKs can read it; new SDKs can resume it.
• Currently >1 partition → emit the structured v=1 envelope (read and write). Required to safely represent one resume position per partition.
• Inbound legacy token, but the input scope now covers >1 partition (split happened since the token was minted) → restart that input range once, then emit v=1 from that page onward.
Full-PK queries are structurally always single-partition, so they always emit the legacy format regardless of the inbound token; this is the strongest cross-version guarantee in the SDK.
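The per-request rule above can be sketched as follows. The enum and function names here are illustrative only, not the SDK's actual API:

```python
from enum import Enum

class TokenFormat(Enum):
    LEGACY = "legacy"  # opaque single-string continuation
    V1 = "v1"          # structured multi-partition envelope

def choose_token_format(partition_count: int) -> TokenFormat:
    # Currently 1 partition -> legacy (read and write);
    # currently >1 partition -> structured v=1 envelope.
    return TokenFormat.LEGACY if partition_count == 1 else TokenFormat.V1

def needs_restart(inbound_is_legacy: bool, partition_count: int) -> bool:
    # A legacy token against a scope that now covers >1 partition means a
    # split happened since the token was minted: restart that input range
    # once, then emit v=1 from that page onward.
    return inbound_is_legacy and partition_count > 1
```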
Token formats
• Legacy — opaque single-string continuation the SDK has emitted for years. One position slot. Readable by any SDK version.
• v=1 envelope — base64-encoded JSON carrying one resume position per partition plus identity hashes. Only the new SDK can read it; old SDKs forward it verbatim and the service rejects it with 400 BadRequest.
Decoded v=1 shape (conceptual):

```json
{
  "v": 1,
  "qh": "<hash of query text + parameters>",
  "cr": "/dbs/.../colls/...",
  "frh": "",
  "c": [
    { "min": "...", "max": "...", "bc": "" },
    ...
  ]
}
```
qh / cr / frh let the SDK refuse a token replayed against a different query, container, or feed_range. On a mismatch the SDK raises a ValueError that propagates unwrapped to the caller.
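A minimal encode/validate sketch of this envelope. The field names (v, qh, cr, frh, c) follow the conceptual JSON above, but the hashing scheme and error text are assumptions, not the SDK's real implementation:

```python
import base64
import hashlib
import json

def _hash(text: str) -> str:
    # Assumed hash; the SDK's actual identity hash may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def encode_v1(query_text, container_rid, feed_range, positions) -> str:
    envelope = {
        "v": 1,
        "qh": _hash(query_text),
        "cr": container_rid,
        "frh": _hash(feed_range),
        "c": positions,  # one {"min", "max", "bc"} slot per partition
    }
    return base64.b64encode(json.dumps(envelope).encode("utf-8")).decode("ascii")

def decode_v1(token, query_text, container_rid, feed_range) -> dict:
    envelope = json.loads(base64.b64decode(token))
    # Refuse a token replayed against a different query, container, or feed_range.
    if (envelope.get("qh") != _hash(query_text)
            or envelope.get("cr") != container_rid
            or envelope.get("frh") != _hash(feed_range)):
        raise ValueError("continuation token does not match this query/container/feed_range")
    return envelope
```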

Cross-SDK compatibility matrix

| Direction | Full-PK | Feed-range / prefix-PK (single-partition input range) | Feed-range / prefix-PK (multi-partition input range) |
|---|---|---|---|
| Old → Old | Restart on split when the service no longer honors a parent-era continuation (today's behavior). | Original silent split bug if a split happens before replay. | Original silent split bug if a split happens before replay. |
| New → New | Always legacy format; resume with the legacy token; restart only if the service rejects a parent-era continuation after a split (fallback restart). | Legacy format; resume from the saved position; restart only if a split crossed the input range and it is now multi-partition. | v=1; resume per partition; split-safe. |
| New → Old | Zero impact (always legacy format). | Zero impact (the new SDK emitted legacy format). | 400 on the old reader. |

Mid-page failure handling:
When the pagination loop is mid-page and a backend POST raises (any HTTP error, network failure, etc.), the SDK stamps a resumable checkpoint into last_response_headers[Continuation] before re-raising the original exception. The checkpoint shape follows the same per-request rule as above: legacy if the input range currently maps to one partition (always, for full-PK), v=1 if it currently maps to more than one. The caller's by_page() loop sees the exception and can resume from the saved checkpoint.
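The checkpoint-then-reraise behaviour can be modeled as below. The function and parameter names are stand-ins for SDK internals, and the literal header key is an assumption (the SDK uses its Continuation header constant):

```python
def fetch_page_with_checkpoint(do_post, serialize_checkpoint, last_response_headers):
    """Run one backend POST; on any failure, stamp a resumable checkpoint
    into the response headers before re-raising the original exception."""
    try:
        return do_post()
    except Exception:
        # The checkpoint format follows the same per-request rule:
        # legacy for a single-partition scope, v=1 for multi-partition.
        last_response_headers["x-ms-continuation"] = serialize_checkpoint()
        raise
```

The caller's by_page() loop then sees the original exception and can pick the saved value out of the headers to resume.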
Rollout safety
• Single-partition input ranges (the common case): zero impact across versions in either direction. Bookmarks stay legacy both ways.
• Full-PK: structurally single-partition forever. Strongest guarantee — persist a bookmark, replay from any SDK version, no 400.
• Multi-partition input ranges: bookmarks are v=1 and only readable by the new SDK. During a rolling upgrade, an old pod that picks up a new pod's v=1 token loses that page with a 400, clears the token, and restarts.

dibahlfi added 2 commits May 2, 2026 22:15
* fix - fixing global endpoint issue

* fix - updating CHANGELOG

(cherry picked from commit 9eff742)
@dibahlfi
Member Author

dibahlfi commented May 3, 2026

@sdkReviewAgent-2

@dibahlfi
Member Author

dibahlfi commented May 3, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi changed the title "Users/dibahl/cosmos 4.14.7 backport - conda error" to "feed range query fix(new)" on May 3, 2026
@dibahlfi
Member Author

dibahlfi commented May 3, 2026

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_base.py
Comment thread sdk/cosmos/azure-cosmos/CHANGELOG.md
return None


def _strip_sql_block_comments(query_text: str) -> str:
Member


🟢 Suggestion — Correctness: SQL line comments (--) not stripped

_strip_sql_block_comments handles /* ... */ block comments but not -- line comments. The docstring honestly documents this ("removes only block comments"), but there's a concrete edge case:

SELECT VALUE COUNT(1) -- FROM other_table
FROM c

After whitespace normalization (" ".join(split())), this becomes:

SELECT VALUE COUNT(1) -- FROM OTHER_TABLE FROM C

_extract_outer_select_value_projection scans for FROM at depth 0 and would match FROM OTHER_TABLE inside the -- comment instead of the real FROM C, truncating the projection incorrectly.

Impact: Low — SDK-constructed queries rarely use -- comments. But adding a line-comment strip pass (scan for -- outside quotes, skip to next \n) would close this gap.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py Outdated
# ``BAD_REQUEST``).
pagination_state = None
else:
pagination_state = _FeedRangePaginationState.from_derived_feedranges(
Member


🟢 Suggestion — Maintainability: Sync/async formatting divergence signals independent editing

Minor inconsistency: the closing parenthesis here uses extra indentation compared to the sync counterpart in _cosmos_client_connection.py:

# Async (this file):
                    pagination_state = _FeedRangePaginationState.from_derived_feedranges(
                        all_feedranges,
                        page_size_hint,
                        )  # <-- indented further

# Sync:
                    pagination_state = _FeedRangePaginationState.from_derived_feedranges(
                        all_feedranges,
                        page_size_hint,
                    )  # <-- aligned with call

This is cosmetic, but the ~215 lines of near-identical pagination logic in each file (with only await/return differences) are a real maintenance risk. This formatting drift is a canary — it suggests these files are being edited independently rather than mechanically synchronized.

Consider extracting a few more pure sub-blocks into feed_range_continuation.py to reduce the duplicated surface:

  • The _checkpoint_and_reraise body (~15 lines identical)
  • The merge-with-fallback block (~12 lines identical)
  • The legacy-bridge header restoration (~8 lines identical)

This would reduce the sync/async clone from ~215 to ~140 lines.
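The extraction pattern being proposed can be illustrated like this: keep the pure, transport-free logic in one shared module so the sync and async files differ only in how they call the transport. All names below are hypothetical, not the SDK's actual helpers:

```python
import asyncio

def _next_request_payload(state: dict) -> dict:
    # Pure core shared by both files: no I/O, no await.
    return {"range": state["range"], "continuation": state.get("token")}

def fetch_page_sync(state: dict, post) -> dict:
    # Sync wrapper: only the transport call differs.
    return post(_next_request_payload(state))

async def fetch_page_async(state: dict, post) -> dict:
    # Async wrapper over the same core.
    return await post(_next_request_payload(state))
```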

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Member

Review complete (51:24)

Posted 7 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@scbedd
Member

scbedd commented May 4, 2026

/azp run python - pullrequest

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@scbedd
Member

scbedd commented May 4, 2026

/azp run python - cosmos - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Member Author

dibahlfi commented May 4, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Member Author

dibahlfi commented May 5, 2026

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
Comment thread sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit.py
@xinlian12
Member

Review complete (57:42)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 and others added 3 commits May 13, 2026 10:49
Service-side error messages changed wording (e.g., 'Full Text Policy' ->
'full-text policy', 'abstract' -> "'abstract'"). Replace 3 brittle
string-match assertions with the existing _assert_message_contains
helper, which normalizes case, single quotes, and hyphens. The async
test file already uses this helper consistently; this brings the sync
file in line.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Remove remaining imports of pkg_resources
* Restore comment explaining opentelemetry_version + cleanup
tvaron3 and others added 2 commits May 13, 2026 16:52
Some packages (azure-ai-ml, azure-ai-evaluation) have setup.py files
that import pkg_resources at module level. When the build venv has
setuptools>=80 (where pkg_resources is gone), exec'ing those setup.py
files raises ModuleNotFoundError, which discover_targeted_packages did
not catch. This aborted the entire 'Ensure service coverage' regression-
matrix step for cosmos hotfix builds.

Broaden the exception handler from RuntimeError to Exception so a single
bad setup.py omits just that package from discovery instead of aborting
the run. The error reason is logged for traceability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
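The broadened handler described in this commit message can be modeled as follows; the function and call names are illustrative stand-ins for the real discovery code:

```python
import logging

def discover_targeted_packages(setup_files, exec_setup):
    """Exec each setup.py; one bad file omits just that package
    from discovery instead of aborting the whole run."""
    discovered = []
    for path in setup_files:
        try:
            discovered.append(exec_setup(path))
        except Exception as exc:  # was: except RuntimeError
            # e.g. a setup.py importing pkg_resources under setuptools>=80
            # raises ModuleNotFoundError here; log the reason and skip.
            logging.warning("Skipping %s: %s", path, exc)
    return discovered
```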
tvaron3 and others added 4 commits May 13, 2026 17:01
The conda assembly's create_combined_sdist function looks for the
sdist file produced by 'python setup.py sdist' inside the assembled
folder. With setuptools >= 69 the sdist filename is normalized per
PEP 625 (e.g. 'azure_core-1.35.0.tar.gz'), so the substring check
'"azure-core" in a' no longer matches and the next() call raises
StopIteration, aborting the entire 'Assemble Conda Packages' step.

Match the canonicalized form by replacing '-' with '_' before the
substring check, identical to the fix already in main (PR #44009).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
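The filename-matching fix in this commit can be sketched as below. With setuptools >= 69 the sdist name is normalized per PEP 625 (e.g. azure_core-1.35.0.tar.gz), so canonicalize both sides before the substring check; the function name is illustrative:

```python
def find_sdist(package_name: str, artifacts: list) -> str:
    """Locate a package's sdist whether the filename uses the legacy
    hyphenated form or the PEP 625 underscore-normalized form."""
    canonical = package_name.replace("-", "_")
    # Canonicalize each candidate too, so both old and new names match.
    return next(a for a in artifacts if canonical in a.replace("-", "_"))
```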
The conda 'env create' step now reaches conda.anaconda.org, which is
not in the allowlist for the network-isolated 1ES build agents
(requests get sinkholed to 192.0.2.11). Switch the conda env channel
to https://prefix.dev/conda-forge, matching the workaround applied
upstream on the conda_release_patch branch (commit 1b19167).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
azure-ai-ml is the only conda recipe in the conda-sdk-client pipeline
with an explicit additional channel (conda-forge). With network
isolation blocking conda.anaconda.org, conda-build fetches against
that channel fail with 'HTTP 000 CONNECTION FAILED'. Switch to the
prefix.dev mirror so the package's transitive dependencies resolve
through the same allowlisted host the build env already uses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
setuptools 81+ removed pkg_resources, which both azure-ai-ml/setup.py
and azure-ai-evaluation/setup.py still import unconditionally. The conda
build env now ships setuptools 82.0.1 from prefix.dev/conda-forge, so
`pip install .` fails with ModuleNotFoundError: No module named
'pkg_resources'.

Pin setuptools<70 in host requirements and use --no-build-isolation so
pip uses the host env's older setuptools (which still bundles
pkg_resources) instead of pulling latest into an isolated build env.

This is a conda-recipe-only change so it does not trigger pull request
CI for the azure-ai-ml or azure-ai-evaluation packages themselves.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Member

tvaron3 commented May 14, 2026

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).



8 participants