Skip to content

release-26.2: sql/stats: decouple canary_stats_mode from canary_fraction cluster setting#168295

Open
blathers-crl[bot] wants to merge 1 commit intorelease-26.2from
blathers/backport-release-26.2-168273
Open

release-26.2: sql/stats: decouple canary_stats_mode from canary_fraction cluster setting#168295
blathers-crl[bot] wants to merge 1 commit intorelease-26.2from
blathers/backport-release-26.2-168273

Conversation

@blathers-crl
Copy link
Copy Markdown

@blathers-crl blathers-crl bot commented Apr 13, 2026

Backport 1/1 commits from #168273 on behalf of @ZhouXing19.


Previously, the canary stats rollout was globally gated by the sql.stats.canary_fraction cluster setting: if the fraction was 0, all sessions got StatsRolloutDefault regardless of their canary_stats_mode session variable. This meant that a user who just wanted to try canary stats in a single session had to set a non-zero canary_fraction cluster-wide, enrolling all sessions in the experiment. As discussed here, this might be too cumbersome for users to try out this feature.

This commit moves the canary_fraction check inside the "auto" case of canaryRollDice, so that the explicit session modes work independently of the cluster setting. The session variable values are renamed from "off"/"on" to "force_stable"/"force_canary" to clarify that they are unconditional overrides.

The defaults remain auto + fraction=0, so canary stats rollout is still off by default. Users must intentionally opt in, either per-session (SET canary_stats_mode = force_canary) or cluster-wide (SET CLUSTER SETTING sql.stats.canary_fraction > 0).

The PREPARE path guard is updated to also force StatsRolloutStable when the session mode is force_canary or force_stable (not just when canary_fraction > 0), ensuring prepared statement memo caching works correctly with the per-session overrides.

Epic: none
Release note (sql change): The canary_stats_mode session variable values are renamed from "off"/"on" to "force_stable"/"force_canary". These modes now work independently of the sql.stats.canary_fraction cluster setting, allowing per-session opt-in without cluster-wide enrollment.


Release justification: makes a customer requested feature easier for experimental usage

…tting

Previously, the canary stats rollout was globally gated by the
sql.stats.canary_fraction cluster setting: if the fraction was 0, all
sessions got StatsRolloutDefault regardless of their canary_stats_mode
session variable. This meant that a user who just wanted to try canary
stats in a single session had to set a non-zero canary_fraction
cluster-wide, enrolling all sessions in the experiment.

This commit moves the canary_fraction check inside the "auto" case of
canaryRollDice, so that the explicit session modes work independently
of the cluster setting. The session variable values are renamed from
"off"/"on" to "force_stable"/"force_canary" to clarify that they are
unconditional overrides.

The defaults remain auto + fraction=0, so canary stats rollout is
still off by default. Users must intentionally opt in, either per-session
(SET canary_stats_mode = force_canary) or cluster-wide (SET CLUSTER
SETTING sql.stats.canary_fraction > 0).

The PREPARE path guard is updated to also force StatsRolloutStable when
the session mode is force_canary or force_stable (not just when
canary_fraction > 0), ensuring prepared statement memo caching works
correctly with the per-session overrides.

Epic: none
Release note (sql change): The canary_stats_mode session variable values
are renamed from "off"/"on" to "force_stable"/"force_canary". These modes
now work independently of the sql.stats.canary_fraction cluster setting,
allowing per-session opt-in without cluster-wide enrollment.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@blathers-crl blathers-crl bot requested a review from a team as a code owner April 13, 2026 22:17
@blathers-crl blathers-crl bot requested review from DrewKimball and removed request for a team April 13, 2026 22:17
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 13, 2026
@blathers-crl blathers-crl bot requested a review from michae2 April 13, 2026 22:17
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Apr 13, 2026

Merging to release-26.2 in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@blathers-crl
Copy link
Copy Markdown
Author

blathers-crl bot commented Apr 13, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Apr 13, 2026
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@blathers-crl
Copy link
Copy Markdown
Author

blathers-crl bot commented Apr 13, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@michae2 made 1 comment.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on DrewKimball).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR decouples per-session canary stats selection (canary_stats_mode) from the cluster-wide sql.stats.canary_fraction gate, so users can force stable/canary behavior in a single session without enabling the experiment cluster-wide.

Changes:

  • Renames canary_stats_mode values from on/off to force_canary/force_stable and updates docs/descriptions accordingly.
  • Moves the canary_fraction == 0 gate into the auto mode path so explicit session overrides work even when the fraction is 0.
  • Updates the PREPARE path to force StatsRolloutStable when non-auto session overrides are used, and adjusts related execbuilder logic tests.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/sql/vars.go Updates session var value validation error strings for renamed modes.
pkg/sql/sessiondatapb/local_only_session_data.proto Updates inline documentation for the renamed session modes.
pkg/sql/sessiondatapb/local_only_session_data.go Renames the CanaryStatsMode enum string forms and parsing.
pkg/sql/session_var_descriptions.go Updates canary_stats_mode description to reflect new semantics and names.
pkg/sql/opt/exec/execbuilder/testdata/canary_stats Updates/extends logic tests to use force modes and validate PREPARE/EXECUTE behavior.
pkg/sql/logictest/testdata/logic_test/show_source Updates expected SHOW ALL description output for canary_stats_mode.
pkg/sql/conn_executor_prepare.go Forces stable rollout during PREPARE when fraction > 0 or when session override is non-auto.
pkg/sql/conn_executor.go Updates canaryRollDice so fraction=0 only affects auto, not forced modes.
docs/generated/sql/session_vars.md Updates generated session variable documentation for renamed modes and semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/sql/conn_executor.go
Comment on lines 161 to 179
// canaryRollDice determines the stats rollout selection for a query.
// The result is one of:
// - StatsRolloutDefault: experiment not active, use default stats.
// - StatsRolloutCanary: experiment active, dice selected canary (newest) stats.
// - StatsRolloutStable: experiment active, dice selected stable (second-newest) stats.
//
// The selection is atomic per query — all tables in the query use the same
// rollout path. Only called for non-internal queries.
func canaryRollDice(
ctx context.Context, evalCtx *eval.Context, rng *rand.Rand,
) eval.StatsRolloutSelection {
if !evalCtx.Settings.Version.IsActive(ctx, clusterversion.V26_2) {
return eval.StatsRolloutDefault
}
threshold := stats.CanaryFraction.Get(&evalCtx.Settings.SV)
if threshold == 0 {
return eval.StatsRolloutDefault
}
switch m := evalCtx.SessionData().CanaryStatsMode; m {
case sessiondatapb.CanaryStatsModeOff:
case sessiondatapb.CanaryStatsModeForceStable:
return eval.StatsRolloutStable
case sessiondatapb.CanaryStatsModeOn:
case sessiondatapb.CanaryStatsModeForceCanary:
return eval.StatsRolloutCanary
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canaryRollDice header comment and the bullet list below it describe Canary/Stable results as only occurring when the experiment is active, but after this change StatsRolloutStable/Canary can also be returned via force_stable/force_canary even when sql.stats.canary_fraction is 0. Please update the comment to reflect that forced session modes override the cluster setting and can yield non-Default rollouts even when the fraction is zero.

Copilot uses AI. Check for mistakes.
@@ -546,7 +541,7 @@ query cache hit (prepare)

# Stable EXECUTE reuses the prepared memo.
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment “Stable EXECUTE reuses the prepared memo” is misleading in this test: PREPARE ran while canary_stats_mode was still auto, so switching to force_stable only here causes the first EXECUTEs to rebuild due to rollout-mode mismatch (as shown by the expected trace output). Consider setting canary_stats_mode=force_stable before PREPARE, or updating the comment to explicitly call out the initial rebuild behavior after the mode switch.

Suggested change
# Stable EXECUTE reuses the prepared memo.
# PREPARE ran before switching from auto to force_stable, so the first
# stable EXECUTEs rebuild due to rollout-mode mismatch before later
# EXECUTEs can reuse the stable prepared memo.

Copilot uses AI. Check for mistakes.
Comment on lines +472 to +474
case "FORCE_STABLE":
return CanaryStatsModeForceStable, true
case "FORCE_CANARY":
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CanaryStatsModeFromString no longer recognizes the previous session variable values "on"/"off". This is a user-visible breaking change for existing scripts that used SET canary_stats_mode = on/off. Consider continuing to accept "ON" and "OFF" as deprecated aliases for FORCE_CANARY/FORCE_STABLE (while still printing the new names) to preserve compatibility across upgrades/backports.

Suggested change
case "FORCE_STABLE":
return CanaryStatsModeForceStable, true
case "FORCE_CANARY":
case "FORCE_STABLE", "OFF":
return CanaryStatsModeForceStable, true
case "FORCE_CANARY", "ON":

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants