[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown by stanyao · Pull Request #54972 · apache/spark

stanyao · 2026-03-24T00:29:16Z

What changes were proposed in this pull request?

This PR adds support for ANSI SQL TABLESAMPLE SYSTEM (block-level sampling) alongside the existing TABLESAMPLE BERNOULLI (row-level sampling). Key changes:

SQL grammar: Extended TABLESAMPLE to accept an optional SYSTEM or BERNOULLI qualifier before the sample method. Both are added as non-reserved keywords.
Logical plan: Introduced a SampleMethod sealed trait (Bernoulli/System) and added it to the Sample node. The default is Bernoulli for backward compatibility.
Parser: TABLESAMPLE SYSTEM supports only PERCENT sampling and does not support REPEATABLE. The other sample methods (ROWS, BUCKET, BYTES) are rejected with clear error messages.
DSv2 pushdown: TABLESAMPLE SYSTEM is pushed down to data sources through a new SupportsPushDownTableSample.pushTableSample() overload that takes an isSystemSampling flag. Sources that don't override the new overload reject SYSTEM sampling by default.
Physical planning: A SYSTEM sample that isn't pushed down to a DSv2 source raises an AnalysisException. There is no row-level fallback, because block sampling depends on the data source.
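The pushdown and planning behavior described above can be sketched in isolation. This is a minimal model, not Spark's code: TableSamplePushDown, LegacySource, BlockAwareSource, and planSystemSample are hypothetical stand-ins, the parameter list of the new overload is an assumption (only the isSystemSampling flag is named in this PR), and UnsupportedOperationException stands in for Spark's AnalysisException.

```java
/** Sketch only: every name here is a hypothetical stand-in, not Spark's API. */
final class SystemSamplePushdownSketch {

    /** Stand-in for a DSv2 mix-in that accepts sample pushdown. */
    interface TableSamplePushDown {
        // Row-level (Bernoulli) pushdown, modeled on the pre-existing method.
        boolean pushTableSample(double lowerBound, double upperBound,
                                boolean withReplacement, long seed);

        // Assumed shape of the new overload. A source that does not override
        // it refuses SYSTEM sampling and keeps its row-level behavior.
        default boolean pushTableSample(double lowerBound, double upperBound,
                                        boolean withReplacement, long seed,
                                        boolean isSystemSampling) {
            if (isSystemSampling) {
                return false;
            }
            return pushTableSample(lowerBound, upperBound, withReplacement, seed);
        }
    }

    /** A source written before the overload existed: row-level only. */
    static final class LegacySource implements TableSamplePushDown {
        @Override
        public boolean pushTableSample(double lowerBound, double upperBound,
                                       boolean withReplacement, long seed) {
            return true;
        }
    }

    /** A source that can skip whole blocks or splits. */
    static final class BlockAwareSource implements TableSamplePushDown {
        boolean systemSamplingRequested = false;

        @Override
        public boolean pushTableSample(double lowerBound, double upperBound,
                                       boolean withReplacement, long seed) {
            return true;
        }

        @Override
        public boolean pushTableSample(double lowerBound, double upperBound,
                                       boolean withReplacement, long seed,
                                       boolean isSystemSampling) {
            systemSamplingRequested = isSystemSampling;
            return true;
        }
    }

    /** Planner side: a SYSTEM sample that is not pushed down fails, with no fallback. */
    static void planSystemSample(TableSamplePushDown source, double fraction) {
        boolean pushed = source.pushTableSample(0.0, fraction, false, 0L, true);
        if (!pushed) {
            throw new UnsupportedOperationException(
                "TABLESAMPLE SYSTEM was not pushed down to the data source");
        }
    }
}
```

In this model, planSystemSample succeeds for BlockAwareSource and throws for LegacySource, which mirrors the "reject by default, no row-level fallback" rule without requiring existing sources to change.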

Why are the changes needed?

ANSI SQL defines two sampling methods: BERNOULLI (row-level) and SYSTEM (implementation-dependent, typically block/split-level). Block sampling is significantly faster for large tables since it avoids per-row evaluation, making it useful for approximate queries and data exploration. Many databases (PostgreSQL, Hive, Trino) support this distinction.
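To make the "avoids per-row evaluation" point concrete, here is a toy comparison of the two strategies. It is an illustration under simplifying assumptions (fixed-size in-memory blocks, one pseudo-random draw per decision) and does not reflect how any particular data source implements SYSTEM sampling; the class and method names are hypothetical.

```java
import java.util.Random;

/** Toy model: counts sampling decisions, not real I/O. Names are hypothetical. */
final class SamplingCostSketch {

    /** Row-level sampling: one random draw per row. Returns {rowsKept, draws}. */
    static long[] bernoulli(int blocks, int rowsPerBlock, double fraction, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        long draws = 0;
        for (int b = 0; b < blocks; b++) {
            for (int r = 0; r < rowsPerBlock; r++) {
                draws++;
                if (rng.nextDouble() < fraction) {
                    kept++;
                }
            }
        }
        return new long[] {kept, draws};
    }

    /** Block-level sampling: one random draw per block. Returns {rowsKept, draws}. */
    static long[] system(int blocks, int rowsPerBlock, double fraction, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        long draws = 0;
        for (int b = 0; b < blocks; b++) {
            draws++;
            if (rng.nextDouble() < fraction) {
                kept += rowsPerBlock; // the whole block is kept or skipped
            }
        }
        return new long[] {kept, draws};
    }
}
```

With 100 blocks of 1,000 rows, the row-level loop makes 100,000 decisions and the block-level loop makes 100. The cost of that saving is a coarser sample: rows are kept or dropped a whole block at a time.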

Does this PR introduce any user-facing change?

Yes. New SQL syntax TABLESAMPLE SYSTEM (x PERCENT) and TABLESAMPLE BERNOULLI (x PERCENT). BERNOULLI and SYSTEM are added as non-reserved keywords. Existing queries without these keywords behave identically to before.
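The rules for the new SYSTEM syntax can be summarized as a small validation routine. This is a sketch of the rules as stated in this description, not the parser code from the patch: the class, enum, and method names are hypothetical, the table name t in the comments is a placeholder, and the accepted range of 0 to 100 percent inclusive is an assumption.

```java
/** Sketch of the SYSTEM sampling rules. All names here are hypothetical. */
final class TableSampleSyntaxSketch {

    enum SampleUnit { PERCENT, ROWS, BUCKET, BYTES }

    // Illustrative statements (table name t is a placeholder):
    //   SELECT * FROM t TABLESAMPLE SYSTEM (10 PERCENT)                  accepted
    //   SELECT * FROM t TABLESAMPLE SYSTEM (100 ROWS)                    rejected
    //   SELECT * FROM t TABLESAMPLE SYSTEM (10 PERCENT) REPEATABLE (42)  rejected

    /** Validates a SYSTEM sample clause and returns the fraction in [0, 1]. */
    static double validateSystemSample(SampleUnit unit, double value, boolean repeatable) {
        if (unit != SampleUnit.PERCENT) {
            throw new IllegalArgumentException(
                "TABLESAMPLE SYSTEM supports only PERCENT, got " + unit);
        }
        if (repeatable) {
            throw new IllegalArgumentException(
                "TABLESAMPLE SYSTEM does not support REPEATABLE");
        }
        // Assumed bounds: 0 and 100 percent are both allowed.
        if (value < 0.0 || value > 100.0) {
            throw new IllegalArgumentException(
                "Sampling percentage must be between 0 and 100, got " + value);
        }
        return value / 100.0;
    }
}
```

A query with no qualifier is not covered by this routine; per the description it keeps the existing Bernoulli behavior.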

How was this patch tested?

9 new test cases in PlanParserSuite covering: basic parsing, case insensitivity, boundary fractions, unsupported methods (ROWS/BUCKET with SYSTEM), REPEATABLE rejection, fraction validation, identifier preservation, and subquery contexts.
Existing SQLQuerySuite tests pass.
Scalastyle passes with 0 errors/warnings.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (thoroughly refined, reviewed, and tested by human)


stanyao changed the title ~~[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pu…~~ [SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown Mar 24, 2026

stanyao force-pushed the spark-55978-tablesample-system branch from 34548e4 to 9aa99d8 Compare March 24, 2026 07:33

stanyao force-pushed the spark-55978-tablesample-system branch from 9aa99d8 to bd6f9cf Compare March 24, 2026 14:08

stanyao wants to merge 1 commit into apache:master from stanyao:spark-55978-tablesample-system
