[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown #54972

Open
stanyao wants to merge 1 commit into apache:master from stanyao:spark-55978-tablesample-system

stanyao commented Mar 24, 2026

What changes were proposed in this pull request?

This PR adds support for ANSI SQL TABLESAMPLE SYSTEM (block-level sampling) alongside the existing TABLESAMPLE BERNOULLI (row-level sampling). Key changes:

  • SQL grammar: Extended TABLESAMPLE to accept an optional SYSTEM or BERNOULLI qualifier before the sample method. Both are added as non-reserved keywords.
  • Logical plan: Introduced a SampleMethod sealed trait (Bernoulli/System) and added it to the Sample node. The default is Bernoulli for backward compatibility.
  • Parser: TABLESAMPLE SYSTEM supports only PERCENT sampling and does not support REPEATABLE. Other methods (ROWS, BUCKET, BYTES) are rejected with clear error messages.
  • DSv2 pushdown: TABLESAMPLE SYSTEM is pushed down to data sources via a new SupportsPushDownTableSample.pushTableSample() overload that takes an isSystemSampling flag. Sources that don't override the new method reject SYSTEM sampling by default.
  • Physical planning: SYSTEM samples that aren't pushed down to a DSv2 source raise an AnalysisException. There is no row-level fallback, since block sampling is data-source dependent.
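The pushdown contract in the last two bullets can be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: `TableSamplePushDown`, `LegacyScanBuilder`, `BlockAwareScanBuilder`, and `PushDownDemo` are hypothetical names, the parameter list of the five-argument overload is assumed from the existing four-argument `pushTableSample`, "reject" is modeled as returning `false`, and `UnsupportedOperationException` stands in for `AnalysisException`.

```java
// Hypothetical stand-in for SupportsPushDownTableSample; not the real Spark interface.
interface TableSamplePushDown {
    // Existing-style contract: returns true if the source takes over the sampling.
    boolean pushTableSample(double lowerBound, double upperBound,
                            boolean withReplacement, long seed);

    // Assumed shape of the new overload. By default it rejects SYSTEM sampling
    // and forwards row-level (BERNOULLI) requests to the existing method.
    default boolean pushTableSample(double lowerBound, double upperBound,
                                    boolean withReplacement, long seed,
                                    boolean isSystemSampling) {
        if (isSystemSampling) {
            return false;
        }
        return pushTableSample(lowerBound, upperBound, withReplacement, seed);
    }
}

// A source written before the overload existed: it implements only the old method.
final class LegacyScanBuilder implements TableSamplePushDown {
    @Override
    public boolean pushTableSample(double lowerBound, double upperBound,
                                   boolean withReplacement, long seed) {
        return true;
    }
}

// A source that can skip whole blocks and therefore opts in to SYSTEM sampling.
final class BlockAwareScanBuilder implements TableSamplePushDown {
    double pushedFraction = -1.0;
    boolean systemSampling = false;

    @Override
    public boolean pushTableSample(double lowerBound, double upperBound,
                                   boolean withReplacement, long seed) {
        return pushTableSample(lowerBound, upperBound, withReplacement, seed, false);
    }

    @Override
    public boolean pushTableSample(double lowerBound, double upperBound,
                                   boolean withReplacement, long seed,
                                   boolean isSystemSampling) {
        this.pushedFraction = upperBound - lowerBound;
        this.systemSampling = isSystemSampling;
        return true;
    }
}

final class PushDownDemo {
    // Mirrors the planning rule above: an unpushed SYSTEM sample is an error,
    // while an unpushed BERNOULLI sample falls back to row-level sampling in the engine.
    static String plan(TableSamplePushDown source, double fraction, boolean isSystem) {
        boolean pushed = source.pushTableSample(0.0, fraction, false, 0L, isSystem);
        if (pushed) {
            return "pushed";
        }
        if (isSystem) {
            throw new UnsupportedOperationException(
                "TABLESAMPLE SYSTEM is not supported by this data source");
        }
        return "row-level fallback";
    }
}
```

Under these assumptions, `PushDownDemo.plan(new LegacyScanBuilder(), 0.1, true)` throws, while the same call with `new BlockAwareScanBuilder()` returns `"pushed"`. Using a default method keeps existing connectors source-compatible, at the cost of making SYSTEM support opt-in.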

Why are the changes needed?

ANSI SQL defines two sampling methods: BERNOULLI (row-level) and SYSTEM (implementation-dependent, typically block- or split-level). Block sampling can be much faster on large tables because it skips entire blocks instead of evaluating every row, which makes it useful for approximate queries and data exploration. Many databases (PostgreSQL, Hive, Trino) support this distinction.
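The difference between the two methods can be illustrated with a toy model in which a table is a list of blocks and each block is a list of rows. This is only a sketch of the general idea, not how Spark or any particular data source implements sampling; `SamplingSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SamplingSketch {
    // Row-level (BERNOULLI-style): one random draw per row, so every row is visited.
    static List<Integer> bernoulli(List<List<Integer>> blocks, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (List<Integer> block : blocks) {
            for (Integer row : block) {
                if (rng.nextDouble() < fraction) {
                    out.add(row);
                }
            }
        }
        return out;
    }

    // Block-level (SYSTEM-style): one random draw per block. A rejected block is
    // skipped entirely; an accepted block contributes all of its rows.
    static List<Integer> system(List<List<Integer>> blocks, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (List<Integer> block : blocks) {
            if (rng.nextDouble() < fraction) {
                out.addAll(block);
            }
        }
        return out;
    }

    // Builds numBlocks blocks, each holding blockSize consecutive integers.
    static List<List<Integer>> makeBlocks(int numBlocks, int blockSize) {
        List<List<Integer>> blocks = new ArrayList<>();
        int next = 0;
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> block = new ArrayList<>();
            for (int r = 0; r < blockSize; r++) {
                block.add(next++);
            }
            blocks.add(block);
        }
        return blocks;
    }
}
```

In this model the block-level sampler makes one decision per block rather than one per row, which is where the speedup comes from. The trade-off is that rows arrive in whole-block clusters, so the result size is always a multiple of the block size and the sample is less uniform than a row-level one.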

Does this PR introduce any user-facing change?

Yes. This PR adds the new SQL syntax TABLESAMPLE SYSTEM (x PERCENT) and TABLESAMPLE BERNOULLI (x PERCENT). BERNOULLI and SYSTEM are added as non-reserved keywords. Existing queries that do not use these keywords behave exactly as before.
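The rules for which sample clauses are accepted can be sketched as a small validation function. This is a hypothetical model of the parser checks described in this PR, not the actual parser code: `SampleClauseCheck`, its enums, and the error messages are illustrative, and the 0 to 100 percent range is an assumption based on the existing requirement that a sampling fraction lie in [0, 1].

```java
final class SampleClauseCheck {
    enum Method { BERNOULLI, SYSTEM }
    enum Unit { PERCENT, ROWS, BUCKET, BYTES }

    // Validates a sample clause and converts a percentage into a fraction.
    // Only the PERCENT path is modeled; other units are out of scope for this sketch.
    static double toFraction(Method method, Unit unit, double value, boolean repeatable) {
        if (method == Method.SYSTEM) {
            if (unit != Unit.PERCENT) {
                throw new IllegalArgumentException(
                    "TABLESAMPLE SYSTEM supports only PERCENT, got " + unit);
            }
            if (repeatable) {
                throw new IllegalArgumentException(
                    "TABLESAMPLE SYSTEM does not support REPEATABLE");
            }
        }
        if (unit != Unit.PERCENT) {
            throw new IllegalArgumentException("this sketch handles only PERCENT");
        }
        if (value < 0.0 || value > 100.0) {
            throw new IllegalArgumentException("sampling percentage must be in [0, 100]");
        }
        return value / 100.0;
    }
}
```

Under these assumptions, `TABLESAMPLE SYSTEM (10 PERCENT)` corresponds to `toFraction(Method.SYSTEM, Unit.PERCENT, 10, false)`, whereas a SYSTEM clause that uses ROWS or adds REPEATABLE is rejected before any plan is built.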

How was this patch tested?

  • 9 new test cases in PlanParserSuite covering: basic parsing, case insensitivity, boundary fractions, unsupported methods (ROWS/BUCKET with SYSTEM), REPEATABLE rejection, fraction validation, identifier preservation, and subquery contexts.
  • Existing SQLQuerySuite tests pass.
  • Scalastyle passes with 0 errors/warnings.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (thoroughly refined, reviewed, and tested by human)

stanyao changed the title from "[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pu…" to "[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown" on Mar 24, 2026
stanyao force-pushed the spark-55978-tablesample-system branch from 34548e4 to 9aa99d8 on March 24, 2026 07:33
stanyao force-pushed the spark-55978-tablesample-system branch from 9aa99d8 to bd6f9cf on March 24, 2026 14:08