
Add BigQuery support #458

Open: gajop wants to merge 2 commits into posit-dev:main from gajop:feat/add-bigquery

gajop commented May 21, 2026

Hello!
Since this is my first PR here, I wanted to give a short introduction.
My name is Gajo, and we use BigQuery a lot at work. This looks like an interesting project and I would like to give it a go, so I decided to try adding BigQuery support.

All the code in this PR was written by Claude, but I reviewed it and got it to a reasonable state.
I purposefully didn't add it to the default feature set, as I'm not sure which direction you would like to take, but I think it would be nice if we could add it later. Maybe even in this PR?

As I was writing this, I hit two problems with the current implementation of Histogram & Percentile, and tbh I'm not familiar enough with the library to tell whether my changes are the right way of handling them. Learning the finer details felt a bit daunting and somewhat out of scope for this PR. I did manage to satisfy the tests at least, but this is something I'd take a deeper look at.

On testing, I added a couple of your usual unit tests, and I also added some integration tests that create real tables and demonstrate that this works against real BigQuery. Since they require real infrastructure, I've disabled them by default. I did confirm that they pass in one of my projects.

As a future PR, I would like to add some examples for DuckDB and BigQuery, something users can easily run. I would also like to dig a bit deeper into the security design, in case ggsql is used behind a library where user input might not be trusted (I'm not yet sure how you would support parameterized queries here). I'm also personally not a fan of large files, and if it's OK with you, I'd at least split the bigquery reader into a few files, but I'm leaving this up to you.

Regarding this PR, if you would like to first discuss the approach or maybe start with something smaller, please let me know. I can understand the potential maintenance burden of adding new features from new contributors - the drive-by PR is certainly a thing these days.

Below is the AI-generated PR message.


Adds a native BigQuery reader behind a new off-by-default bigquery feature flag (enable with --features bigquery, or all-readers).

  • BigQueryReader (src/reader/bigquery.rs) — authenticates via Application
    Default Credentials using the gcloud-bigquery crate; paginates results and
    converts them to Arrow.
  • Connection string bigquery://[PROJECT[/DATASET]][?location=REGION] — the
    project is optional and resolves from ADC / GOOGLE_CLOUD_PROJECT when
    omitted; an optional dataset sets the default dataset; location defaults to
    US.
  • BigQueryDialect — backtick quoting, BigQuery type names
    (INT64/FLOAT64/STRING/DATETIME), GREATEST/LEAST, GENERATE_ARRAY
    series, APPROX_QUANTILES quantiles, and CREATE OR REPLACE TEMP TABLE.
  • VS Code / Positron connection picker and the Jupyter kernel both accept
    bigquery://; cli.qmd documents the scheme and ADC auth.
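
To make the connection string shape concrete, here is a minimal sketch of how `bigquery://[PROJECT[/DATASET]][?location=REGION]` could be split into its parts. It uses only the standard library. The `BigQueryUri` type and `parse_bigquery_uri` function are hypothetical names for illustration and are not taken from `src/reader/bigquery.rs`; the real reader may parse the string differently.

```rust
/// Parsed pieces of a `bigquery://` connection string (hypothetical type).
#[derive(Debug, PartialEq)]
struct BigQueryUri {
    project: Option<String>,
    dataset: Option<String>,
    location: String,
}

/// Parse `bigquery://[PROJECT[/DATASET]][?location=REGION]` (hypothetical helper).
fn parse_bigquery_uri(uri: &str) -> Result<BigQueryUri, String> {
    let rest = uri
        .strip_prefix("bigquery://")
        .ok_or_else(|| format!("not a bigquery:// URI: {uri}"))?;

    // Split off the optional query string.
    let (path, query) = match rest.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (rest, None),
    };

    // PROJECT and DATASET are both optional; empty segments count as absent.
    let mut segments = path.split('/').filter(|s| !s.is_empty());
    let project = segments.next().map(str::to_string);
    let dataset = segments.next().map(str::to_string);
    if segments.next().is_some() {
        return Err(format!("too many path segments in: {uri}"));
    }

    // Location defaults to US, as the description above states.
    let location = query
        .into_iter()
        .flat_map(|q| q.split('&'))
        .filter_map(|pair| pair.split_once('='))
        .find(|(key, _)| *key == "location")
        .map(|(_, value)| value.to_string())
        .unwrap_or_else(|| "US".to_string());

    Ok(BigQueryUri { project, dataset, location })
}
```

With this sketch, `bigquery://` yields no project, no dataset, and location `US`, leaving the project to be resolved from ADC / GOOGLE_CLOUD_PROJECT by whatever code consumes the result.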

SQL portability fixes

Two SQL-generation issues surfaced under BigQuery's stricter semantics. Both
fixes are output-identical on DuckDB/SQLite:

  • Histogram: bin_end and density are now derived in the outer SELECT
    from the already-grouped bin/count columns, instead of inside the GROUP BY
    query. BigQuery rejects a GROUP BY query whose SELECT list references an input
    column outside the grouping key.
  • Percentile: sql_percentile / sql_quantile_inline emit the
    APPROX_QUANTILES aggregate instead of a correlated scalar subquery, which
    BigQuery rejects for grouped boxplot / density.
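
The two fixes above can be sketched as SQL string builders. This is an illustration of the query shapes only, not the code in this PR: `histogram_sql` and `approx_quantile_sql` are hypothetical names, the `binned` alias and the binning expression are assumptions, and the 100-bucket `APPROX_QUANTILES(...)[OFFSET(p)]` form is one common BigQuery idiom that the PR may or may not use.

```rust
/// Build a two-level histogram query (hypothetical helper, not ggsql's API).
fn histogram_sql(table: &str, column: &str, origin: f64, width: f64) -> String {
    // Inner query: SELECT contains only the grouping key and an aggregate.
    let inner = format!(
        "SELECT {origin:?} + FLOOR(({column} - {origin:?}) / {width:?}) * {width:?} AS bin, \
         COUNT(*) AS count FROM {table} GROUP BY bin"
    );
    // Outer query: bin_end and density are derived from the grouped columns,
    // so no ungrouped input column is referenced next to GROUP BY.
    format!(
        "SELECT bin, bin + {width:?} AS bin_end, count, \
         count / (SUM(count) OVER () * {width:?}) AS density FROM ({inner}) AS binned"
    )
}

/// Build a percentile expression as a plain aggregate (hypothetical helper).
/// `percent` is expected to be in 0..=100.
fn approx_quantile_sql(column: &str, percent: u32) -> String {
    debug_assert!(percent <= 100, "percent must be in 0..=100");
    // APPROX_QUANTILES(x, 100) returns 101 boundaries; OFFSET(p) picks the p-th.
    format!("APPROX_QUANTILES({column}, 100)[OFFSET({percent})]")
}
```

Because the quantile expression is an ordinary aggregate, it can sit directly in a grouped SELECT list, which is what avoids the correlated scalar subquery.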

Testing

  • Live integration tests (bq_integration_*) — #[ignore] by default, opt in
    via GGSQL_BIGQUERY_TEST_URI. They create a uuid-named dataset (auto-dropped)
    and cover catalog/schema/table/column introspection, execute_sql,
    point/boxplot/grouped-boxplot/histogram rendering, type conversion, result
    pagination, and query-error propagation.
  • Dialect unit tests for quoting, series generation, and quantile SQL.
  • Full workspace test suite green; clippy clean with --features bigquery.
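
For readers unfamiliar with the opt-in pattern, a minimal sketch of an `#[ignore]` test gated on `GGSQL_BIGQUERY_TEST_URI` might look like the following. The helper names and the test body are hypothetical; only the environment variable name and the `bq_integration_` prefix come from this PR.

```rust
use std::env;

/// Name of the opt-in variable, taken from the PR description.
const TEST_URI_VAR: &str = "GGSQL_BIGQUERY_TEST_URI";

/// Treat an unset or blank value as "not configured" (hypothetical helper).
fn configured_uri(raw: Option<String>) -> Option<String> {
    raw.map(|v| v.trim().to_string()).filter(|v| !v.is_empty())
}

/// Read the opt-in URI from the environment (hypothetical helper).
fn bigquery_test_uri() -> Option<String> {
    configured_uri(env::var(TEST_URI_VAR).ok())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    #[ignore = "needs a real BigQuery project; set GGSQL_BIGQUERY_TEST_URI"]
    fn bq_integration_example() {
        // Return early when the variable is absent, even under `--ignored`.
        let Some(uri) = bigquery_test_uri() else {
            eprintln!("{} not set; skipping", TEST_URI_VAR);
            return;
        };
        assert!(uri.starts_with("bigquery://"));
    }
}
```

Ignored tests run only when requested, for example with `cargo test --features bigquery -- --ignored`, so a plain `cargo test` never touches real infrastructure.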

Follow-ups (not in this PR)

  • CI does not yet compile-check the bigquery feature; it should gain steps
    mirroring the existing ADBC ones, or bigquery.rs will bitrot.
  • Release binaries (ggsql, ggsql-jupyter) build with default features and so
    do not include bigquery — needs a decision on whether to ship it.
  • src/CLAUDE.md / ggsql-jupyter/CLAUDE.md reader and feature-flag tables need
    updating to list bigquery (and adbc).

gajop and others added 2 commits May 21, 2026 07:58
Add a native BigQuery reader behind a new off-by-default 'bigquery'
feature flag. BigQueryReader authenticates via Application Default
Credentials and accepts bigquery://[PROJECT[/DATASET]][?location=REGION]
connection strings; the project resolves from ADC / GOOGLE_CLOUD_PROJECT
when omitted. VS Code / Positron and the Jupyter kernel recognise the
same scheme.

Two SQL-generation fixes were needed for BigQuery's strict semantics;
both are output-identical on DuckDB/SQLite:

- Histogram: bin_end and density are now derived in the outer SELECT
  from the already-grouped bin/count columns, instead of inside the
  GROUP BY query. BigQuery rejects a GROUP BY query whose SELECT list
  references an input column outside the grouping key.

- BigQuery percentile: sql_percentile / sql_quantile_inline now emit
  the APPROX_QUANTILES aggregate instead of a correlated scalar
  subquery, which BigQuery rejects for grouped boxplot / density.

Integration tests (bq_integration_*) create a uuid-named dataset, run
the public API against it, and drop it on completion; they are
#[ignore] by default and opt in via GGSQL_BIGQUERY_TEST_URI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The fixture now also creates a `types` table (DATE/TIMESTAMP/DATETIME/
TIME/NUMERIC, plus an all-NULL row) and a 25_000-row `wide` table.

New #[ignore] integration tests:
- type conversion — asserts each BigQuery type maps to the expected
  Arrow dtype.
- pagination — a 25_000-row scan must stitch three result pages
  (PAGE_SIZE is 10_000).
- query error — a failing query surfaces as Err, not a panic.
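
The page arithmetic behind the pagination test can be sketched as below. `PAGE_SIZE` is the value quoted above; `page_count` and `collect_pages` are hypothetical names, and the offset-based `fetch` closure is a stand-in for illustration only, since the real reader presumably follows BigQuery's page tokens rather than row offsets.

```rust
/// Page size quoted in the commit message.
const PAGE_SIZE: usize = 10_000;

/// Number of pages needed to return `total_rows` rows.
/// Panics if `page_size` is zero.
fn page_count(total_rows: usize, page_size: usize) -> usize {
    total_rows.div_ceil(page_size)
}

/// Stitch pages from a paged source into one Vec (hypothetical stand-in
/// for the reader's loop). Returns the rows and the number of non-empty pages.
fn collect_pages<T>(mut fetch: impl FnMut(usize) -> Vec<T>) -> (Vec<T>, usize) {
    let mut rows = Vec::new();
    let mut pages = 0;
    loop {
        // The offset is the number of rows fetched so far.
        let page = fetch(rows.len());
        if page.is_empty() {
            break;
        }
        pages += 1;
        rows.extend(page);
    }
    (rows, pages)
}
```

A 25_000-row source served in pages of 10_000 arrives as 10_000 + 10_000 + 5_000 rows, which is why the test expects three pages.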

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>