Add BigQuery support #458
Open
gajop wants to merge 2 commits into
Add a native BigQuery reader behind a new off-by-default 'bigquery' feature flag.

BigQueryReader authenticates via Application Default Credentials and accepts bigquery://[PROJECT[/DATASET]][?location=REGION] connection strings; the project resolves from ADC / GOOGLE_CLOUD_PROJECT when omitted. VS Code / Positron and the Jupyter kernel recognise the same scheme.

Two SQL-generation fixes were needed for BigQuery's strict semantics; both are output-identical on DuckDB/SQLite:

- Histogram: bin_end and density are now derived in the outer SELECT from the already-grouped bin/count columns, instead of inside the GROUP BY query. BigQuery rejects a GROUP BY query whose SELECT list references an input column outside the grouping key.
- BigQuery percentile: sql_percentile / sql_quantile_inline now emit the APPROX_QUANTILES aggregate instead of a correlated scalar subquery, which BigQuery rejects for grouped boxplot / density.

Integration tests (bq_integration_*) create a uuid-named dataset, run the public API against it, and drop it on completion; they are #[ignore] by default and opt in via GGSQL_BIGQUERY_TEST_URI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
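To make the connection-string grammar in this commit message concrete, here is a minimal sketch of how `bigquery://[PROJECT[/DATASET]][?location=REGION]` could be parsed. The `BigQueryTarget` struct and `parse_bigquery_uri` function are hypothetical names chosen for illustration, not the PR's actual code; the `US` default is the one stated in the PR description, and rejecting unknown query parameters is an assumption.

```rust
/// Parsed form of a `bigquery://[PROJECT[/DATASET]][?location=REGION]` URI.
/// Hypothetical type for illustration; not the struct used in the PR.
#[derive(Debug, PartialEq)]
pub struct BigQueryTarget {
    /// `None` means "resolve from ADC / GOOGLE_CLOUD_PROJECT later".
    pub project: Option<String>,
    /// Optional default dataset.
    pub dataset: Option<String>,
    /// Region; falls back to "US" when the query string omits it.
    pub location: String,
}

pub fn parse_bigquery_uri(uri: &str) -> Result<BigQueryTarget, String> {
    let rest = uri
        .strip_prefix("bigquery://")
        .ok_or_else(|| format!("not a bigquery:// URI: {uri}"))?;

    // Split off the optional `?location=REGION` part.
    let (path, query) = match rest.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (rest, None),
    };

    // PROJECT[/DATASET]: both pieces are optional, but the grammar only
    // allows a dataset when a project is present.
    let mut parts = path.splitn(2, '/');
    let project = parts.next().filter(|s| !s.is_empty()).map(str::to_owned);
    let dataset = parts.next().filter(|s| !s.is_empty()).map(str::to_owned);
    if project.is_none() && dataset.is_some() {
        return Err("a dataset requires a project".to_owned());
    }

    let mut location = "US".to_owned();
    if let Some(q) = query {
        for pair in q.split('&').filter(|p| !p.is_empty()) {
            match pair.split_once('=') {
                Some(("location", v)) if !v.is_empty() => location = v.to_owned(),
                // Assumption: anything other than `location` is an error.
                _ => return Err(format!("unsupported query parameter: {pair}")),
            }
        }
    }

    Ok(BigQueryTarget { project, dataset, location })
}
```

Under these assumptions, `bigquery://` alone yields no project, no dataset, and location `US`, while `bigquery://my-proj/sales?location=EU` fills in all three fields.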
The fixture now also creates a `types` table (DATE/TIMESTAMP/DATETIME/TIME/NUMERIC, plus an all-NULL row) and a 25_000-row `wide` table.

New #[ignore] integration tests:

- type conversion: asserts each BigQuery type maps to the expected Arrow dtype.
- pagination: a 25_000-row scan must stitch three result pages (PAGE_SIZE is 10_000).
- query error: a failing query surfaces as Err, not a panic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
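The pagination test's arithmetic (25_000 rows at a PAGE_SIZE of 10_000 gives three pages) can be checked with a small sketch of a page-stitching loop. `fetch_page` and `read_all` are hypothetical stand-ins that page over an in-memory slice; the real reader calls the BigQuery API instead.

```rust
/// Page size quoted in the commit message.
const PAGE_SIZE: usize = 10_000;

/// Stand-in for one paged API call: returns up to `PAGE_SIZE` rows starting
/// at `offset`, plus the offset of the next page if more rows remain.
/// Hypothetical helper; not part of the PR.
fn fetch_page(table: &[i64], offset: usize) -> (&[i64], Option<usize>) {
    let end = (offset + PAGE_SIZE).min(table.len());
    let next = if end < table.len() { Some(end) } else { None };
    (&table[offset..end], next)
}

/// Follows the page cursor until it is exhausted, returning the stitched
/// rows and the number of pages that were fetched.
pub fn read_all(table: &[i64]) -> (Vec<i64>, usize) {
    let mut rows = Vec::with_capacity(table.len());
    let mut pages = 0;
    let mut cursor = Some(0);
    while let Some(offset) = cursor {
        let (page, next) = fetch_page(table, offset);
        rows.extend_from_slice(page);
        pages += 1;
        cursor = next;
    }
    (rows, pages)
}
```

With a 25_000-element input the loop fetches pages of 10_000, 10_000, and 5_000 rows, which is why the integration test expects three pages.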
Hello!
Since this is my first PR here, I wanted to give a short introduction.
My name is Gajo, and at my work we use BigQuery a lot. This looks like an interesting project, and I would like to give it a go, so I decided to try adding BigQuery support.
All the code in this PR was written by Claude, but I reviewed it and got it to a reasonable state.
I purposefully didn't add it to the default feature set, as I'm not sure which direction you would like to take, but I think it would be nice if we could add it later. Maybe even in this PR?
As I was writing this, I hit two problems with the current implementation of Histogram & Percentile, and tbh I'm not familiar enough with the library to tell if this is the right way of handling it. Learning about the finer details felt a bit daunting and somewhat out of scope for what I wanted to do in this PR. I did manage to satisfy the tests at least... but anyway, this is something that I'd take a deeper look at.
On testing, I added a couple of your usual unit tests, but I also added some integration tests that create real tables and demonstrate that this works with real BigQuery. Since this requires real infrastructure, I've disabled them by default. I did manage to confirm that they pass in one of my projects.
As a future PR, I would like to add some examples for DuckDB and BigQuery, something users can easily run. I would also like to dig a bit deeper into the security design of this, in case ggsql is used behind a library where user input might not be trusted (I'm not yet sure how you would support parametrized queries here). I'm also personally not a fan of large files, and if it's OK with you, I'd at least split the bigquery reader into a few files, but I'm leaving this up to you.
Regarding this PR, if you would like to first discuss the approach or maybe start with something smaller, please let me know. I can understand the potential maintenance burden of adding new features from new contributors - the drive-by PR is certainly a thing these days.
Below is the AI-generated PR message.
- Native BigQuery reader behind a new off-by-default `bigquery` feature flag (enable with `--features bigquery`, or `all-readers`).
- `BigQueryReader` (`src/reader/bigquery.rs`): authenticates via Application Default Credentials using the `gcloud-bigquery` crate; paginates results and converts them to Arrow.
- Connection strings take the form `bigquery://[PROJECT[/DATASET]][?location=REGION]`: the project is optional and resolves from ADC / `GOOGLE_CLOUD_PROJECT` when omitted; an optional dataset sets the default dataset; `location` defaults to `US`.
- `BigQueryDialect`: backtick quoting, BigQuery type names (`INT64`/`FLOAT64`/`STRING`/`DATETIME`), `GREATEST`/`LEAST`, `GENERATE_ARRAY` series, `APPROX_QUANTILES` quantiles, and `CREATE OR REPLACE TEMP TABLE`.
- VS Code / Positron and the Jupyter kernel recognise `bigquery://`; `cli.qmd` documents the scheme and ADC auth.

SQL portability fixes

Two SQL-generation issues surfaced under BigQuery's stricter semantics. Both fixes are output-identical on DuckDB/SQLite:

- Histogram: `bin_end` and `density` are now derived in the outer SELECT from the already-grouped `bin`/`count` columns, instead of inside the GROUP BY query. BigQuery rejects a GROUP BY query whose SELECT list references an input column outside the grouping key.
- BigQuery percentile: `sql_percentile` / `sql_quantile_inline` emit the `APPROX_QUANTILES` aggregate instead of a correlated scalar subquery, which BigQuery rejects for grouped boxplot / density.
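The histogram fix can be pictured as a two-level query: the inner query only groups and counts, and the outer query derives everything else from the grouped columns. The sketch below builds such a query in Rust. `histogram_sql` is a hypothetical helper, the `FLOOR`-based binning expression and the density formula (each bin's share of all rows) are assumptions for illustration, and identifiers are interpolated unquoted for brevity; this is not the SQL ggsql actually emits.

```rust
/// Builds an illustrative two-level histogram query.
/// Hypothetical helper: the binning expression and density formula are
/// assumptions, and `table`/`column` are inserted without quoting.
pub fn histogram_sql(table: &str, column: &str, width: f64) -> String {
    format!(
        // Outer SELECT: derives bin_end and density purely from the
        // already-grouped `bin` and `count` columns.
        "SELECT bin, bin + {width} AS bin_end, count, \
         count * 1.0 / SUM(count) OVER () AS density \
         FROM (\
         SELECT FLOOR({column} / {width}) * {width} AS bin, COUNT(*) AS count \
         FROM {table} \
         GROUP BY 1\
         ) AS binned"
        // Inner SELECT: only the grouping key and an aggregate, so no
        // ungrouped input column appears in its SELECT list.
    )
}
```

Keeping the inner SELECT list down to the grouping key and `COUNT(*)` is what avoids the BigQuery error described above; the derived columns move outward without changing the result.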
Testing

- Integration tests (`bq_integration_*`): `#[ignore]` by default, opt in via `GGSQL_BIGQUERY_TEST_URI`. They create a uuid-named dataset (auto-dropped) and cover catalog/schema/table/column introspection, `execute_sql`, point/boxplot/grouped-boxplot/histogram rendering, type conversion, result pagination, and query-error propagation.
- `clippy` clean with `--features bigquery`.

Follow-ups (not in this PR)

- CI does not yet build the `bigquery` feature; it should gain steps mirroring the existing ADBC ones, or `bigquery.rs` will bitrot.
- The shipped artifacts (`ggsql`, `ggsql-jupyter`) build with default features and so do not include `bigquery`; this needs a decision on whether to ship it.
- `src/CLAUDE.md` / `ggsql-jupyter/CLAUDE.md` reader and feature-flag tables need updating to list `bigquery` (and `adbc`).