feat(bigquery,cubestore): Parquet pre-aggregation export with WIF-native URLs #10499

Open
KrishnaRMaddikara wants to merge 2 commits into cube-js:master from KrishnaRMaddikara:feat/bigquery-parquet-wif-preagg

Conversation

@KrishnaRMaddikara

Problem

Three related issues affecting BigQuery pre-aggregation pipelines on GKE:

1. CSV.gz export is slow and expensive
BigQuery driver exports pre-aggregation data as CSV.gz. For large tables
this produces gigabytes of intermediate files in GCS. Parquet is 3-5x
smaller and is already the native CubeStore internal chunk format.

2. getSignedUrl() is broken on GKE Workload Identity (WIF)
getSignedUrl() requires service account key bytes to cryptographically
sign URLs. WIF tokens (short-lived OAuth2 tokens from the GKE metadata
server) cannot sign URLs. The pre-aggregation pipeline silently fails:
BigQuery exports successfully but CubeStore receives 403 on every URL.
Affects all GKE users with Workload Identity configured.
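The workaround described in this PR can be sketched as follows (hypothetical helper name; the actual driver code differs): instead of asking GCS to cryptographically sign an HTTPS URL, return the plain gs:// URI and let the consumer authenticate with its own IAM token.

```typescript
// Hypothetical sketch: build a plain IAM-authenticated gs:// URI instead of
// a signed HTTPS URL. Signing requires service-account key bytes, which a
// short-lived WIF/metadata-server token does not carry — so getSignedUrl()
// fails under Workload Identity, while a bare URI always works.
function gsUri(bucket: string, objectName: string): string {
  return `gs://${bucket}/${objectName}`;
}

// The consumer (CubeStore) resolves the URI with its own OAuth2 access
// token, so no URL signature is ever needed.
```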

3. CubeStore cannot import Parquet (issue #3051)
CubeStore's CREATE TABLE ... WITH (input_format) only accepted csv
and csv_no_header. CubeStore already uses Parquet/Arrow internally for
.chunk.parquet files but the external import path lacked Parquet support.

Solution

packages/cubejs-bigquery-driver/src/BigQueryDriver.ts

  • Export format: CSV + gzip: true → PARQUET
  • URL generation: getSignedUrl() → gs://bucket/object (plain IAM-authenticated URI)
  • Return key: csvFile → parquetFile
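The export-side change can be sketched against the shape of BigQuery's REST extract-job configuration (illustrative only; the driver goes through the Node.js client wrapper, and the helper name is hypothetical):

```typescript
// Sketch of the BigQuery extract-job configuration change, following the
// documented REST JobConfigurationExtract shape.
interface ExtractConfig {
  sourceTable: { projectId: string; datasetId: string; tableId: string };
  destinationUris: string[];
  destinationFormat: 'CSV' | 'PARQUET';
  compression?: 'GZIP' | 'NONE';
}

function parquetExtract(
  project: string, dataset: string, table: string, bucket: string,
): ExtractConfig {
  return {
    sourceTable: { projectId: project, datasetId: dataset, tableId: table },
    // The wildcard URI lets BigQuery shard large tables into multiple files.
    destinationUris: [`gs://${bucket}/${table}-*.parquet`],
    destinationFormat: 'PARQUET', // previously 'CSV' with compression: 'GZIP'
  };
}
```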

packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts

  • Add importParquetFile() method
  • Add parquetFile branch in uploadTableWithIndexes()
  • Sends: CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'
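The statement importParquetFile() sends can be sketched as (hypothetical helper; column quoting and index clauses simplified):

```typescript
// Sketch of the SQL importParquetFile() would issue to CubeStore.
// Columns arrive as already-formatted "name type" pairs.
function createTableWithParquet(
  table: string, columns: string[], location: string,
): string {
  return (
    `CREATE TABLE ${table} (${columns.join(', ')}) ` +
    `WITH (input_format = 'parquet') LOCATION '${location}'`
  );
}
```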

rust/cubestore/cubestore/src/metastore/mod.rs

  • Add ImportFormat::Parquet variant to enum

rust/cubestore/cubestore/src/sql/mod.rs

  • Parse input_format = 'parquet' in WITH clause

rust/cubestore/cubestore/src/import/mod.rs

  • Dispatch ImportFormat::Parquet to do_import_parquet()
  • Add do_import_parquet() using DataFusion ParquetRecordBatchReaderBuilder
    (already compiled into CubeStore — no new dependencies)
  • Add arrow_array_value_to_string() helper for Arrow → TableValue conversion
  • Fix resolve_location() to download gs:// URLs via GCS JSON API with WIF token
  • Fix estimate_location_row_count() to skip fs::metadata() for remote URLs
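The gs:// download step in resolve_location() can be sketched via the GCS JSON API media-download endpoint (the Rust code differs; the endpoint shape is the documented storage.googleapis.com objects.get with alt=media, fetched with `Authorization: Bearer <WIF token>`):

```typescript
// Sketch: translate a gs://bucket/path URI into the GCS JSON API media
// download URL. The object name must be URL-encoded as a single path
// segment, so slashes inside it become %2F.
function gcsMediaUrl(gsLocation: string): string {
  const match = /^gs:\/\/([^/]+)\/(.+)$/.exec(gsLocation);
  if (!match) throw new Error(`Not a gs:// URI: ${gsLocation}`);
  const [, bucket, object] = match;
  return `https://storage.googleapis.com/storage/v1/b/${bucket}` +
    `/o/${encodeURIComponent(object)}?alt=media`;
}
```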

Backward Compatibility

Postgres/Snowflake/Redshift pre-aggregations are unaffected. CSV import
remains supported: csvFile becomes optional in TableCSVData and the
existing CSV path is unchanged.

Testing

Tested on GKE with Workload Identity Federation:

  • BigQuery exports pre-aggregation as Parquet to GCS bucket
  • CubeStore receives gs:// URI, downloads via GCS API using WIF token
  • CREATE TABLE ... WITH (input_format = 'parquet') imports successfully
  • Pre-aggregation queries resolve at sub-millisecond latency

Closes #3051
Closes #9837

…ive URLs

Problem 1 — CSV.gz export is slow and expensive:
  BigQuery driver exports pre-aggregation data as CSV.gz.
  For large tables this means gigabytes of intermediate files.
  Parquet is 3-5x smaller and is the native CubeStore internal format.

Problem 2 — getSignedUrl() requires SA key bytes (broken on GKE WIF):
  getSignedUrl() requires service account key bytes to sign URLs.
  WIF tokens from the metadata server cannot sign URLs.
  Pre-agg pipeline silently fails: BQ exports fine, CubeStore gets 403.

Problem 3 — CubeStore cannot import Parquet (issue cube-js#3051):
  CubeStore only accepted CSV in its external import path.
  CubeStore already uses parquet/arrow internally for .chunk.parquet
  but CREATE TABLE ... WITH (input_format) lacked Parquet support.

Fix:
  packages/cubejs-bigquery-driver/src/BigQueryDriver.ts:
    - Export format: CSV.gz -> PARQUET
    - URL generation: getSignedUrl() -> gs://bucket/object (IAM-authenticated)
    - Return key: csvFile -> parquetFile

  packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts:
    - Add importParquetFile() method
    - Add parquetFile branch in uploadTableWithIndexes()
    - Sends: CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'

  rust/cubestore/cubestore/src/metastore/mod.rs:
    - Add ImportFormat::Parquet variant to enum

  rust/cubestore/cubestore/src/sql/mod.rs:
    - Parse input_format = 'parquet' in WITH clause

  rust/cubestore/cubestore/src/import/mod.rs:
    - Dispatch ImportFormat::Parquet to do_import_parquet()
    - Add do_import_parquet() using DataFusion ParquetRecordBatchReaderBuilder
    - Add arrow_array_value_to_string() helper for Arrow to TableValue conversion
    - Fix resolve_location() to handle gs:// URLs via GCS API with WIF token
    - Fix estimate_location_row_count() to skip fs::metadata() for remote URLs

Works with Workload Identity when combined with the GCS WIF fix in gcs.rs.
Backward compatible: Postgres/Snowflake/Redshift pre-aggs unaffected.

Closes cube-js#3051
Closes cube-js#9837
@KrishnaRMaddikara KrishnaRMaddikara requested review from a team as code owners March 16, 2026 04:17
@github-actions github-actions bot added labels: driver:bigquery (Issues related to the BigQuery driver), cube store (Issues relating to Cube Store), rust (Pull requests that update Rust code), javascript (Pull requests that update Javascript code), data source driver, pr:community (Contribution from Cube.js community members) — Mar 16, 2026
…uet path

- Make csvFile optional in TableCSVData — BigQuery now returns parquetFile only
- Add parquetFile?: string[] to TableCSVData interface
- Update isDownloadTableCSVData() to recognise parquetFile
- Delete stale export files before BQ extract to prevent prefix collision
- Remove exportBucketCsvEscapeSymbol from Parquet return (CSV-specific field)
@KrishnaRMaddikara
Author

Addressed a few other fixes in 2fe8fa6:

  1. ✅ Type safety: added parquetFile?: string[] to TableCSVData in
    driver.interface.ts, made csvFile optional, updated
    isDownloadTableCSVData() to recognise both csvFile and parquetFile

  2. ✅ Stale file cleanup: delete matching ${table}- prefix files
    before each BQ extract to prevent stale/concurrent file collision

  3. ✅ Removed exportBucketCsvEscapeSymbol from Parquet return path
    (CSV-specific field, not relevant for Parquet)

  4. ✅ Issue closure softened: #3051 (Cubestore external bucket support —
    support more ingestion formats than CSV) is closed here; #9837 (support
    Workload Identity Federation for GCS without explicit credentials in the
    cube/cubestore Docker image) is addressed in combination with PR #10498
    (feat(cubestore): GCS Workload Identity Federation (WIF/ADC) support)

Ready for review.
