feat(bigquery,cubestore): Parquet pre-aggregation export with WIF-native URLs #10499

Open
KrishnaRMaddikara wants to merge 2 commits into cube-js:master from KrishnaRMaddikara:feat/bigquery-parquet-wif-preagg

Conversation

@KrishnaRMaddikara

Problem

Three related issues affecting BigQuery pre-aggregation pipelines on GKE:

1. CSV.gz export is slow and expensive
BigQuery driver exports pre-aggregation data as CSV.gz. For large tables
this produces gigabytes of intermediate files in GCS. Parquet is 3-5x
smaller and is already the native CubeStore internal chunk format.

2. getSignedUrl() is broken on GKE Workload Identity (WIF)
getSignedUrl() requires service account key bytes to cryptographically
sign URLs. WIF tokens (short-lived OAuth2 tokens from the GKE metadata
server) cannot sign URLs. The pre-aggregation pipeline silently fails:
BigQuery exports successfully but CubeStore receives 403 on every URL.
Affects all GKE users with Workload Identity configured.
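The workaround described in this PR can be sketched as follows (hypothetical helper name; the actual driver code differs): instead of asking GCS to cryptographically sign an HTTPS URL, return the plain gs:// URI and let the consumer authenticate with its own IAM token.

```typescript
// Hypothetical sketch: build a plain IAM-authenticated gs:// URI instead of
// a signed HTTPS URL. Signing requires service-account key bytes, which a
// short-lived WIF/metadata-server token does not carry — so getSignedUrl()
// fails under Workload Identity, while a bare URI always works.
function gsUri(bucket: string, objectName: string): string {
  return `gs://${bucket}/${objectName}`;
}

// The consumer (CubeStore) resolves the URI with its own OAuth2 access
// token, so no URL signature is ever needed.
```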

3. CubeStore cannot import Parquet (issue #3051)
CubeStore's CREATE TABLE ... WITH (input_format) only accepted csv
and csv_no_header. CubeStore already uses Parquet/Arrow internally for
.chunk.parquet files but the external import path lacked Parquet support.

Solution

packages/cubejs-bigquery-driver/src/BigQueryDriver.ts

  • Export format: CSV + gzip: true → PARQUET
  • URL generation: getSignedUrl() → gs://bucket/object (plain IAM-authenticated URI)
  • Return key: csvFile → parquetFile
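The export-side change can be sketched against the shape of BigQuery's REST extract-job configuration (illustrative only; the driver goes through the Node.js client wrapper, and the helper name is hypothetical):

```typescript
// Sketch of the BigQuery extract-job configuration change, following the
// documented REST JobConfigurationExtract shape.
interface ExtractConfig {
  sourceTable: { projectId: string; datasetId: string; tableId: string };
  destinationUris: string[];
  destinationFormat: 'CSV' | 'PARQUET';
  compression?: 'GZIP' | 'NONE';
}

function parquetExtract(
  project: string, dataset: string, table: string, bucket: string,
): ExtractConfig {
  return {
    sourceTable: { projectId: project, datasetId: dataset, tableId: table },
    // The wildcard URI lets BigQuery shard large tables into multiple files.
    destinationUris: [`gs://${bucket}/${table}-*.parquet`],
    destinationFormat: 'PARQUET', // previously 'CSV' with compression: 'GZIP'
  };
}
```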

packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts

  • Add importParquetFile() method
  • Add parquetFile branch in uploadTableWithIndexes()
  • Sends: CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'
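The statement importParquetFile() sends can be sketched as (hypothetical helper; column quoting and index clauses simplified):

```typescript
// Sketch of the SQL importParquetFile() would issue to CubeStore.
// Columns arrive as already-formatted "name type" pairs.
function createTableWithParquet(
  table: string, columns: string[], location: string,
): string {
  return (
    `CREATE TABLE ${table} (${columns.join(', ')}) ` +
    `WITH (input_format = 'parquet') LOCATION '${location}'`
  );
}
```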

rust/cubestore/cubestore/src/metastore/mod.rs

  • Add ImportFormat::Parquet variant to enum

rust/cubestore/cubestore/src/sql/mod.rs

  • Parse input_format = 'parquet' in WITH clause

rust/cubestore/cubestore/src/import/mod.rs

  • Dispatch ImportFormat::Parquet to do_import_parquet()
  • Add do_import_parquet() using DataFusion ParquetRecordBatchReaderBuilder
    (already compiled into CubeStore — no new dependencies)
  • Add arrow_array_value_to_string() helper for Arrow → TableValue conversion
  • Fix resolve_location() to download gs:// URLs via GCS JSON API with WIF token
  • Fix estimate_location_row_count() to skip fs::metadata() for remote URLs
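The gs:// download step in resolve_location() can be sketched via the GCS JSON API media-download endpoint (the Rust code differs; the endpoint shape is the documented storage.googleapis.com objects.get with alt=media, fetched with `Authorization: Bearer <WIF token>`):

```typescript
// Sketch: translate a gs://bucket/path URI into the GCS JSON API media
// download URL. The object name must be URL-encoded as a single path
// segment, so slashes inside it become %2F.
function gcsMediaUrl(gsLocation: string): string {
  const match = /^gs:\/\/([^/]+)\/(.+)$/.exec(gsLocation);
  if (!match) throw new Error(`Not a gs:// URI: ${gsLocation}`);
  const [, bucket, object] = match;
  return `https://storage.googleapis.com/storage/v1/b/${bucket}` +
    `/o/${encodeURIComponent(object)}?alt=media`;
}
```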

Backward Compatibility

Postgres/Snowflake/Redshift pre-aggregations are unaffected. CSV import
remains supported: csvFile becomes optional in TableCSVData and the
existing CSV path is unchanged.

Testing

Tested on GKE with Workload Identity Federation:

  • BigQuery exports pre-aggregation as Parquet to GCS bucket
  • CubeStore receives gs:// URI, downloads via GCS API using WIF token
  • CREATE TABLE ... WITH (input_format = 'parquet') imports successfully
  • Pre-aggregation queries resolve at sub-millisecond latency

Closes #3051
Closes #9837

…ive URLs

Problem 1 — CSV.gz export is slow and expensive:
  BigQuery driver exports pre-aggregation data as CSV.gz.
  For large tables this means gigabytes of intermediate files.
  Parquet is 3-5x smaller and is the native CubeStore internal format.

Problem 2 — getSignedUrl() requires SA key bytes (broken on GKE WIF):
  getSignedUrl() requires service account key bytes to sign URLs.
  WIF tokens from the metadata server cannot sign URLs.
  Pre-agg pipeline silently fails: BQ exports fine, CubeStore gets 403.

Problem 3 — CubeStore cannot import Parquet (issue cube-js#3051):
  CubeStore only accepted CSV in its external import path.
  CubeStore already uses parquet/arrow internally for .chunk.parquet
  but CREATE TABLE ... WITH (input_format) lacked Parquet support.

Fix:
  packages/cubejs-bigquery-driver/src/BigQueryDriver.ts:
    - Export format: CSV.gz -> PARQUET
    - URL generation: getSignedUrl() -> gs://bucket/object (IAM-authenticated)
    - Return key: csvFile -> parquetFile

  packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts:
    - Add importParquetFile() method
    - Add parquetFile branch in uploadTableWithIndexes()
    - Sends: CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'

  rust/cubestore/cubestore/src/metastore/mod.rs:
    - Add ImportFormat::Parquet variant to enum

  rust/cubestore/cubestore/src/sql/mod.rs:
    - Parse input_format = 'parquet' in WITH clause

  rust/cubestore/cubestore/src/import/mod.rs:
    - Dispatch ImportFormat::Parquet to do_import_parquet()
    - Add do_import_parquet() using DataFusion ParquetRecordBatchReaderBuilder
    - Add arrow_array_value_to_string() helper for Arrow to TableValue conversion
    - Fix resolve_location() to handle gs:// URLs via GCS API with WIF token
    - Fix estimate_location_row_count() to skip fs::metadata() for remote URLs

Works with Workload Identity when combined with the GCS WIF fix in gcs.rs.
Backward compatible: Postgres/Snowflake/Redshift pre-aggs unaffected.

Closes cube-js#3051
Closes cube-js#9837
@KrishnaRMaddikara KrishnaRMaddikara requested review from a team as code owners March 16, 2026 04:17
@github-actions github-actions bot added labels: driver:bigquery (Issues related to the BigQuery driver), cube store (Issues relating to Cube Store), rust (Pull requests that update Rust code), javascript (Pull requests that update Javascript code), data source driver, pr:community (Contribution from Cube.js community members) — Mar 16, 2026
…uet path

- Make csvFile optional in TableCSVData — BigQuery now returns parquetFile only
- Add parquetFile?: string[] to TableCSVData interface
- Update isDownloadTableCSVData() to recognise parquetFile
- Delete stale export files before BQ extract to prevent prefix collision
- Remove exportBucketCsvEscapeSymbol from Parquet return (CSV-specific field)
@KrishnaRMaddikara
Author

Addressed a few other fixes in 2fe8fa6:

  1. ✅ Type safety: added parquetFile?: string[] to TableCSVData in
    driver.interface.ts, made csvFile optional, updated
    isDownloadTableCSVData() to recognise both csvFile and parquetFile

  2. ✅ Stale file cleanup: delete matching ${table}- prefix files
    before each BQ extract to prevent stale/concurrent file collision

  3. ✅ Removed exportBucketCsvEscapeSymbol from Parquet return path
    (CSV-specific field, not relevant for Parquet)

  4. ✅ Issue closure softened: #3051 (Cubestore external bucket support —
    support more ingestion formats than CSV) is closed here; #9837 (support
    Workload Identity Federation for GCS without explicit credentials in the
    cube/cubestore Docker image) is addressed in combination with PR #10498
    (feat(cubestore): GCS Workload Identity Federation (WIF/ADC) support)

Ready for review.
