feat(bigquery,cubestore): Parquet pre-aggregation export with WIF-native URLs #10499
…uet path

- Make `csvFile` optional in `TableCSVData` — BigQuery now returns `parquetFile` only
- Add `parquetFile?: string[]` to the `TableCSVData` interface
- Update `isDownloadTableCSVData()` to recognise `parquetFile`
- Delete stale export files before the BigQuery extract to prevent prefix collisions
- Remove `exportBucketCsvEscapeSymbol` from the Parquet return (a CSV-specific field)
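The interface change above can be sketched roughly as follows. Note this is an assumption-laden illustration: the real `TableCSVData` type and `isDownloadTableCSVData()` guard live in Cube.js driver packages and differ in detail.

```typescript
// Sketch of the described change: csvFile becomes optional and parquetFile
// is added, so a driver can return either kind of export file list.
interface TableCSVData {
  csvFile?: string[];
  parquetFile?: string[];
  exportBucketCsvEscapeSymbol?: string; // CSV-only; absent on the Parquet path
}

// Guard mirroring the described isDownloadTableCSVData() update: accept
// results that carry either csvFile or parquetFile.
function isDownloadTableCSVData(tableData: unknown): tableData is TableCSVData {
  const t = tableData as TableCSVData;
  return Boolean(t && (Array.isArray(t.csvFile) || Array.isArray(t.parquetFile)));
}
```

With this shape, existing CSV-returning drivers keep passing the guard while the BigQuery driver can return `parquetFile` only.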
Addressed a few other fixes in 2fe8fa6.

Ready for review.
## Problem
Three related issues affecting BigQuery pre-aggregation pipelines on GKE:
1. **CSV.gz export is slow and expensive**
   The BigQuery driver exports pre-aggregation data as CSV.gz. For large tables this produces gigabytes of intermediate files in GCS. Parquet is 3-5x smaller and is already CubeStore's native internal chunk format.
2. **`getSignedUrl()` is broken on GKE Workload Identity (WIF)**
   `getSignedUrl()` requires service account key bytes to cryptographically sign URLs. WIF tokens (short-lived OAuth2 tokens from the GKE metadata server) cannot sign URLs, so the pre-aggregation pipeline fails silently: BigQuery exports successfully but CubeStore receives a 403 on every URL. This affects all GKE users with Workload Identity configured.
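Because a WIF token carries no private key, nothing client-side can sign a URL; the alternative is to hand over a plain `gs://` URI and let the consumer download it with IAM authentication. A minimal sketch of that mapping (the helper `gcsObjectMediaUrl` is hypothetical, not code from this PR):

```typescript
// Convert a gs://bucket/object URI into the GCS JSON API media-download URL.
// The caller authenticates with `Authorization: Bearer <token>`, where the
// token comes from the GKE metadata server — no key bytes, no URL signing.
function gcsObjectMediaUrl(gsUri: string): string {
  const match = /^gs:\/\/([^/]+)\/(.+)$/.exec(gsUri);
  if (!match) {
    throw new Error(`Not a gs:// URI: ${gsUri}`);
  }
  const [, bucket, object] = match;
  // Object names must be URL-encoded as a single path segment.
  return `https://storage.googleapis.com/storage/v1/b/${bucket}/o/${encodeURIComponent(object)}?alt=media`;
}

// On GKE with Workload Identity, an access token is available from:
//   GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
//   (with header `Metadata-Flavor: Google`)
```

This is the general pattern the fix relies on: authorization happens per request via IAM, so short-lived tokens are sufficient.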
3. **CubeStore cannot import Parquet (issue #3051)**
   CubeStore's `CREATE TABLE ... WITH (input_format)` only accepted `csv` and `csv_no_header`. CubeStore already uses Parquet/Arrow internally for `.chunk.parquet` files, but the external import path lacked Parquet support.

## Solution
`packages/cubejs-bigquery-driver/src/BigQueryDriver.ts`

- Export format: `CSV` + `gzip: true` → `PARQUET`
- URL generation: `getSignedUrl()` → `gs://bucket/object` (plain IAM-authenticated URI)
- Return key: `csvFile` → `parquetFile`

`packages/cubejs-cubestore-driver/src/CubeStoreDriver.ts`

- Add `importParquetFile()` method
- Add `parquetFile` branch in `uploadTableWithIndexes()`
- Sends `CREATE TABLE t (...) WITH (input_format = 'parquet') LOCATION 'gs://...'`

`rust/cubestore/cubestore/src/metastore/mod.rs`

- Add `ImportFormat::Parquet` variant to the enum

`rust/cubestore/cubestore/src/sql/mod.rs`

- Parse `input_format = 'parquet'` in the `WITH` clause

`rust/cubestore/cubestore/src/import/mod.rs`

- Dispatch `ImportFormat::Parquet` to `do_import_parquet()`
- Add `do_import_parquet()` using DataFusion's `ParquetRecordBatchReaderBuilder` (already compiled into CubeStore — no new dependencies)
- Add `arrow_array_value_to_string()` helper for Arrow → `TableValue` conversion
- Fix `resolve_location()` to download `gs://` URLs via the GCS JSON API with a WIF token
- Fix `estimate_location_row_count()` to skip `fs::metadata()` for remote URLs

## Backward Compatibility

Postgres, Snowflake, and Redshift pre-aggregations are unaffected.
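The statement the CubeStore driver's Parquet branch sends could be assembled as in this sketch. The builder name and column shape are made up for illustration; only the SQL shape (`WITH (input_format = 'parquet') LOCATION ...`) comes from the PR:

```typescript
// Hypothetical helper building the CubeStore import statement for Parquet
// files, mirroring what an importParquetFile() method might send.
function buildCreateTableWithParquet(
  table: string,
  columns: { name: string; type: string }[],
  locations: string[],
): string {
  const columnDefs = columns.map((c) => `${c.name} ${c.type}`).join(", ");
  // CubeStore accepts one or more LOCATION entries for external import.
  const locationClause = locations.map((l) => `'${l}'`).join(", ");
  return (
    `CREATE TABLE ${table} (${columnDefs}) ` +
    `WITH (input_format = 'parquet') LOCATION ${locationClause}`
  );
}
```

On the Rust side, the `WITH` clause parser maps `'parquet'` to the new `ImportFormat::Parquet` variant, and the `LOCATION` URIs flow through `resolve_location()` unchanged.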
## Testing
Tested on GKE with Workload Identity Federation:

- `gs://` URI downloads via the GCS API using a WIF token
- `CREATE TABLE ... WITH (input_format = 'parquet')` imports successfully

Closes #3051
Closes #9837