This repository contains a clean, reusable workflow for uploading files from a Google Cloud Storage (GCS) bucket into BigQuery tables.
The main use case is converting fragment-level data files, especially HDF5 (.h5) files, into pandas DataFrames and loading them into BigQuery. The workflow supports both:
- Whole-file uploads for smaller files.
- Chunked/batched uploads for large files.
```
gcp_bigquery_upload_repo/
├── notebooks/
│   └── upload_files_to_bigquery_template.ipynb
├── src/
│   └── gcp_bigquery_upload/
│       ├── __init__.py
│       └── upload.py
├── scripts/
│   └── upload_gcs_to_bigquery.py
├── examples/
│   └── config.example.yaml
├── requirements.txt
├── pyproject.toml
├── .gitignore
└── README.md
```
The workflow can:
- List files in a GCS bucket using a prefix and suffix.
- Read `.h5`, `.parquet`, `.csv`, and `.tsv` files into pandas DataFrames.
- Read large HDF5 files in chunks.
- Decode byte-string columns.
- Standardize chromosome formatting.
- Add sample metadata columns based on the source file name.
- Align DataFrames to an existing BigQuery table schema.
- Upload to BigQuery with retry logic.
- Optionally skip files already present in the destination table.
- Optionally deduplicate a BigQuery table after upload.
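As a minimal sketch of the listing step, assuming the standard `google-cloud-storage` client (the repository's own helper in `src/gcp_bigquery_upload/upload.py` may differ):

```python
from google.cloud import storage

def list_gcs_files(project_id: str, bucket_name: str, prefix: str = "", suffix: str = ".h5"):
    """List object paths in a GCS bucket matching a prefix and suffix (illustrative sketch)."""
    client = storage.Client(project=project_id)
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [blob.name for blob in blobs if blob.name.endswith(suffix)]

# Example: find all HDF5 files under a prefix.
# files = list_gcs_files("your-project-id", "your-input-bucket", "path/to/files/", ".h5")
```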
Clone the repository and install dependencies:
```
git clone <your-repo-url>
cd gcp_bigquery_upload_repo
python -m pip install -r requirements.txt
```

For editable development:

```
python -m pip install -e .
```

Before running uploads, authenticate with Google Cloud:

```
gcloud auth application-default login
```

Set your default project if needed:

```
gcloud config set project YOUR_PROJECT_ID
```

The service account or user credentials must have permissions for:
- Reading from the source GCS bucket.
- Creating or appending to the destination BigQuery table.
- Reading BigQuery table metadata if schema alignment or skipping existing files is used.
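To sanity-check credentials before a long upload, a quick snippet like the following (a sketch using the standard `google-cloud-storage` and `google-cloud-bigquery` clients; bucket and dataset names are placeholders) confirms both sides are reachable:

```python
from google.cloud import bigquery, storage

storage_client = storage.Client(project="your-project-id")
bq_client = bigquery.Client(project="your-project-id")

# Confirm the source bucket is readable.
assert storage_client.bucket("your-input-bucket").exists()

# Confirm the destination dataset is visible to these credentials.
bq_client.get_dataset("your-project.your_dataset")
```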
Open:
`notebooks/upload_files_to_bigquery_template.ipynb`
Edit the configuration cell:
```python
PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-input-bucket"
PREFIX = "path/to/files/"
FILE_SUFFIX = ".h5"
DESTINATION_TABLE = "your-project.your_dataset.your_table"

COLUMNS_TO_KEEP = [
    "chrom",
    "frag_start",
    "frag_end",
    "frag_lens",
    "upstream_motif",
    "downstream_motif",
    "variant_sites",
    "variant_status",
]

USE_BATCHES = True
CHUNK_SIZE = 1_000_000
WRITE_DISPOSITION = "WRITE_APPEND"
```

Then run the notebook cells.
Example HDF5 upload:
```
python scripts/upload_gcs_to_bigquery.py \
    --project-id your-project-id \
    --bucket-name your-input-bucket \
    --prefix path/to/files/ \
    --file-suffix .h5 \
    --destination-table your-project.your_dataset.your_table \
    --columns chrom frag_start frag_end frag_lens upstream_motif downstream_motif variant_sites variant_status \
    --use-batches \
    --chunk-size 1000000 \
    --write-disposition WRITE_APPEND
```

Example Parquet upload:
```
python scripts/upload_gcs_to_bigquery.py \
    --project-id your-project-id \
    --bucket-name your-input-bucket \
    --prefix path/to/parquet/ \
    --file-suffix .parquet \
    --destination-table your-project.your_dataset.your_table \
    --write-disposition WRITE_APPEND
```

| Parameter | Meaning |
|---|---|
| `PROJECT_ID` | Google Cloud project used by the clients. |
| `BUCKET_NAME` | Source GCS bucket containing input files. |
| `PREFIX` | Optional path prefix inside the bucket. |
| `FILE_SUFFIX` | File type to upload, such as `.h5`, `.parquet`, `.csv`, or `.tsv`. |
| `DESTINATION_TABLE` | Fully qualified BigQuery table ID: `project.dataset.table`. |
| `COLUMNS_TO_KEEP` | Optional list of columns/datasets to keep. For HDF5 files, these are dataset keys. |
| `USE_BATCHES` | If `True`, upload large HDF5 files in chunks. |
| `CHUNK_SIZE` | Number of rows per chunk for HDF5 batch upload. |
| `WRITE_DISPOSITION` | BigQuery write mode: `WRITE_APPEND`, `WRITE_TRUNCATE`, or `WRITE_EMPTY`. |
| `SKIP_EXISTING` | If `True`, skip files already represented in the destination table. |
| `SOURCE_COLUMN` | Column used to track uploaded source files. Default: `source_file`. |
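A sketch of how the skip-existing check can work (the repository's implementation may differ; `source_file` is the default `SOURCE_COLUMN`):

```python
from google.cloud import bigquery

def already_uploaded(bq_client: bigquery.Client, table_id: str, source_column: str = "source_file") -> set:
    """Return the set of source file paths already present in the destination table (illustrative)."""
    query = f"SELECT DISTINCT {source_column} FROM `{table_id}`"
    return {row[0] for row in bq_client.query(query).result()}

# files_to_upload = [f for f in files if f not in already_uploaded(client, DESTINATION_TABLE)]
```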
For HDF5 input, each top-level dataset is treated as one DataFrame column.
For example, an HDF5 file with datasets:
```
chrom
frag_start
frag_end
frag_lens
variant_status
```

becomes a DataFrame with the same columns.

If `COLUMNS_TO_KEEP` is provided, only those datasets are read.
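A minimal sketch of this mapping, assuming the file is read locally with `h5py` (the repository's reader may use different helpers):

```python
import h5py
import pandas as pd

def read_h5_as_dataframe(local_path: str, columns_to_keep=None) -> pd.DataFrame:
    """Read top-level HDF5 datasets into one DataFrame column each (illustrative sketch)."""
    with h5py.File(local_path, "r") as f:
        keys = columns_to_keep or list(f.keys())
        data = {key: f[key][:] for key in keys}  # each top-level dataset becomes a 1-D array
    return pd.DataFrame(data)
```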
If the destination table already exists, the uploader can align the DataFrame to the existing schema:
- Missing table columns are added to the DataFrame as null.
- Extra DataFrame columns are dropped.
- Columns are reordered to match the table schema.
This is useful when uploading multiple files that may not have identical columns.
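A sketch of this alignment step, assuming the table schema is fetched with the standard BigQuery client (the repository's own function may be named differently):

```python
import pandas as pd
from google.cloud import bigquery

def align_to_table_schema(df: pd.DataFrame, bq_client: bigquery.Client, table_id: str) -> pd.DataFrame:
    """Reindex a DataFrame to match the destination table's columns (illustrative sketch)."""
    table = bq_client.get_table(table_id)
    table_columns = [field.name for field in table.schema]
    # Missing columns become all-null, extra columns are dropped, order matches the table schema.
    return df.reindex(columns=table_columns)
```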
The upload functions add:
- `source_file`: GCS object path.
- `sample_id`: basename of the source file with common suffixes removed.
- `sample_status`: optional inferred status, if enabled.
- `sample_class`: optional inferred class, if enabled.
You can customize this behavior in `add_metadata_columns()` in `src/gcp_bigquery_upload/upload.py`.
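As a rough illustration of what this metadata derivation can look like (a sketch only; the actual rules live in `add_metadata_columns()`):

```python
import os
import pandas as pd

def add_basic_metadata(df: pd.DataFrame, gcs_path: str) -> pd.DataFrame:
    """Attach source-file metadata columns derived from the GCS object path (illustrative sketch)."""
    sample_id = os.path.basename(gcs_path)
    for suffix in (".h5", ".parquet", ".csv", ".tsv"):
        sample_id = sample_id.removesuffix(suffix)  # Python 3.9+
    df = df.copy()
    df["source_file"] = gcs_path
    df["sample_id"] = sample_id
    return df
```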
The module includes `deduplicate_table()` to create a cleaned BigQuery table using a `ROW_NUMBER()` partition strategy.
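Conceptually, the strategy keeps one row per partition-key combination. The sketch below approximates the kind of query involved; it is not the exact SQL that `deduplicate_table()` generates, and it simplifies the `prefer_values` ordering:

```python
from google.cloud import bigquery

def deduplicate_sketch(bq_client: bigquery.Client, input_table: str, output_table: str,
                       partition_columns: list, prefer_column: str) -> None:
    """Keep one row per partition-key combination (illustrative approximation only)."""
    keys = ", ".join(partition_columns)
    query = f"""
    CREATE OR REPLACE TABLE `{output_table}` AS
    SELECT * EXCEPT(rn) FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY {prefer_column}) AS rn
        FROM `{input_table}`
    )
    WHERE rn = 1
    """
    bq_client.query(query).result()
```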
Example:
```python
from google.cloud import bigquery
from gcp_bigquery_upload.upload import deduplicate_table

client = bigquery.Client(project="your-project-id")

deduplicate_table(
    bq_client=client,
    input_table="your-project.dataset.raw_table",
    output_table="your-project.dataset.clean_table",
    partition_columns=[
        "chrom",
        "frag_start",
        "frag_end",
        "source_file",
        "upstream_motif",
        "frag_lens",
        "variant_sites",
    ],
    prefer_column="variant_status",
    prefer_values=["mut", "non_mut"],
)
```

This repository was cleaned up from exploratory Jupyter notebooks. The notebook is now intended to be a runnable template, while the reusable logic lives in `src/gcp_bigquery_upload/upload.py`.
Recommended workflow:
- Test on one small file or use `MAX_FILES = 1`.
- Preview the DataFrame.
- Confirm destination table and schema.
- Run the full upload.
- Optionally run deduplication or downstream BigQuery SQL cleanup.
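To confirm the destination table after a small test run, a quick check like this can help before launching the full upload (a sketch; the table ID is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
table = client.get_table("your-project.your_dataset.your_table")

print("Rows:", table.num_rows)
print("Schema:", [field.name for field in table.schema])
```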
The repository also includes a tentative workflow for uploading collected fragmentomics TSV outputs produced by `collect_fragmentomics_outputs.R`.

- Notebook: `notebooks/upload_fragmentomics_tsv_outputs_to_bigquery.ipynb`
- Reusable functions: `src/gcp_bigquery_upload/fragmentomics.py`
The upload helper recognizes these filename suffixes and maps each to an `output_type`:

| R-script output suffix | `output_type` |
|---|---|
| `fragment_scores.tsv` | `fragment_scores` |
| `insert_size_summary.tsv` | `insert_size_summary` |
| `insert_size_counts.tsv` | `insert_size_counts` |
| `alignmentMetrics.tsv` | `alignment_metrics` |
| `per5Mb_fragment_ratios.tsv` | `per5mb_fragment_ratios` |
| `nucleosome_peak_distances.tsv` | `nucleosome_peak_distances` |
| `chromatin_accessibility_distances.tsv` | `chromatin_accessibility_distances` |
| `5prime_endmotif_frequencies.tsv` | `five_prime_endmotif_frequencies` |
| `3prime_endmotif_frequencies.tsv` | `three_prime_endmotif_frequencies` |
| `breakpoint_frequencies.tsv` | `breakpoint_frequencies` |
| `dinucleotide_frequencies__len150.tsv` | `dinucleotide_frequencies_len150` |
| `dinucleotide_frequencies__len167.tsv` | `dinucleotide_frequencies_len167` |
| `average_dinucleotide_frequencies.tsv` | `average_dinucleotide_frequencies` |
Per-output-type mode (`mode="per_output_type"`) is recommended for most analysis workflows because each TSV keeps its native wide/long format.

Example tables:

```
your-project.your_dataset.fragmentomics_fragment_scores
your-project.your_dataset.fragmentomics_insert_size_summary
your-project.your_dataset.fragmentomics_per5mb_fragment_ratios
```
Unified long mode (`mode="unified_long"`) is recommended for cataloguing, exploratory browsing, and simple cross-output queries. Heterogeneous TSVs are melted into a shared schema with columns such as:

```
sample_id
output_type
source_file
metric_name
metric_value_string
metric_value_numeric
```
Output-specific identifier columns such as `bin`, `seqnames`, `start`, `end`, `motif`, `context`, and `Position` are retained when present.
Example usage:

```python
from gcp_bigquery_upload.fragmentomics import upload_fragmentomics_outputs

manifest = upload_fragmentomics_outputs(
    project_id="your-gcp-project-id",
    bucket_name="your-input-bucket",
    dataset="your_bigquery_dataset",
    prefix="path/to/collected_fragmentomics_outputs/",
    mode="per_output_type",  # or "unified_long"
    table_prefix="fragmentomics_",
    write_disposition="WRITE_APPEND",
)

manifest
```
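Once the outputs are uploaded in unified long mode, a simple cross-output query might look like this (a sketch; `fragmentomics_unified_long` is a hypothetical table name, not necessarily what the helper creates):

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")

query = """
SELECT sample_id, metric_name, metric_value_numeric
FROM `your-gcp-project-id.your_bigquery_dataset.fragmentomics_unified_long`
WHERE output_type = 'insert_size_summary'
ORDER BY sample_id, metric_name
"""

# Pull the melted metrics for one output type into pandas for inspection.
df = client.query(query).to_dataframe()
```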