GCP File-to-BigQuery Upload Toolkit

This repository contains a clean, reusable workflow for uploading files from a Google Cloud Storage (GCS) bucket into BigQuery tables.

The main use case is converting fragment-level data files, especially HDF5 (.h5) files, into pandas DataFrames and loading them into BigQuery. The workflow supports both:

  1. Whole-file uploads for smaller files.
  2. Chunked/batched uploads for large files.

Repository structure

gcp_bigquery_upload_repo/
├── notebooks/
│   └── upload_files_to_bigquery_template.ipynb
├── src/
│   └── gcp_bigquery_upload/
│       ├── __init__.py
│       └── upload.py
├── scripts/
│   └── upload_gcs_to_bigquery.py
├── examples/
│   └── config.example.yaml
├── requirements.txt
├── pyproject.toml
├── .gitignore
└── README.md

What this code does

The workflow can:

  • List files in a GCS bucket using a prefix and suffix.
  • Read .h5, .parquet, .csv, and .tsv files into pandas DataFrames.
  • Read large HDF5 files in chunks.
  • Decode byte-string columns.
  • Standardize chromosome formatting.
  • Add sample metadata columns based on the source file name.
  • Align DataFrames to an existing BigQuery table schema.
  • Upload to BigQuery with retry logic.
  • Optionally skip files already present in the destination table.
  • Optionally deduplicate a BigQuery table after upload.
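
For orientation, the listing step in the first bullet can be expressed with the public google-cloud-storage client roughly like this (a minimal sketch; the helper in src/gcp_bigquery_upload/upload.py may differ in its interface):

from google.cloud import storage

def list_gcs_files(project_id, bucket_name, prefix="", suffix=""):
    """List object names in a bucket that match a prefix and suffix."""
    client = storage.Client(project=project_id)
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [blob.name for blob in blobs if blob.name.endswith(suffix)]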

Installation

Clone the repository and install dependencies:

git clone <your-repo-url>
cd gcp_bigquery_upload_repo
python -m pip install -r requirements.txt

For editable development:

python -m pip install -e .

Google Cloud authentication

Before running uploads, authenticate with Google Cloud:

gcloud auth application-default login

Set your default project if needed:

gcloud config set project YOUR_PROJECT_ID

The service account or user credentials must have permissions for:

  • Reading from the source GCS bucket.
  • Creating or appending to the destination BigQuery table.
  • Reading BigQuery table metadata if schema alignment or skipping existing files is used.

Quick start: notebook workflow

Open:

notebooks/upload_files_to_bigquery_template.ipynb

Edit the configuration cell:

PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-input-bucket"
PREFIX = "path/to/files/"
FILE_SUFFIX = ".h5"

DESTINATION_TABLE = "your-project.your_dataset.your_table"

COLUMNS_TO_KEEP = [
    "chrom",
    "frag_start",
    "frag_end",
    "frag_lens",
    "upstream_motif",
    "downstream_motif",
    "variant_sites",
    "variant_status",
]

USE_BATCHES = True
CHUNK_SIZE = 1_000_000
WRITE_DISPOSITION = "WRITE_APPEND"

Then run the notebook cells.

Quick start: command line

Example HDF5 upload:

python scripts/upload_gcs_to_bigquery.py \
  --project-id your-project-id \
  --bucket-name your-input-bucket \
  --prefix path/to/files/ \
  --file-suffix .h5 \
  --destination-table your-project.your_dataset.your_table \
  --columns chrom frag_start frag_end frag_lens upstream_motif downstream_motif variant_sites variant_status \
  --use-batches \
  --chunk-size 1000000 \
  --write-disposition WRITE_APPEND

Example Parquet upload:

python scripts/upload_gcs_to_bigquery.py \
  --project-id your-project-id \
  --bucket-name your-input-bucket \
  --prefix path/to/parquet/ \
  --file-suffix .parquet \
  --destination-table your-project.your_dataset.your_table \
  --write-disposition WRITE_APPEND

Important parameters

PROJECT_ID: Google Cloud project used by the clients.
BUCKET_NAME: Source GCS bucket containing the input files.
PREFIX: Optional path prefix inside the bucket.
FILE_SUFFIX: File type to upload, such as .h5, .parquet, .csv, or .tsv.
DESTINATION_TABLE: Fully qualified BigQuery table ID (project.dataset.table).
COLUMNS_TO_KEEP: Optional list of columns/datasets to keep. For HDF5 files, these are dataset keys.
USE_BATCHES: If True, upload large HDF5 files in chunks.
CHUNK_SIZE: Number of rows per chunk for HDF5 batch uploads.
WRITE_DISPOSITION: BigQuery write mode (WRITE_APPEND, WRITE_TRUNCATE, or WRITE_EMPTY).
SKIP_EXISTING: If True, skip files already represented in the destination table.
SOURCE_COLUMN: Column used to track uploaded source files (default: source_file).

Notes on HDF5 files

For HDF5 input, each top-level dataset is treated as one DataFrame column.

For example, an HDF5 file with datasets:

chrom
frag_start
frag_end
frag_lens
variant_status

becomes a DataFrame with the same columns.

If COLUMNS_TO_KEEP is provided, only those datasets are read.
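
As an illustration of that mapping (a minimal sketch assuming 1-D, equal-length top-level datasets; the actual reader in upload.py also handles chunked reads and may differ in detail):

import h5py
import pandas as pd

def h5_to_dataframe(local_path, columns_to_keep=None):
    """Read selected top-level HDF5 datasets into a single DataFrame."""
    with h5py.File(local_path, "r") as f:
        keys = columns_to_keep or list(f.keys())
        data = {key: f[key][:] for key in keys}
    df = pd.DataFrame(data)
    # HDF5 string datasets often come back as bytes; decode them to str.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].map(lambda v: v.decode() if isinstance(v, bytes) else v)
    return df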

BigQuery schema handling

If the destination table already exists, the uploader can align the DataFrame to the existing schema:

  • Missing table columns are added to the DataFrame as null.
  • Extra DataFrame columns are dropped.
  • Columns are reordered to match the table schema.

This is useful when uploading multiple files that may not have identical columns.
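
A minimal sketch of that alignment, assuming the destination table already exists (the package's own helper may differ):

from google.cloud import bigquery

def align_to_table_schema(df, bq_client, table_id):
    """Align a DataFrame to an existing BigQuery table's column set and order."""
    table = bq_client.get_table(table_id)
    table_columns = [field.name for field in table.schema]
    # reindex() adds missing columns as NaN (loaded as NULL), drops extras,
    # and reorders columns to match the table schema.
    return df.reindex(columns=table_columns)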

Optional metadata columns

The upload functions add:

  • source_file: GCS object path.
  • sample_id: basename of the source file with common suffixes removed.
  • sample_status: optional inferred status, if enabled.
  • sample_class: optional inferred class, if enabled.

You can customize this behavior in add_metadata_columns() in src/gcp_bigquery_upload/upload.py.
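
For illustration, the sample_id derivation could look roughly like this (a hypothetical sketch, not the body of add_metadata_columns()):

import os

def add_basic_metadata(df, gcs_path, suffixes=(".h5", ".parquet", ".csv", ".tsv")):
    """Attach source_file and a suffix-stripped sample_id to a DataFrame."""
    sample_id = os.path.basename(gcs_path)
    for suffix in suffixes:
        sample_id = sample_id.removesuffix(suffix)
    df = df.copy()
    df["source_file"] = gcs_path
    df["sample_id"] = sample_id
    return df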

Optional deduplication

The module includes deduplicate_table() to create a cleaned BigQuery table using a ROW_NUMBER() partition strategy.

Example:

from google.cloud import bigquery
from gcp_bigquery_upload.upload import deduplicate_table

client = bigquery.Client(project="your-project-id")

deduplicate_table(
    bq_client=client,
    input_table="your-project.dataset.raw_table",
    output_table="your-project.dataset.clean_table",
    partition_columns=[
        "chrom",
        "frag_start",
        "frag_end",
        "source_file",
        "upstream_motif",
        "frag_lens",
        "variant_sites",
    ],
    prefer_column="variant_status",
    prefer_values=["mut", "non_mut"],
)
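
Conceptually, such a helper issues a query of roughly this shape (an approximation of the ROW_NUMBER() strategy, not the exact SQL generated by deduplicate_table()):

# Approximate shape of the deduplication query (illustrative only).
dedup_sql = """
CREATE OR REPLACE TABLE `your-project.dataset.clean_table` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY chrom, frag_start, frag_end, source_file,
                   upstream_motif, frag_lens, variant_sites
      ORDER BY CASE variant_status WHEN 'mut' THEN 0 WHEN 'non_mut' THEN 1 ELSE 2 END
    ) AS rn
  FROM `your-project.dataset.raw_table`
)
WHERE rn = 1
"""
client.query(dedup_sql).result()  # reuses the bigquery.Client from the example above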

Development notes

This repository was refactored from exploratory Jupyter notebooks: the notebook is now intended as a runnable template, while the reusable logic lives in src/gcp_bigquery_upload/upload.py.

Recommended workflow:

  1. Test on a single small file, or set MAX_FILES = 1.
  2. Preview the DataFrame.
  3. Confirm destination table and schema.
  4. Run the full upload.
  5. Optionally run deduplication or downstream BigQuery SQL cleanup.

Fragmentomics TSV outputs from collect_fragmentomics_outputs.R

The repository also includes a tentative workflow for uploading collected fragmentomics TSV outputs produced by collect_fragmentomics_outputs.R.

Notebook:

notebooks/upload_fragmentomics_tsv_outputs_to_bigquery.ipynb

Reusable functions:

src/gcp_bigquery_upload/fragmentomics.py

Supported collected-output TSV types

The upload helper recognizes these filename suffixes and maps each to an output_type:

fragment_scores.tsv → fragment_scores
insert_size_summary.tsv → insert_size_summary
insert_size_counts.tsv → insert_size_counts
alignmentMetrics.tsv → alignment_metrics
per5Mb_fragment_ratios.tsv → per5mb_fragment_ratios
nucleosome_peak_distances.tsv → nucleosome_peak_distances
chromatin_accessibility_distances.tsv → chromatin_accessibility_distances
5prime_endmotif_frequencies.tsv → five_prime_endmotif_frequencies
3prime_endmotif_frequencies.tsv → three_prime_endmotif_frequencies
breakpoint_frequencies.tsv → breakpoint_frequencies
dinucleotide_frequencies__len150.tsv → dinucleotide_frequencies_len150
dinucleotide_frequencies__len167.tsv → dinucleotide_frequencies_len167
average_dinucleotide_frequencies.tsv → average_dinucleotide_frequencies
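
One way to picture the lookup (a sketch mirroring the table above; the real mapping lives in src/gcp_bigquery_upload/fragmentomics.py):

# Mapping of filename suffixes to output types (first few entries shown;
# the remaining suffixes follow the table above).
SUFFIX_TO_OUTPUT_TYPE = {
    "fragment_scores.tsv": "fragment_scores",
    "insert_size_summary.tsv": "insert_size_summary",
    "alignmentMetrics.tsv": "alignment_metrics",
    "5prime_endmotif_frequencies.tsv": "five_prime_endmotif_frequencies",
}

def classify_output(filename):
    """Return the output_type for a collected TSV, or None if unrecognized."""
    for suffix, output_type in SUFFIX_TO_OUTPUT_TYPE.items():
        if filename.endswith(suffix):
            return output_type
    return None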

Recommended upload modes

1. One table per output type

Recommended for most analysis workflows because each TSV keeps its native wide/long format.

Example tables:

your-project.your_dataset.fragmentomics_fragment_scores
your-project.your_dataset.fragmentomics_insert_size_summary
your-project.your_dataset.fragmentomics_per5mb_fragment_ratios

2. One unified long table

Recommended for cataloguing, exploratory browsing, and simple cross-output queries. Heterogeneous TSVs are melted into a shared schema with columns such as:

sample_id
output_type
source_file
metric_name
metric_value_string
metric_value_numeric

Output-specific identifier columns such as bin, seqnames, start, end, motif, context, and Position are retained when present.
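
A minimal sketch of that melt step, assuming a wide TSV already loaded as df with its metadata columns attached (the identifier columns and typing in fragmentomics.py may differ):

import pandas as pd

ID_COLUMNS = ["sample_id", "output_type", "source_file",
              "bin", "seqnames", "start", "end", "motif", "context", "Position"]

def melt_to_long(df):
    """Melt a wide per-output DataFrame into the shared long schema."""
    id_vars = [c for c in ID_COLUMNS if c in df.columns]
    long_df = df.melt(id_vars=id_vars,
                      var_name="metric_name",
                      value_name="metric_value")
    # Keep both a string and a best-effort numeric version of each value.
    long_df["metric_value_numeric"] = pd.to_numeric(long_df["metric_value"], errors="coerce")
    long_df["metric_value_string"] = long_df["metric_value"].astype(str)
    return long_df.drop(columns=["metric_value"])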

Minimal example

from gcp_bigquery_upload.fragmentomics import upload_fragmentomics_outputs

manifest = upload_fragmentomics_outputs(
    project_id="your-gcp-project-id",
    bucket_name="your-input-bucket",
    dataset="your_bigquery_dataset",
    prefix="path/to/collected_fragmentomics_outputs/",
    mode="per_output_type",  # or "unified_long"
    table_prefix="fragmentomics_",
    write_disposition="WRITE_APPEND",
)

manifest
