GCP File-to-BigQuery Upload Toolkit

This repository contains a clean, reusable workflow for uploading files from a Google Cloud Storage (GCS) bucket into BigQuery tables.

The main use case is converting fragment-level data files, especially HDF5 (.h5) files, into pandas DataFrames and loading them into BigQuery. The workflow supports both:

  1. Whole-file uploads for smaller files.
  2. Chunked/batched uploads for large files.

Repository structure

gcp_bigquery_upload_repo/
├── notebooks/
│   └── upload_files_to_bigquery_template.ipynb
├── src/
│   └── gcp_bigquery_upload/
│       ├── __init__.py
│       └── upload.py
├── scripts/
│   └── upload_gcs_to_bigquery.py
├── examples/
│   └── config.example.yaml
├── requirements.txt
├── pyproject.toml
├── .gitignore
└── README.md

What this code does

The workflow can:

  • List files in a GCS bucket using a prefix and suffix.
  • Read .h5, .parquet, .csv, and .tsv files into pandas DataFrames.
  • Read large HDF5 files in chunks.
  • Decode byte-string columns.
  • Standardize chromosome formatting.
  • Add sample metadata columns based on the source file name.
  • Align DataFrames to an existing BigQuery table schema.
  • Upload to BigQuery with retry logic.
  • Optionally skip files already present in the destination table.
  • Optionally deduplicate a BigQuery table after upload.
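
For orientation, the listing step in the first bullet can be expressed with the public google-cloud-storage client roughly like this (a minimal sketch; the helper in src/gcp_bigquery_upload/upload.py may differ in its interface):

from google.cloud import storage

def list_gcs_files(project_id, bucket_name, prefix="", suffix=""):
    """List object names in a bucket that match a prefix and suffix."""
    client = storage.Client(project=project_id)
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [blob.name for blob in blobs if blob.name.endswith(suffix)]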

Installation

Clone the repository and install dependencies:

git clone <your-repo-url>
cd gcp_bigquery_upload_repo
python -m pip install -r requirements.txt

For editable development:

python -m pip install -e .

Google Cloud authentication

Before running uploads, authenticate with Google Cloud:

gcloud auth application-default login

Set your default project if needed:

gcloud config set project YOUR_PROJECT_ID

The service account or user credentials must have permissions for:

  • Reading from the source GCS bucket.
  • Creating or appending to the destination BigQuery table.
  • Reading BigQuery table metadata if schema alignment or skipping existing files is used.

Quick start: notebook workflow

Open:

notebooks/upload_files_to_bigquery_template.ipynb

Edit the configuration cell:

PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-input-bucket"
PREFIX = "path/to/files/"
FILE_SUFFIX = ".h5"

DESTINATION_TABLE = "your-project.your_dataset.your_table"

COLUMNS_TO_KEEP = [
    "chrom",
    "frag_start",
    "frag_end",
    "frag_lens",
    "upstream_motif",
    "downstream_motif",
    "variant_sites",
    "variant_status",
]

USE_BATCHES = True
CHUNK_SIZE = 1_000_000
WRITE_DISPOSITION = "WRITE_APPEND"

Then run the notebook cells.

Quick start: command line

Example HDF5 upload:

python scripts/upload_gcs_to_bigquery.py \
  --project-id your-project-id \
  --bucket-name your-input-bucket \
  --prefix path/to/files/ \
  --file-suffix .h5 \
  --destination-table your-project.your_dataset.your_table \
  --columns chrom frag_start frag_end frag_lens upstream_motif downstream_motif variant_sites variant_status \
  --use-batches \
  --chunk-size 1000000 \
  --write-disposition WRITE_APPEND

Example Parquet upload:

python scripts/upload_gcs_to_bigquery.py \
  --project-id your-project-id \
  --bucket-name your-input-bucket \
  --prefix path/to/parquet/ \
  --file-suffix .parquet \
  --destination-table your-project.your_dataset.your_table \
  --write-disposition WRITE_APPEND

Important parameters

PROJECT_ID: Google Cloud project used by the clients.
BUCKET_NAME: Source GCS bucket containing the input files.
PREFIX: Optional path prefix inside the bucket.
FILE_SUFFIX: File type to upload, such as .h5, .parquet, .csv, or .tsv.
DESTINATION_TABLE: Fully qualified BigQuery table ID (project.dataset.table).
COLUMNS_TO_KEEP: Optional list of columns/datasets to keep. For HDF5 files, these are dataset keys.
USE_BATCHES: If True, upload large HDF5 files in chunks.
CHUNK_SIZE: Number of rows per chunk for HDF5 batch uploads.
WRITE_DISPOSITION: BigQuery write mode (WRITE_APPEND, WRITE_TRUNCATE, or WRITE_EMPTY).
SKIP_EXISTING: If True, skip files already represented in the destination table.
SOURCE_COLUMN: Column used to track uploaded source files (default: source_file).

Notes on HDF5 files

For HDF5 input, each top-level dataset is treated as one DataFrame column.

For example, an HDF5 file with datasets:

chrom
frag_start
frag_end
frag_lens
variant_status

becomes a DataFrame with the same columns.

If COLUMNS_TO_KEEP is provided, only those datasets are read.
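
As an illustration of that mapping (a minimal sketch assuming 1-D, equal-length top-level datasets; the actual reader in upload.py also handles chunked reads and may differ in detail):

import h5py
import pandas as pd

def h5_to_dataframe(local_path, columns_to_keep=None):
    """Read selected top-level HDF5 datasets into a single DataFrame."""
    with h5py.File(local_path, "r") as f:
        keys = columns_to_keep or list(f.keys())
        data = {key: f[key][:] for key in keys}
    df = pd.DataFrame(data)
    # HDF5 string datasets often come back as bytes; decode them to str.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].map(lambda v: v.decode() if isinstance(v, bytes) else v)
    return df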

BigQuery schema handling

If the destination table already exists, the uploader can align the DataFrame to the existing schema:

  • Missing table columns are added to the DataFrame as null.
  • Extra DataFrame columns are dropped.
  • Columns are reordered to match the table schema.

This is useful when uploading multiple files that may not have identical columns.
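
A minimal sketch of that alignment, assuming the destination table already exists (the package's own helper may differ):

from google.cloud import bigquery

def align_to_table_schema(df, bq_client, table_id):
    """Align a DataFrame to an existing BigQuery table's column set and order."""
    table = bq_client.get_table(table_id)
    table_columns = [field.name for field in table.schema]
    # reindex() adds missing columns as NaN (loaded as NULL), drops extras,
    # and reorders columns to match the table schema.
    return df.reindex(columns=table_columns)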

Optional metadata columns

The upload functions add:

  • source_file: GCS object path.
  • sample_id: basename of the source file with common suffixes removed.
  • sample_status: optional inferred status, if enabled.
  • sample_class: optional inferred class, if enabled.

You can customize this behavior in add_metadata_columns() in src/gcp_bigquery_upload/upload.py.
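
For illustration, the sample_id derivation could look roughly like this (a hypothetical sketch, not the body of add_metadata_columns()):

import os

def add_basic_metadata(df, gcs_path, suffixes=(".h5", ".parquet", ".csv", ".tsv")):
    """Attach source_file and a suffix-stripped sample_id to a DataFrame."""
    sample_id = os.path.basename(gcs_path)
    for suffix in suffixes:
        sample_id = sample_id.removesuffix(suffix)
    df = df.copy()
    df["source_file"] = gcs_path
    df["sample_id"] = sample_id
    return df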

Optional deduplication

The module includes deduplicate_table() to create a cleaned BigQuery table using a ROW_NUMBER() partition strategy.

Example:

from google.cloud import bigquery
from gcp_bigquery_upload.upload import deduplicate_table

client = bigquery.Client(project="your-project-id")

deduplicate_table(
    bq_client=client,
    input_table="your-project.dataset.raw_table",
    output_table="your-project.dataset.clean_table",
    partition_columns=[
        "chrom",
        "frag_start",
        "frag_end",
        "source_file",
        "upstream_motif",
        "frag_lens",
        "variant_sites",
    ],
    prefer_column="variant_status",
    prefer_values=["mut", "non_mut"],
)
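
Conceptually, such a helper issues a query of roughly this shape (an approximation of the ROW_NUMBER() strategy, not the exact SQL generated by deduplicate_table()):

# Approximate shape of the deduplication query (illustrative only).
dedup_sql = """
CREATE OR REPLACE TABLE `your-project.dataset.clean_table` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY chrom, frag_start, frag_end, source_file,
                   upstream_motif, frag_lens, variant_sites
      ORDER BY CASE variant_status WHEN 'mut' THEN 0 WHEN 'non_mut' THEN 1 ELSE 2 END
    ) AS rn
  FROM `your-project.dataset.raw_table`
)
WHERE rn = 1
"""
client.query(dedup_sql).result()  # reuses the bigquery.Client from the example above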

Development notes

This repository was refactored from exploratory Jupyter notebooks: the notebook is now intended as a runnable template, while the reusable logic lives in src/gcp_bigquery_upload/upload.py.

Recommended workflow:

  1. Test on a single small file, or set MAX_FILES = 1.
  2. Preview the DataFrame.
  3. Confirm destination table and schema.
  4. Run the full upload.
  5. Optionally run deduplication or downstream BigQuery SQL cleanup.

Fragmentomics TSV outputs from collect_fragmentomics_outputs.R

The repository also includes a tentative workflow for uploading collected fragmentomics TSV outputs produced by collect_fragmentomics_outputs.R.

Notebook:

notebooks/upload_fragmentomics_tsv_outputs_to_bigquery.ipynb

Reusable functions:

src/gcp_bigquery_upload/fragmentomics.py

Supported collected-output TSV types

The upload helper recognizes these filename suffixes and maps each to an output_type:

fragment_scores.tsv → fragment_scores
insert_size_summary.tsv → insert_size_summary
insert_size_counts.tsv → insert_size_counts
alignmentMetrics.tsv → alignment_metrics
per5Mb_fragment_ratios.tsv → per5mb_fragment_ratios
nucleosome_peak_distances.tsv → nucleosome_peak_distances
chromatin_accessibility_distances.tsv → chromatin_accessibility_distances
5prime_endmotif_frequencies.tsv → five_prime_endmotif_frequencies
3prime_endmotif_frequencies.tsv → three_prime_endmotif_frequencies
breakpoint_frequencies.tsv → breakpoint_frequencies
dinucleotide_frequencies__len150.tsv → dinucleotide_frequencies_len150
dinucleotide_frequencies__len167.tsv → dinucleotide_frequencies_len167
average_dinucleotide_frequencies.tsv → average_dinucleotide_frequencies
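
One way to picture the lookup (a sketch mirroring the table above; the real mapping lives in src/gcp_bigquery_upload/fragmentomics.py):

# Mapping of filename suffixes to output types (first few entries shown;
# the remaining suffixes follow the table above).
SUFFIX_TO_OUTPUT_TYPE = {
    "fragment_scores.tsv": "fragment_scores",
    "insert_size_summary.tsv": "insert_size_summary",
    "alignmentMetrics.tsv": "alignment_metrics",
    "5prime_endmotif_frequencies.tsv": "five_prime_endmotif_frequencies",
}

def classify_output(filename):
    """Return the output_type for a collected TSV, or None if unrecognized."""
    for suffix, output_type in SUFFIX_TO_OUTPUT_TYPE.items():
        if filename.endswith(suffix):
            return output_type
    return None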

Recommended upload modes

1. One table per output type

Recommended for most analysis workflows because each TSV keeps its native wide/long format.

Example tables:

your-project.your_dataset.fragmentomics_fragment_scores
your-project.your_dataset.fragmentomics_insert_size_summary
your-project.your_dataset.fragmentomics_per5mb_fragment_ratios

2. One unified long table

Recommended for cataloguing, exploratory browsing, and simple cross-output queries. Heterogeneous TSVs are melted into a shared schema with columns such as:

sample_id
output_type
source_file
metric_name
metric_value_string
metric_value_numeric

Output-specific identifier columns such as bin, seqnames, start, end, motif, context, and Position are retained when present.
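
A minimal sketch of that melt step, assuming a wide TSV already loaded as df with its metadata columns attached (the identifier columns and typing in fragmentomics.py may differ):

import pandas as pd

ID_COLUMNS = ["sample_id", "output_type", "source_file",
              "bin", "seqnames", "start", "end", "motif", "context", "Position"]

def melt_to_long(df):
    """Melt a wide per-output DataFrame into the shared long schema."""
    id_vars = [c for c in ID_COLUMNS if c in df.columns]
    long_df = df.melt(id_vars=id_vars,
                      var_name="metric_name",
                      value_name="metric_value")
    # Keep both a string and a best-effort numeric version of each value.
    long_df["metric_value_numeric"] = pd.to_numeric(long_df["metric_value"], errors="coerce")
    long_df["metric_value_string"] = long_df["metric_value"].astype(str)
    return long_df.drop(columns=["metric_value"])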

Minimal example

from gcp_bigquery_upload.fragmentomics import upload_fragmentomics_outputs

manifest = upload_fragmentomics_outputs(
    project_id="your-gcp-project-id",
    bucket_name="your-input-bucket",
    dataset="your_bigquery_dataset",
    prefix="path/to/collected_fragmentomics_outputs/",
    mode="per_output_type",  # or "unified_long"
    table_prefix="fragmentomics_",
    write_disposition="WRITE_APPEND",
)

manifest
