VCF-RDFizer is a Docker-first CLI wrapper for:
- VCF -> RDF (N-Triples) with RMLStreamer
- Optional RDF compression/decompression

Requirements:
- Python 3.10+
- Docker (installed and running)
Install options:

```
pip install vcf-rdfizer
# or
pipx install vcf-rdfizer
# or
conda install -c conda-forge vcf-rdfizer
```

or pull the prebuilt Docker image directly:

```
docker pull ecrum19/vcf-rdfizer:latest
```

`--out` is required for all modes.
This is the run output root directory. VCF-RDFizer places:
- final RDF/compression outputs
- run metrics/logs
- hidden intermediates
inside this directory.
Modes:
- `full`: VCF -> TSV -> RDF -> compression
- `tsv`: VCF -> TSV only (benchmarking)
- `compress`: compress an existing `.nt`
- `decompress`: decompress `.nt.gz`, `.nt.br`, or `.hdt`
In full mode with multiple VCF inputs, failures are isolated per input:
- the run continues with remaining files
- failed inputs are summarized in
run_metrics/<RUN_ID>/failed_inputs.csv
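As a sketch of how that summary might be consumed after a run, the snippet below builds a mock `failed_inputs.csv` (the run id `run_001` and the `input,error` columns are invented for illustration, not the tool's actual schema) and lists the failed inputs from the newest run:

```shell
# Mock demonstration only: the run id and CSV columns below are invented,
# not real VCF-RDFizer output.
OUT=./results_demo
mkdir -p "$OUT/run_metrics/run_001"
printf 'input,error\nbad_sample.vcf,header parse error\n' \
  > "$OUT/run_metrics/run_001/failed_inputs.csv"

# Pick the newest run directory and print any failed inputs (skipping the header row).
RUN_ID=$(ls -1t "$OUT/run_metrics" | head -n 1)
FAILED="$OUT/run_metrics/$RUN_ID/failed_inputs.csv"
[ -s "$FAILED" ] && tail -n +2 "$FAILED" | cut -d, -f1
```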
Global options:
- `-m, --mode {full,compress,decompress,tsv}`
- `-o, --out`: required output root directory
- `-c, --compression`: methods: `gzip`, `brotli`, `hdt`, `hdt_gzip`, `hdt_brotli`, `none`
- `-I, --image`: Docker image repo (default `ecrum19/vcf-rdfizer`)
- `-v, --image-version`: Docker tag/version
- `-b, --build`: force Docker build
- `-B, --no-build`: fail if image not found
- `-h, --help`: show full usage
Full mode options:
- `-i, --input`: required VCF file or directory
- `-r, --rules`: mapping rules file (`.ttl`); default: `rules/default_rules.ttl`
- `-l, --rdf-layout {aggregate,batch}`: required in full mode
- `-P, --spark-partitions`: optional Spark partition hint (positive integer); low-cost way to reduce output part count by setting `spark.default.parallelism` and `spark.sql.shuffle.partitions`
- `-k, --keep-tsv`: keep hidden TSV intermediates
- `-R, --keep-rdf`: keep raw `.nt` after compression
- `-e, --estimate-size`: preflight size estimate
TSV mode options:
- `-i, --input`: required VCF file or directory
- outputs a per-run benchmark summary in `run_metrics/<RUN_ID>/tsv_metrics.csv`
- raw TSV timing + artifact JSON per input in `run_metrics/<RUN_ID>/raw_metrics/tsv_*`
Compress mode options:
- `-q, --rdf, --nt`: required input `.nt` file
Decompress mode options:
- `-C, --compressed-input`: required `.nt.gz`, `.nt.br`, or `.hdt`
- `-d, --decompress-out`: optional explicit output `.nt` path (must be inside `--out`)
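For the `gzip` method, the round trip is equivalent in effect to standard gzip compression and decompression; a minimal sketch using stock tools (the one-triple file is made up, and VCF-RDFizer itself drives this via `--mode compress` / `--mode decompress`):

```shell
# Illustration only: create a one-triple N-Triples file, gzip it,
# decompress, and verify the round trip byte-for-byte.
printf '<http://example.org/s> <http://example.org/p> "o" .\n' > sample.nt
gzip -kf sample.nt                    # -k keeps sample.nt, producing sample.nt.gz
gzip -dc sample.nt.gz > roundtrip.nt  # decompress to a new file
cmp -s sample.nt roundtrip.nt && echo "round-trip OK"
```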
Show help:

```
vcf-rdfizer --help
```

Full pipeline (aggregate RDF):

```
vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout aggregate \
  --out ./results
```

Full pipeline (batch RDF parts):

```
vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --compression hdt \
  --out ./results
```

Full pipeline with low-cost partition cap (helps avoid too many tiny batch files):

```
vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --spark-partitions 8 \
  --compression hdt \
  --out ./results
```

Full pipeline with custom rules + keep RDF:

```
vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rules ./rules/my_rules.ttl \
  --rdf-layout aggregate \
  --compression hdt,brotli \
  --keep-rdf \
  --out ./results
```

TSV-only benchmark:

```
vcf-rdfizer \
  --mode tsv \
  --input ./vcf_files \
  --out ./results
```

Compression-only:

```
vcf-rdfizer \
  --mode compress \
  --rdf ./results/sample/sample.nt \
  --compression hdt_gzip \
  --out ./results
```

Decompression-only:

```
vcf-rdfizer \
  --mode decompress \
  --compressed-input ./results/sample/sample.hdt \
  --out ./results
```

Given `--out ./results`:
- final outputs: `./results/<sample>/...`
- per-run metrics/logs: `./results/run_metrics/<RUN_ID>/...`
- hidden intermediates: `./results/.intermediate/tsv/`
Intermediates are hidden by default. Raw `.nt` files are removed after compression unless `--keep-rdf` is provided.
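Since the intermediates directory name starts with a dot, a plain `ls` on the output root will not show it. A small sketch, with the documented layout mocked locally for illustration:

```shell
# Mock of the documented output layout (created locally just for illustration;
# no real run is involved).
OUT=./results_mock
mkdir -p "$OUT/sample" "$OUT/run_metrics/RUN_ID" "$OUT/.intermediate/tsv"
ls "$OUT"      # hidden .intermediate/ is not listed
ls -A "$OUT"   # -A includes dotfile entries, so intermediates show up
```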
For each run, VCF-RDFizer writes:
- `run_metrics/<RUN_ID>/metrics.csv`
- `run_metrics/<RUN_ID>/wrapper_execution_times.csv`
- `run_metrics/<RUN_ID>/progress.log`
Compression metrics now include, per method:
- `wall_seconds_*`
- `user_seconds_*`
- `sys_seconds_*`
- `max_rss_kb_*`
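A rough sketch of the idea behind the wall-clock column, measuring elapsed time around a gzip step (illustrative only; this is not how the tool itself records user/sys time or peak RSS, which need `time(1)` or similar):

```shell
# Illustrative only: elapsed wall time around one compression step,
# in the spirit of the wall_seconds_* columns. The input file is made up.
printf '<http://example.org/s> <http://example.org/p> "o" .\n' > demo.nt
start=$(date +%s)
gzip -kf demo.nt    # the step being timed
end=$(date +%s)
echo "wall_seconds_gzip=$((end - start))"
```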
- default rules file: `rules/default_rules.ttl`
- rules guide: `rules/README.md`
If Docker permission issues occur, rerun as a user permitted to run Docker (e.g. add your user to the `docker` group, or use sudo).
If HDT compression fails on very large .nt files, use batch layout and/or non-HDT compression methods.
Safe termination:
- Press `Ctrl+C` to interrupt a run.
- The wrapper exits with code `130`, writes progress to `run_metrics/<RUN_ID>/progress.log`, and performs best-effort cleanup of tracked intermediates.
- Raw RDF cleanup on interrupt follows `--keep-rdf`:
  - with `--keep-rdf`, raw `.nt` files are preserved
  - without `--keep-rdf`, tracked raw `.nt` files are removed during interrupt cleanup
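The exit code follows the common shell convention that a process terminated by signal N exits with status 128 + N; SIGINT is signal 2, hence 130. A tiny demonstration, independent of VCF-RDFizer:

```shell
# A child shell sends SIGINT to itself; the parent then observes
# exit status 128 + 2 = 130.
sh -c 'kill -INT $$'
echo "exit status: $?"
```

Had the child been terminated by SIGTERM (signal 15) instead, the reported status would be 143 by the same convention.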
If you use VCF-RDFizer in a publication, please cite:
VCF-RDFizer maintainers. (2026). VCF-RDFizer (Version 1.1.0) [Computer software]. GitHub. https://github.com/ecrum19/VCF-RDFizer
BibTeX:

```
@software{vcf_rdfizer_2026,
  author  = {{VCF-RDFizer maintainers}},
  title   = {VCF-RDFizer},
  year    = {2026},
  version = {1.1.0},
  url     = {https://github.com/ecrum19/VCF-RDFizer},
  note    = {Computer software}
}
```

You can also use the machine-readable citation file: `CITATION.cff`.
Contributions are welcome. If you want to improve VCF-RDFizer:
- Open an issue first for bug reports, feature requests, or design changes.
- Fork the repo and create a feature branch from `main`.
- Keep changes focused and include/update tests for behavior changes.
- Run the unit tests locally before opening a PR:

  ```
  python3 -m unittest discover -s test -p "test_*_unit.py" -q
  ```

- In your PR, include what changed, why it changed, and how you validated it.
- Use clear commit messages (for Docker publish control, include `[publish-docker]` only when intended).
- Project license: `LICENSE` (MIT)
- Third-party runtime notices: `THIRD_PARTY_NOTICES.md`