Merged
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,11 +10,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- RTGtools update to 3.13 [#261](https://github.com/nf-core/variantbenchmarking/issues/261)
- Transforming local modules with bcftools and tabix to standard nf-core modules [#267](https://github.com/nf-core/variantbenchmarking/pull/267)
- Replace local modules SORT_BED and REFORMAT_HEADER with nf-core ones. [#268](https://github.com/nf-core/variantbenchmarking/pull/268)
- Introducing a new subworkflow to generate a truth VCF with an ensemble approach. Test VCFs are merged and, following the ensemble_truth rule (the minimum number of callers that must agree), a new truth set is created. This approach is especially important for somatic benchmarks, where a truth set is often missing. [#276](https://github.com/nf-core/variantbenchmarking/pull/276)

### `Fixed`

- Increasing font sizes, making labelling optional, and some fixes around plots. Tests are edited to cover the optional plot arguments. [#270](https://github.com/nf-core/variantbenchmarking/pull/270)
- Improving the pipeline towards strict syntax health & adding topic channels - 1 [#272](https://github.com/nf-core/variantbenchmarking/pull/272)
- Fixing the bed file bug in concordance analysis [#275](https://github.com/nf-core/variantbenchmarking/pull/275)
- Missing --sample for meta.id is fixed in BCFTOOLS_REHEADER [#276](https://github.com/nf-core/variantbenchmarking/pull/276)

### `Dependencies`

42 changes: 30 additions & 12 deletions README.md
@@ -27,28 +27,45 @@

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

<p align="center">
<img title="variantbenchmarking metro map" src="docs/images/variantbenchmarking_metromap.png" width=100%>
</p>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/images/variantbenchmarking.svg">
<img alt="nf-core/variantbenchmarking metro map" src="docs/images/variantbenchmarking.svg">
</picture>

The workflow involves several key processes to ensure reliable and reproducible results:

### Standardization and normalization of variants:
### Standardization and normalization of test (query/comparison) variants:

This initial step ensures consistent formatting and alignment of variants in test and truth VCF files for accurate comparison.

- Subsample if input test vcf is multisample ([bcftools view](https://samtools.github.io/bcftools/bcftools.html#view))
- Subsample if input vcf is multisample ([bcftools view](https://samtools.github.io/bcftools/bcftools.html#view))
- Homogenization of multi-allelic variants, MNPs and SVs (including imprecise paired breakends and single breakends) ([variant-extractor](https://github.com/EUCANCan/variant-extractor))
- Reformatting test VCF files from different SV callers ([svync](https://github.com/nvnieuwk/svync))
- Reformatting VCF files from different SV callers ([svync](https://github.com/nvnieuwk/svync))
- Standardize SV variants to BND ([SVTK standardize](https://github.com/broadinstitute/gatk-sv/blob/main/src/svtk/scripts/svtk))
- Decompose SVs to BND ([rtgtools svdecompose](https://cn.animalgenome.org/bioinfo/resources/manuals/RTGOperationsManual.pdf))
- Rename sample names in test and truth VCF files ([bcftools reheader](https://samtools.github.io/bcftools/bcftools.html#reheader))
- Splitting multi-allelic variants in test and truth VCF files ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Deduplication of variants in test and truth VCF files ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Left aligning of variants in test and truth VCF files ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Use prepy in order to normalize test files. This option is only applicable for happy benchmarking of germline analysis ([prepy](https://github.com/Illumina/hap.py/tree/master))
- Rename sample names ([bcftools reheader](https://samtools.github.io/bcftools/bcftools.html#reheader))
- Splitting multi-allelic variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Deduplication of variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Left aligning of variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Use prepy to normalize test files. This option is only applicable to hap.py benchmarking of germline analyses ([prepy](https://github.com/Illumina/hap.py/tree/master))
- Split SNVs and indels if the given test VCF contains both. This is only applicable to somatic analyses ([bcftools view](https://samtools.github.io/bcftools/bcftools.html#view))
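The splitting, deduplication and left-alignment steps above can be sketched in plain Python. This is only an illustrative stand-in for what `bcftools norm` does, using simplified tuples rather than real VCF parsing:

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Split a multi-allelic site into bi-allelic records (cf. bcftools norm -m-)."""
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]

def left_align(seq, pos, ref, alt):
    """Left-align one indel against reference sequence `seq` (0-based `pos`),
    mimicking the shifting that bcftools norm performs."""
    while True:
        # drop a shared trailing base, then extend left when an allele empties
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            if pos == 0:
                break
            pos -= 1
            ref, alt = seq[pos] + ref, seq[pos] + alt
        else:
            break
    # trim a shared leading base while both alleles stay longer than one base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, a one-base deletion in a homopolymer, `left_align("ATTTTC", 3, "TT", "T")`, shifts to the leftmost representation `(0, "AT", "A")`.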

### Standardization and normalization of truth (baseline) variants:

- Decompose SVs to BND ([rtgtools svdecompose](https://cn.animalgenome.org/bioinfo/resources/manuals/RTGOperationsManual.pdf))
- Rename sample names ([bcftools reheader](https://samtools.github.io/bcftools/bcftools.html#reheader))
- Splitting multi-allelic variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Deduplication of variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))
- Left aligning of variants ([bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm))

### Ensemble (majority rule) approach to prepare truth variants:

In cases where a gold-standard truth VCF file is unavailable, a common approach is to create an ensemble of test variants using a majority rule. This method retains variants identified by at least `--ensemble_truth` of the $n$ total variant callers. If `--ensemble_truth` > 0:

- Merge small variants (SNVs and indels) ([bcftools merge](https://samtools.github.io/bcftools/bcftools.html#merge))
- Merge structural variants ([SURVIVOR merge](https://github.com/fritzsedlazeck/SURVIVOR/wiki))
- Filter the merged variants according to `--ensemble_truth`.
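The majority rule can be sketched as follows; `ensemble_truth` here is a hypothetical helper for illustration only, while the actual subworkflow merges VCFs with bcftools/SURVIVOR and filters on genotype counts:

```python
from collections import Counter

def ensemble_truth(call_sets, min_support):
    """Keep variants reported by at least `min_support` callers,
    one vote per caller (cf. the --ensemble_truth threshold)."""
    votes = Counter()
    for calls in call_sets:
        votes.update(set(calls))  # a caller votes once per variant
    return sorted(v for v, n in votes.items() if n >= min_support)

callers = [
    [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")],  # caller 1
    [("chr1", 100, "A", "T")],                           # caller 2
    [("chr1", 100, "A", "T"), ("chr1", 300, "T", "G")],  # caller 3
]
print(ensemble_truth(callers, min_support=2))  # → [('chr1', 100, 'A', 'T')]
```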

### Filtering options:

Applying filters within the benchmarking process itself can make it impossible to compare different benchmarking strategies. Therefore, this subworkflow provides variant filtering options for those who want to compare benchmarking methods.
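Conceptually, such filtering is predicate-based record selection. The sketch below (hypothetical helper names, simplified dict records) mirrors what bcftools `--include`/`--exclude` expressions such as `exclude_expression = 'INFO/SVTYPE="BND"'` achieve:

```python
def apply_filters(records, include=None, exclude=None):
    """Keep records matching `include` and not matching `exclude`,
    mirroring bcftools --include/--exclude at a conceptual level."""
    kept = []
    for rec in records:
        if include is not None and not include(rec):
            continue
        if exclude is not None and exclude(rec):
            continue
        kept.append(rec)
    return kept

records = [
    {"type": "DEL", "svlen": 120},
    {"type": "BND", "svlen": 0},
    {"type": "INS", "svlen": 45},
]
# analogous to excluding BNDs while enforcing a minimum SV size
filtered = apply_filters(
    records,
    include=lambda r: r["svlen"] >= 30,
    exclude=lambda r: r["type"] == "BND",
)
print(filtered)  # → [{'type': 'DEL', 'svlen': 120}, {'type': 'INS', 'svlen': 45}]
```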
@@ -80,7 +97,7 @@ Available methods for germline and somatic _structural variant (SV)_ benchmarkin

- Truvari ([truvari bench](https://github.com/acenglish/truvari/wiki/bench))
- SVanalyzer ([svanalyzer benchmark](https://github.com/nhansen/SVanalyzer/blob/master/docs/svbenchmark.rst))
- Rtgtools (only for BND) ([rtg bndeval](https://realtimegenomics.com/products/rtg-tools))
- RTGtools (only for BND) ([rtg bndeval](https://realtimegenomics.com/products/rtg-tools))

> [!NOTE]
> Please note that there is no somatic specific tool for SV benchmarking in this pipeline.
@@ -201,6 +218,7 @@ We thank the following people for their extensive assistance in the development

- Nicolas Vannieuwkerke ([@nvnieuwk](https://github.com/nvnieuwk))
- Maxime Garcia ([@maxulysse](https://github.com/maxulysse))
- Georgia Kesisoglou ([@georgiakes](https://github.com/georgiakes))
- Sameesh Kher ([@khersameesh24](https://github.com/khersameesh24))
- Florian Heyl ([@heylf](https://github.com/heylf))
- Krešimir Beštak ([@kbestak](https://github.com/kbestak))
100 changes: 86 additions & 14 deletions conf/modules.config
@@ -53,8 +53,27 @@ process {
}

withName: BCFTOOLS_VIEW_FILTERMISSING {
ext.prefix = { vcf.baseName - ".vcf" + ".filtermissing" }
ext.args = """--output-type z --include 'GT="alt"'"""
ext.prefix = { vcf.baseName - ".vcf" + ".filtermissing" }
ext.args = {
if (params.analysis == "somatic" && (meta.caller == "strelka" || meta.caller == "manta")) {
"--output-type z --write-index=tbi --include 'SOMATIC=1'"
} else if (meta.caller == "manta" || meta.caller == "lumpy") {
"--output-type z --write-index=tbi"
} else {
"--output-type z --write-index=tbi --include 'GT=\"alt\"'"
}
}
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/preprocess"},
pattern: "*{.vcf.gz,vcf.gz.tbi}",
mode: params.publish_dir_mode
]
}

withName: INJECT_MISSING_GT {
ext.prefix = { input[0].baseName + '.withGT' }
ext.suffix = "vcf"
ext.args = '-v OFS=\'\\t\' \'/^##/ {print $0; next} /^#CHROM/ {print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\\x22Genotype\\x22>"; print $0; next} { $9 = "GT:" $9; for(i=10; i<=NF; i++) $i = "0/1:" $i; print }\''
publishDir = [
enabled: false
]
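The awk one-liner above is dense; the following Python sketch (illustration only, not pipeline code) performs the same transformation on a list of VCF lines:

```python
def inject_missing_gt(vcf_lines):
    """Declare a GT FORMAT header and prepend GT ('0/1') to the FORMAT
    and sample columns of every record, as the awk one-liner does."""
    gt_header = ('##FORMAT=<ID=GT,Number=1,Type=String,'
                 'Description="Genotype">')
    out = []
    for line in vcf_lines:
        if line.startswith("##"):
            out.append(line)
        elif line.startswith("#CHROM"):
            out.append(gt_header)  # inject the header just before #CHROM
            out.append(line)
        else:
            cols = line.split("\t")
            cols[8] = "GT:" + cols[8]                  # FORMAT ($9 in awk)
            cols[9:] = ["0/1:" + s for s in cols[9:]]  # every sample column
            out.append("\t".join(cols))
    return out
```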
@@ -149,11 +168,11 @@ process {
publishDir = [
enabled: false
]

}

withName: "BCFTOOLS_REHEADER*" {
ext.args2 = {"--output-type z --write-index=tbi" }
ext.args = { "--samples <(echo '${meta.id}')" }
ext.prefix = { vcf.baseName - ".vcf" + ".reheader"}
publishDir = [
enabled: false
@@ -243,7 +262,59 @@
]
}

withName: 'PUBLISH_PROCESSED_VCF' {
// majority rule, ensemble analysis
withName: BCFTOOLS_ENSEMBLE {
ext.prefix = {"${meta.id}.ensemble"}
ext.args = {"--output-type z --write-index=tbi --force-samples"}
publishDir = [
enabled: false
]
}

withName: FILTER_MAJORITY {
ext.prefix = {"${meta.id}.majority"}
ext.args = {"--output-type v -i 'COUNT(GT=\"alt\") >= ${params.ensemble_truth}'"}
publishDir = [
enabled: false
]
}

withName: REFORMAT_TRUTH {
ext.prefix = { input[0].baseName + '.reformatted' }
ext.suffix = "vcf"
ext.args = '-v OFS=\'\\t\' \'/^##/ {print $0; next} /^#CHROM/ {print $1, $2, $3, $4, $5, $6, $7, $8, $9, "TRUTH"; next} {print $1, $2, $3, $4, $5, $6, $7, $8, "GT", "0/1"}\''
publishDir = [
enabled: false
]
}
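For readability, the REFORMAT_TRUTH awk program corresponds roughly to this Python sketch (illustration only): every record is collapsed to a single TRUTH sample with genotype 0/1:

```python
def reformat_truth(vcf_lines):
    """Rewrite a merged VCF so it carries one sample column, TRUTH,
    with genotype 0/1 on every record (cf. the awk program above)."""
    out = []
    for line in vcf_lines:
        if line.startswith("##"):
            out.append(line)
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            out.append("\t".join(cols[:9] + ["TRUTH"]))      # replace sample names
        else:
            out.append("\t".join(cols[:8] + ["GT", "0/1"]))  # drop per-sample data
    return out
```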

withName: SURVIVOR_ENSEMBLE {
ext.prefix = {"${meta.id}.ensemble"}
publishDir = [
enabled: false
]
}

withName: REFORMAT_TRUTH_SV {
ext.prefix = { input[0].baseName + '.reformatted' }
ext.suffix = "vcf"
ext.args = '-v OFS=\'\\t\' \'/^##INFO=<ID=CIPOS/ {next} /^##INFO=<ID=CIEND/ {next} /^##/ {print $0; next} /^#CHROM/ {print "##INFO=<ID=CIPOS,Number=2,Type=Integer,Description=\\x22Confidence interval around POS\\x22>"; print "##INFO=<ID=CIEND,Number=2,Type=Integer,Description=\\x22Confidence interval around END\\x22>"; print $1, $2, $3, $4, $5, $6, $7, $8, $9, "TRUTH"; next} {print $1, $2, $3, $4, $5, $6, $7, $8, "GT", "0/1"}\''
publishDir = [
enabled: false
]
}

withName: 'TABIX_BGZIPTABIX_SMALL' {
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/preprocess"},
pattern: "*{.vcf.gz,vcf.gz.tbi}",
mode: params.publish_dir_mode
]
}

withName: BCFTOOLS_SORT_SV {
ext.prefix = { vcf.baseName - ".vcf" + ".sort"}
ext.args = {"--output-type z --write-index=tbi" }
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/preprocess"},
pattern: "*{.vcf.gz,vcf.gz.tbi}",
@@ -289,7 +360,7 @@

// squash-ploidy is necessary to be able to match het-hom changes
withName: "RTGTOOLS_VCFEVAL" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
ext.args = {["--all-record ",
(params.analysis == 'somatic') ? '--squash-ploidy' : ''
].join('').trim()
@@ -302,7 +373,7 @@
}

withName: "RTGTOOLS_BNDEVAL" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/benchmarks/rtgtools"},
pattern: "*{.vcf.gz,vcf.gz.tbi,tsv.gz,txt}",
@@ -311,7 +382,7 @@
}

withName: "HAPPY_HAPPY" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
//ext.args = {""}
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/benchmarks/happy"},
@@ -321,7 +392,7 @@
}

withName: "HAPPY_SOMPY" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
ext.args = { meta.caller.contains("strelka") || meta.caller.contains("varscan") || meta.caller.contains("pisces") || meta.caller == "mutect" ? "--feature-table hcc.${meta.caller}.${params.variant_type} --bin-afs" : "--feature-table generic" }
publishDir = [
path: {"${params.outdir}/${params.variant_type}/${meta.id}/benchmarks/sompy"},
@@ -331,22 +402,22 @@
}

withName: "SPLIT_SOMPY_FEATURES" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
publishDir = [
enabled: false
]
}

withName: "HAPPY_PREPY" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}.prepy"}
ext.prefix = {"${meta.id}.${meta.caller}.prepy"}
ext.args = {"--fixchr --filter-nonref --bcftools-norm"}
publishDir = [
enabled: false
]
}

withName: "TRUVARI_BENCH" {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
ext.args = {[
"--sizemin 0 --sizefilt 0 --sizemax -1",
(meta.pctsize != null) ? " --pctsize ${meta.pctsize}" : '',
@@ -365,7 +436,7 @@
}

withName: SVANALYZER_SVBENCHMARK {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
ext.args = {[
(meta.normshift != null) ? " -normshift ${meta.normshift}" : '',
(meta.normdist != null) ? " -normdist ${meta.normdist}" : '',
@@ -433,7 +504,7 @@
}

withName: WITTYER {
ext.prefix = {"${meta.id}.${params.truth_id}.${meta.caller}"}
ext.prefix = {params.truth_id ? "${meta.id}.${params.truth_id}.${meta.caller}" : "${meta.id}.truth.${meta.caller}" }
ext.args = {[
"--includedFilters=''",
(meta.evaluationmode ) ? " -em ${meta.evaluationmode}" : '',
@@ -604,6 +675,7 @@
]
}


withName: VCF_TO_CSV {
ext.prefix = {"${meta.id}.${meta.tag}"}
publishDir = [
1 change: 0 additions & 1 deletion conf/tests/germline_sv.config
@@ -55,7 +55,6 @@ params {
variant_type = "structural"
method = 'svanalyzer,truvari,wittyer'
preprocess = "split_multiallelic,normalize,deduplicate"
sv_standardization = "svtk"
exclude_expression = 'INFO/SVTYPE="BND" || INFO/SVTYPE="TRA"'
min_sv_size = 30
truth_id = "HG002"
2 changes: 1 addition & 1 deletion conf/tests/liftover_test.config
@@ -13,7 +13,7 @@ process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
time: '2.h'
]

withName: 'BCFTOOLS_NORM*' {
2 changes: 1 addition & 1 deletion conf/tests/liftover_truth.config
@@ -46,7 +46,7 @@ params {
analysis = 'germline'
variant_type = "small"
method = 'rtgtools,happy'
preprocess = "normalize,deduplicate,prepy"
preprocess = "normalize,deduplicate"
skip_plots = "svlength,upset,metrics"

truth_id = "HG002"
58 changes: 58 additions & 0 deletions conf/tests/somatic_snv_ensemble.config
@@ -0,0 +1,58 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.

Use as follows:
nextflow run nf-core/variantbenchmarking -profile somatic_snv_ensemble,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]

withName: 'BCFTOOLS_NORM*' {
cpus = { 1 }
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
}
withName: 'BCFTOOLS_FILTER*' {
cpus = { 1 }
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
}
withName: 'BCFTOOLS_SORT*' {
cpus = { 1 }
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
}
}

params {
config_profile_name = 'Test profile: somatic_snv_ensemble'
config_profile_description = 'Minimal test dataset to check pipeline function'

// Input data
test_data_base = 'https://raw.githubusercontent.com/nf-core/test-datasets/variantbenchmarking'
input = "${params.test_data_base}/samplesheet/samplesheet_snv_somatic_hg38.csv"
outdir = 'results'

// Genome references
genome = 'GRCh38'
analysis = 'somatic'
method = 'sompy'
preprocess = "normalize,filter_contigs"
include_expression = 'TYPE="snp"'

variant_type = "snv"
ensemble_truth = 2
regions_bed = ""
truth_id = ""

}
2 changes: 1 addition & 1 deletion conf/tests/somatic_sv.config
@@ -36,7 +36,7 @@
}
}
params {
config_profile_name = 'Test profile'
config_profile_name = 'Test profile: somatic_sv'
config_profile_description = 'Minimal test dataset to check pipeline function'

// Input data