A Snakemake workflow for bacterial genome assembly and QC using PacBio HiFi reads from the University of Bern NGS Platform.
This workflow automates the complete pipeline for bacterial genome assembly:
- QC on raw reads - NanoPlot generates comprehensive reports on sequencing data
- Assembly - Flye assembler produces high-quality long-read assemblies
- Assembly QC - SeqKit generates assembly statistics and summary reports
- Completeness assessment - BUSCO evaluates assembly completeness against conserved orthologs
- Taxonomic validation - PGAP performs taxonomic classification and validation
- Genome annotation - Bakta provides comprehensive prokaryotic genome annotation
- Modular design with separate rule files for each step
- Conda environment management for reproducibility
- Docker containerization for BUSCO and PGAP
- Comprehensive logging and output organization
- Automated sample discovery from CSV file
- Support for batch processing multiple samples
- Completeness assessment with BUSCO
- Taxonomic validation with PGAP
- Genome annotation with Bakta
- Summary tables for QC and taxonomic metrics
- Snakemake 7.0+
- Conda or Mamba
- PacBio HiFi BAM files with corresponding barcode information
- Clone this repository:
git clone https://github.com/yourusername/smk-bac-assembly.git
cd smk-bac-assembly
- Install Snakemake (if not already installed):
conda create -n smk -c conda-forge -c bioconda snakemake
conda activate smk
- This workflow expects the sample sheet (CSV) provided by NGSP. Copy it to config/samples.csv:
BioSample,Barcode
sample_1,bc2170--bc2170
sample_2,bc2171--bc2171
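As an illustration of how such a sample sheet can be turned into a sample list, here is a minimal Python sketch. It uses the two column names shown above (BioSample, Barcode). The function name read_samples is hypothetical, and the workflow's own loader may differ.

```python
import csv
import io


def read_samples(handle):
    """Return a {BioSample: Barcode} mapping from an open samples.csv handle."""
    samples = {}
    for row in csv.DictReader(handle):
        name = row["BioSample"].strip()
        barcode = row["Barcode"].strip()
        if name in samples:
            # A repeated sample name would make output paths ambiguous.
            raise ValueError(f"duplicate BioSample: {name}")
        samples[name] = barcode
    return samples


# In-memory copy of the example sheet, so the sketch runs without any files.
example = io.StringIO(
    "BioSample,Barcode\n"
    "sample_1,bc2170--bc2170\n"
    "sample_2,bc2171--bc2171\n"
)
SAMPLES = read_samples(example)
print(SAMPLES)
```

With a real sheet, the same function can be called on open("config/samples.csv", newline="").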
Copy the config template (cp config/config_template.yaml config/config.yaml), then edit the copy to set:
- bam_dir: Path to your BAM files
- outdir: Output directory for results
- flye.genome_size: Approximate genome size for your organism
- Thread and memory allocations for each tool
- Location of the Bakta database (db_path)
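A quick check for forgotten settings can save a failed run. The sketch below assumes the config has already been loaded into a nested dictionary, and that the dotted name flye.genome_size corresponds to a genome_size key nested under flye. The helper missing_keys and the example values are hypothetical. Because the position of db_path within the file is not specified here, it is left out of the check.

```python
REQUIRED_KEYS = ("bam_dir", "outdir", "flye.genome_size")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted keys from `required` that are absent in `config`."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]  # descend one level for nested keys
    return missing


# Placeholder values for illustration only.
example_config = {
    "bam_dir": "/path/to/bams",
    "outdir": "results",
    "flye": {"genome_size": "5m"},
}
print(missing_keys(example_config))
```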
Run the complete workflow:
conda activate smk
cd workflow
snakemake --cores 32 --use-conda
- Creates samtools index (.bai) files required by NanoPlot
- Index files are marked as temporary and automatically cleaned up
- Generates comprehensive sequencing statistics with NanoPlot
- Creates visualizations of read quality and length distributions
- Outputs HTML report and tabular statistics
- Converts BAM files to compressed FASTQ format
- Compressed files are temporary and cleaned after assembly
- Assembles long reads using Flye de novo assembler
- Produces assembly FASTA, assembly graph (GFA), and statistics
- Extracts quality metrics from assembled genomes
- Uses seqkit to calculate N50, contig count, GC content, etc.
- Combines QC metrics from all samples into a single summary table
- Facilitates comparison across all samples
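To make the assembly metrics mentioned above concrete (N50, contig count, GC content), here is a small Python sketch that computes them for a list of contig sequences. It only illustrates the definitions and is not how SeqKit implements them. The function name assembly_stats and the toy contigs are made up for the example.

```python
def assembly_stats(contigs):
    """Compute contig count, total length, N50 and GC% for a list of sequences."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of lengths
    # (longest first) reaches at least half of the total assembly length.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    gc_percent = round(100 * gc / total, 2) if total else 0.0
    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": gc_percent,
    }


# Toy contigs of length 20, 12 and 8.
toy = ["ATGC" * 5, "GGCC" * 3, "AT" * 4]
stats = assembly_stats(toy)
print(stats)
```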
- Evaluates assembly completeness using BUSCO against conserved bacterial orthologs
- Provides completeness scores and missing genes analysis
- Performs taxonomic classification using PGAP
- Validates assembly taxonomy with ANI-based methods
- Generates detailed taxonomic reports
- Creates consolidated QC summary from NanoPlot stats
- Generates BUSCO completeness summary across samples
- Generates taxonomic validation summary with top ANI hits
- Annotates assembled genomes using Bakta
- Identifies genes, CDS, tRNAs, tmRNAs, rRNAs, ncRNAs, and CRISPR arrays
- Identifies AMR genes with AMRFinderPlus
- Outputs annotation results in TSV format
- Extracts antimicrobial resistance genes from Bakta annotations
- Filters for NCBI Protein and NCBI Family entries
- Creates summary table with sample, contig, gene, and product information
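The AMR extraction step described above can be sketched in Python as follows. This is not the workflow's actual rule. It assumes the Bakta TSV has a "#"-prefixed header line with the columns Sequence Id, Gene, Product and DbXrefs, and that AMR-related entries carry DbXrefs tokens starting with NCBIProtein: or NCBIFam:. Both the prefixes and the toy rows (including the placeholder accession) are assumptions for illustration.

```python
import csv
import io

# Assumed cross-reference prefixes for "NCBI Protein" and "NCBI Family" entries.
AMR_PREFIXES = ("NCBIProtein:", "NCBIFam:")


def amr_rows(sample, handle, prefixes=AMR_PREFIXES):
    """Return sample/contig/gene/product records for rows with matching DbXrefs."""
    header = None
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue
        if fields[0].startswith("#"):
            # Comment lines come first; the last one is taken as the column header.
            header = [f.lstrip("#").strip() for f in fields]
            continue
        if header is None:
            raise ValueError("no '#'-prefixed header line found before data rows")
        record = dict(zip(header, fields))
        xrefs = [x.strip() for x in record.get("DbXrefs", "").split(",")]
        if any(x.startswith(prefixes) for x in xrefs):
            rows.append({
                "sample": sample,
                "contig": record.get("Sequence Id", ""),
                "gene": record.get("Gene", ""),
                "product": record.get("Product", ""),
            })
    return rows


# Made-up rows in the assumed layout; the accession is a placeholder.
toy = io.StringIO(
    "# Annotated with Bakta\n"
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t1\t900\t+\tTOY_00001\tblaTEM-1\tclass A beta-lactamase TEM-1\tNCBIProtein:WP_PLACEHOLDER\n"
    "contig_1\tcds\t1000\t1900\t-\tTOY_00002\tdnaA\tchromosomal replication initiator protein DnaA\tUniRef:UniRef90_TOY\n"
)
amr = amr_rows("sample_1", toy)
print(amr)
```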
results/
├── qc/
│   ├── sequencing/
│   │   └── nanoplot/
│   │       ├── report/
│   │       ├── stats/
│   │       ├── scatter/
│   │       └── nanoplot_qc_summary.tsv
│   └── assemblies/
│       ├── {sample}_assembly_stats.txt
│       └── assembly_qc_summary.txt
├── assemblies/
│   └── flye/
│       └── {sample}/
│           ├── assembly.fasta
│           ├── assembly_graph.gfa
│           └── assembly_info.txt
├── completeness/
│   └── busco/
│       ├── {sample}/
│       │   ├── full_table.tsv
│       │   ├── missing_busco_list.tsv
│       │   └── short_summary.txt
│       └── busco_summary_{lineage}.tsv
├── annotation/
│   ├── taxcheck/
│   │   ├── {sample}.ani-tax-report.txt
│   │   └── taxcheck_summary.tsv
│   └── bakta/
│       └── {sample}/
│           ├── {sample}.tsv
│           └── {sample}_amr_summary.tsv
├── benchmarks/
└── logs/
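When checking a finished run, it can help to list the per-sample files from the layout above. The following Python sketch builds those paths for one sample. The helper expected_outputs and its dictionary keys are hypothetical, and only files named in the tree are included.

```python
from pathlib import PurePosixPath


def expected_outputs(outdir, sample):
    """Map a short label to the per-sample output path under `outdir`."""
    root = PurePosixPath(outdir)
    bakta = root / "annotation" / "bakta" / sample
    return {
        "assembly": root / "assemblies" / "flye" / sample / "assembly.fasta",
        "assembly_stats": root / "qc" / "assemblies" / f"{sample}_assembly_stats.txt",
        "busco_summary": root / "completeness" / "busco" / sample / "short_summary.txt",
        "taxcheck": root / "annotation" / "taxcheck" / f"{sample}.ani-tax-report.txt",
        "annotation": bakta / f"{sample}.tsv",
        "amr_summary": bakta / f"{sample}_amr_summary.tsv",
    }


paths = expected_outputs("results", "sample_1")
for label, path in paths.items():
    print(f"{label}: {path}")
```

Swapping PurePosixPath for pathlib.Path makes it possible to call .exists() on each entry and report missing files.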
Test dataset in test_data/:
- Unaligned PacBio HiFi BAM files (100 reads)
- 2 samples: 77SG_Aer_FM and 137DG-Aer_FM
- NanoPlot - Long-read QC and visualization
- Flye - Long-read de novo assembly
- SeqKit - Sequence toolkit for statistics
- BUSCO - Benchmarking Universal Single-Copy Orthologs for completeness assessment
- PGAP - Prokaryotic Genome Annotation Pipeline for taxonomic validation
- Bakta - Rapid and standardized annotation of bacterial genomes
- Samtools - BAM file manipulation
If you use this workflow, please cite the respective tools:
- Flye: Kolmogorov M, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019.
- NanoPlot: De Coster W, et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018.
- SeqKit: Shen W, et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE. 2016.
- BUSCO: Seppey M, et al. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019.
- PGAP: Tatusova T, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016.
- Bakta: Schwengers O, et al. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 2021.
This project is licensed under the MIT License - see LICENSE.md for details.
For issues, questions, or contributions, please use the GitHub Issues page or submit a pull request.
This workflow follows the WorkflowHub best practices for reproducible bioinformatics workflows.