jdzm/smk-bac-assembly

smk-bac-assembly

Snakemake Bacterial Assembly Workflow

A Snakemake workflow for bacterial genome assembly, QC, and annotation using PacBio HiFi reads from the University of Bern NGS Platform.

Overview

This workflow automates the complete pipeline for bacterial genome assembly:

  1. QC on raw reads - NanoPlot generates comprehensive reports on sequencing data
  2. Assembly - Flye assembler produces high-quality long-read assemblies
  3. Assembly QC - Seqkit generates assembly statistics and summary reports
  4. Completeness assessment - BUSCO evaluates assembly completeness against conserved orthologs
  5. Taxonomic validation - PGAP performs taxonomic classification and validation
  6. Genome annotation - Bakta provides comprehensive prokaryotic genome annotation

Features

  • Modular design with separate rule files for each step
  • Conda environment management for reproducibility
  • Docker containerization for BUSCO and PGAP
  • Comprehensive logging and output organization
  • Automated sample discovery from CSV file
  • Support for batch processing multiple samples
  • Completeness assessment with BUSCO
  • Taxonomic validation with PGAP
  • Genome annotation with Bakta
  • Summary tables for QC and taxonomic metrics

Requirements

  • Snakemake 7.0+
  • Conda or Mamba
  • PacBio HiFi BAM files with corresponding barcode information

Installation

  1. Clone this repository:
git clone https://github.com/yourusername/smk-bac-assembly.git
cd smk-bac-assembly
  2. Install Snakemake (if not already installed):
conda create -n smk -c bioconda snakemake
conda activate smk

Configuration

Sample Setup

This workflow reads the sample sheet (CSV) provided by the NGS Platform (NGSP). Copy it to config/samples.csv without changes:
BioSample,Barcode
sample_1,bc2170--bc2170
sample_2,bc2171--bc2171
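
To illustrate how a sample sheet in this format can be turned into a sample-to-barcode mapping, here is a minimal Python sketch. The function name read_samples and the duplicate check are assumptions made for illustration; the workflow's own parsing code may differ.

```python
import csv
import io


def read_samples(handle):
    """Return {BioSample: Barcode} from an open samples.csv handle."""
    samples = {}
    for row in csv.DictReader(handle):
        name = row["BioSample"].strip()
        barcode = row["Barcode"].strip()
        # Assumption: sample names must be unique, so duplicates are rejected.
        if name in samples:
            raise ValueError(f"duplicate BioSample: {name}")
        samples[name] = barcode
    return samples


# The two example rows shown above, read from memory instead of a file.
example = io.StringIO(
    "BioSample,Barcode\n"
    "sample_1,bc2170--bc2170\n"
    "sample_2,bc2171--bc2171\n"
)
SAMPLES = read_samples(example)
print(SAMPLES)
```

For a real run, the same function can be called on open("config/samples.csv", newline="").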

Config File

Copy the configuration template (cp config/config_template.yaml config/config.yaml), then edit config/config.yaml to set:

  • bam_dir: Path to your BAM files
  • outdir: Output directory for results
  • flye.genome_size: Approximate genome size for your organism
  • Thread and memory allocations for each tool
  • db_path: Location of the Bakta database
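
A quick check that the required settings are present can catch mistakes before a long run. The sketch below assumes the configuration has already been loaded into a nested dictionary and that flye.genome_size is stored as genome_size under a flye key. The helper name missing_keys and the example values are hypothetical. The position of db_path in the file is not stated here, so it is left out of the default list.

```python
def missing_keys(config, required=("bam_dir", "outdir", "flye.genome_size")):
    """Return the dotted keys from `required` that are absent from a nested dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            # Walk one level down per dotted component; stop at the first gap.
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Illustrative values only; replace them with your own paths and genome size.
example_config = {
    "bam_dir": "/path/to/bams",
    "outdir": "results",
    "flye": {"genome_size": "5m"},
}

complete = missing_keys(example_config)
incomplete = missing_keys({"outdir": "results"})
print(complete)
print(incomplete)
```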

Usage

Run the complete workflow:

conda activate smk
cd workflow
snakemake --cores 32 --use-conda

Workflow Steps

1. Index BAM Files (index_bam)

  • Creates samtools index (.bai) files required by NanoPlot
  • Index files are marked as temporary and automatically cleaned up

2. Quality Control on Raw Reads (nanoplot_reads_qc)

  • Generates comprehensive sequencing statistics with NanoPlot
  • Creates visualizations of read quality and length distributions
  • Outputs HTML report and tabular statistics

3. Prepare FASTQ (prepare_fastq)

  • Converts BAM files to compressed FASTQ format
  • The compressed FASTQ files are temporary and are removed after assembly

4. Genome Assembly (flye_assembly)

  • Assembles long reads using Flye de novo assembler
  • Produces assembly FASTA, assembly graph (GFA), and statistics

5. Assembly QC (assembly_qc)

  • Extracts quality metrics from assembled genomes
  • Uses seqkit to calculate N50, contig count, GC content, etc.
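
For readers unfamiliar with these metrics, the following Python sketch shows how contig count, total length, N50, and GC content are defined. It is an illustration of the definitions only, not SeqKit's implementation, and the function names are made up for this example.

```python
def parse_fasta(text):
    """Yield one sequence string per FASTA record in `text`."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def assembly_stats(sequences):
    """Compute contig count, total length, N50, and GC percentage."""
    sequences = list(sequences)
    lengths = sorted((len(s) for s in sequences), reverse=True)
    total = sum(lengths)
    # N50: length of the contig at which the running sum of contig lengths,
    # taken from longest to shortest, first reaches half the total length.
    n50, running = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    gc_percent = round(100 * gc / total, 2) if total else 0.0
    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": gc_percent,
    }


toy_fasta = ">c1\nGGCCGGCC\n>c2\nATAT\n>c3\nGC\n"
stats = assembly_stats(parse_fasta(toy_fasta))
print(stats)
```

In the toy example the longest contig (8 bp) already covers more than half of the 14 bp total, so the N50 is 8.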

6. Summary Report (assembly_qc_summary)

  • Combines QC metrics from all samples into a single summary table
  • Facilitates comparison across all samples

7. Completeness Assessment (busco_run)

  • Evaluates assembly completeness using BUSCO against conserved bacterial orthologs
  • Provides completeness scores and an analysis of missing genes

8. Taxonomic Validation (pgap_taxcheck)

  • Performs taxonomic classification using PGAP
  • Validates assembly taxonomy with ANI-based methods
  • Generates detailed taxonomic reports

9. Summary Tables (nanoplot_qc_summary, busco_summary, taxcheck_summary)

  • Creates consolidated QC summary from NanoPlot stats
  • Generates BUSCO completeness summary across samples
  • Generates taxonomic validation summary with top ANI hits

10. Genome Annotation (bakta_annot)

  • Annotates assembled genomes using Bakta
  • Identifies genes, CDS, tRNAs, tmRNAs, rRNAs, ncRNAs, and CRISPR arrays
  • Identifies AMR genes with AMRFinderPlus
  • Outputs annotation results in TSV format

11. AMR Summary (amr_summary)

  • Extracts antimicrobial resistance genes from Bakta annotations
  • Filters for NCBI Protein and NCBI Family entries
  • Creates summary table with sample, contig, gene, and product information
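
The filtering step can be pictured with the Python sketch below. It assumes each annotation row has already been parsed into a dictionary. The column names (Sequence Id, Gene, Product, DbXrefs) and the marker strings NCBIProtein and NCBIFam are assumptions about the Bakta TSV layout, so check them against your own output files. The function name amr_rows and the example rows are invented for illustration.

```python
def amr_rows(sample, rows, markers=("NCBIProtein", "NCBIFam")):
    """Keep rows whose cross-references mention one of `markers`.

    Assumption: cross-references live in a "DbXrefs" column and the markers
    appear there as plain substrings.
    """
    hits = []
    for row in rows:
        xrefs = row.get("DbXrefs", "")
        if any(marker in xrefs for marker in markers):
            hits.append(
                {
                    "sample": sample,
                    "contig": row["Sequence Id"],
                    "gene": row["Gene"],
                    "product": row["Product"],
                }
            )
    return hits


# Made-up rows for illustration; they are not real annotation results.
example_rows = [
    {"Sequence Id": "contig_1", "Gene": "geneA", "Product": "example resistance protein",
     "DbXrefs": "NCBIProtein:EXAMPLE_1"},
    {"Sequence Id": "contig_1", "Gene": "geneB", "Product": "example housekeeping protein",
     "DbXrefs": "SO:0001217"},
]

summary = amr_rows("sample_1", example_rows)
print(summary)
```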

Output Structure

results/
├── qc/
│   ├── sequencing/
│   │   └── nanoplot/
│   │       ├── report/
│   │       ├── stats/
│   │       ├── scatter/
│   │       └── nanoplot_qc_summary.tsv
│   └── assemblies/
│       ├── {sample}_assembly_stats.txt
│       └── assembly_qc_summary.txt
├── assemblies/
│   └── flye/
│       └── {sample}/
│           ├── assembly.fasta
│           ├── assembly_graph.gfa
│           └── assembly_info.txt
├── completeness/
│   └── busco/
│       ├── {sample}/
│       │   ├── full_table.tsv
│       │   ├── missing_busco_list.tsv
│       │   └── short_summary.txt
│       └── busco_summary_{lineage}.tsv
├── annotation/
│   ├── taxcheck/
│   │   ├── {sample}.ani-tax-report.txt
│   │   └── taxcheck_summary.tsv
│   └── bakta/
│       └── {sample}/
│           ├── {sample}.tsv
│           └── {sample}_amr_summary.tsv
├── benchmarks/
└── logs/
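
To check that a run produced the main per-sample files, the paths in the tree above can be generated programmatically. This is a minimal sketch: the helper name expected_outputs is hypothetical, and only a subset of the files listed above is included.

```python
from pathlib import PurePosixPath


def expected_outputs(sample, outdir="results"):
    """Return a subset of the per-sample output paths shown in the tree above."""
    root = PurePosixPath(outdir)
    return [
        root / "assemblies" / "flye" / sample / "assembly.fasta",
        root / "qc" / "assemblies" / f"{sample}_assembly_stats.txt",
        root / "completeness" / "busco" / sample / "short_summary.txt",
        root / "annotation" / "taxcheck" / f"{sample}.ani-tax-report.txt",
        root / "annotation" / "bakta" / sample / f"{sample}.tsv",
    ]


paths = expected_outputs("sample_1")
for path in paths:
    print(path)
```

Swapping PurePosixPath for pathlib.Path allows calling .exists() on each entry after a run.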

Testing

Test dataset in test_data/:

  • Unaligned PacBio HiFi BAM files (100 reads)
  • Two samples: 77SG_Aer_FM and 137DG-Aer_FM

Documentation

Tools Used

  • NanoPlot - Long-read QC and visualization
  • Flye - Long-read de novo assembly
  • Seqkit - Sequence toolkit for statistics
  • BUSCO - Benchmarking Universal Single-Copy Orthologs for completeness assessment
  • PGAP - Prokaryotic Genome Annotation Pipeline for taxonomic validation
  • Bakta - Rapid and standardized annotation of bacterial genomes
  • Samtools - BAM file manipulation

Citation

If you use this workflow, please cite the respective tools:

  • Flye: Kolmogorov M, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019.
  • NanoPlot: De Coster W, et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018.
  • SeqKit: Shen W, et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE. 2016.
  • BUSCO: Seppey M, et al. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019.
  • PGAP: Tatusova T, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016.
  • Bakta: Schwengers O, et al. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 2021.

License

This project is licensed under the MIT License - see LICENSE.md for details.

Support

For issues, questions, or contributions, please use the GitHub Issues page or submit a pull request.

Acknowledgments

This workflow follows the WorkflowHub best practices for reproducible bioinformatics workflows.
