This repository contains a collection of Python and R scripts for analyzing Nanopore direct RNA sequencing (DRS) data, with a focus on viral read characterization and defective viral genome (DVG) analysis.
These scripts facilitate the analysis of Nanopore DRS data for viral genome characterization, including:
- Read length distribution visualization
- Coverage depth visualization
- Classification of spliced and unspliced viral reads
- Recombination junction site identification
- Poly(A) tail length visualization
- Chimeric read visualization
- R version: 4.5.0 or higher
- Platform: Compatible with macOS, Linux, and Windows
# Core packages
library(ggplot2) # v4.0.1
library(dplyr) # v1.1.4
library(data.table) # v1.17.8
library(cowplot) # v1.2.0
# Additional dependencies (auto-loaded)
# RColorBrewer, tidyselect, scales, etc.
- Python 3.x
- pysam
install.packages(c("ggplot2", "dplyr", "data.table", "cowplot"))
pip install pysam
bam2bed.py is a custom pysam-based Python script that converts BAM alignment files to BED format for downstream analysis in R.
Usage:
python bam2bed.py --input <input.bam> --output <output.bed> --label virus
We provide demo data for running the R scripts:
wget https://github.com/lrslab/DVG_ONT_scripts/releases/download/zenodo_version/demo_data.zip
unzip demo_data.zip
Plots the read length distribution of Nanopore DRS reads.
- Output: Read length boxplots (e.g., Figure S1)
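A boxplot is drawn from a five-number summary of the read lengths. The following is a minimal Python sketch of that summary, not the repository's R code; the function name `five_number_summary` and the demo lengths are hypothetical, and the quartile method used by the actual plotting script may differ.

```python
import statistics


def five_number_summary(lengths):
    """Return (min, Q1, median, Q3, max) for a list of read lengths."""
    if len(lengths) < 2:
        raise ValueError("need at least two read lengths")
    # method="inclusive" treats the data as the full population
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return min(lengths), q1, median, q3, max(lengths)


# Hypothetical read lengths (nt), for illustration only
demo_lengths = [100, 200, 300, 400, 500]
print(five_number_summary(demo_lengths))
```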
Plots the coverage depth of viral reads across the genome.
- Output: Coverage depth visualization (e.g., Figure 1C)
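Coverage depth at each genome position is the number of aligned blocks that overlap it. A minimal Python sketch of this calculation follows, assuming 0-based, half-open (start, end) intervals as in BED; `coverage_depth` and the demo intervals are hypothetical and are not taken from the repository's scripts.

```python
def coverage_depth(blocks, genome_length):
    """Per-base depth from 0-based, half-open (start, end) aligned blocks."""
    diff = [0] * (genome_length + 1)
    for start, end in blocks:
        # Clip blocks to the genome boundaries
        start = max(0, start)
        end = min(genome_length, end)
        if start >= end:
            continue
        diff[start] += 1   # depth rises where a block begins
        diff[end] -= 1     # and falls just past its last base
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


# Hypothetical aligned blocks on a 10-nt genome
demo_blocks = [(0, 5), (2, 8), (6, 10)]
print(coverage_depth(demo_blocks, 10))
```

The difference-array approach visits each block once and each position once, so it scales to many reads without nested loops.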
Classifies viral reads into two categories:
- Contiguous reads: Unspliced, continuous alignment
- Noncontiguous reads: Spliced, with gaps in alignment
Further classifies noncontiguous reads based on:
- Fragment count
- Strand orientation
- Assignment to genomic RNA (gRNA) or subgenomic RNA (sgRNA)
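The first level of this classification can be sketched in Python as follows. The sketch assumes that alignment gaps are encoded as `N` (reference skip) operations in the CIGAR string, as splice-aware aligners typically report them; the repository's scripts may use different criteria. The function names are hypothetical, and strand orientation and gRNA/sgRNA assignment are not covered here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_fragments(cigar):
    """Number of aligned fragments: one more than the number of N operations."""
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that are not fully made of valid CIGAR operations
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return 1 + sum(1 for _, op in ops if op == "N")


def classify_read(cigar):
    """Return (label, fragment_count) for one alignment."""
    fragments = count_fragments(cigar)
    label = "contiguous" if fragments == 1 else "noncontiguous"
    return label, fragments


# Hypothetical CIGAR strings, for illustration only
for demo_cigar in ["100M", "50M2000N30M5S", "10S40M500N20M1000N30M"]:
    print(demo_cigar, classify_read(demo_cigar))
```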
Plots the start and end positions of contiguous reads.
- Output: Position distribution plot (e.g., Figure S8)
Plots the start and end positions of noncontiguous reads.
- Output: Spliced read position plot (e.g., Figure 1E)
Plots the start and end positions of virus-host chimeric reads.
- Output: Chimeric read visualization (e.g., Figure 1E)
Analyzes and plots the poly(A) tail length distribution for different classes of reads.
- Output: Poly(A) length distribution (e.g., Figure S12)
Visualizes the junction sites of noncontiguous reads to identify recombination events.
- Output: Recombination junction plot (e.g., Figure 4A)
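A junction site can be described by the genome coordinates on either side of an alignment gap. The sketch below is a hypothetical illustration in Python, not the repository's R code: it assumes each read is given as a list of 0-based, half-open (start, end) blocks and counts how often each (left, right) junction pair occurs.

```python
from collections import Counter


def junctions(blocks):
    """Return (left, right) pairs between consecutive aligned blocks of one read.

    left is the end of the upstream block, right the start of the next block.
    """
    ordered = sorted(blocks)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def count_junctions(reads):
    """Count identical junction pairs across all reads."""
    counts = Counter()
    for blocks in reads:
        counts.update(junctions(blocks))
    return counts


# Hypothetical reads, for illustration only
demo_reads = [
    [(0, 70), (25000, 29000)],
    [(5, 70), (25000, 28500)],
    [(0, 1200)],                 # contiguous read: contributes no junction
    [(0, 65), (27000, 29000)],
]
for (left, right), n in count_junctions(demo_reads).most_common():
    print(left, right, n)
```

Junction pairs supported by many reads are the natural candidates to highlight in a junction plot.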
Displays the read distribution along the viral reference genome, showing coverage patterns.
- Output: Genome segment coverage plot (e.g., Figure 6A)
- Convert BAM to BED format:
python bam2bed.py --input aligned_reads.bam --output output.bed --label virus
- Run the R analysis scripts in order:
source("1_Read Length Boxplot.R")
source("2_Coverage Plot.R")
source("3_Viral Spliced and Unspliced Reads.R")
# ... continue with other scripts as needed
- Customize parameters within each script according to your data and analysis needs.
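To make the BAM to BED conversion step concrete, here is a minimal Python sketch of how one alignment record maps to a BED12 line. It is not the repository's bam2bed.py: it works on SAM text (for example, the output of `samtools view`) so that it runs without pysam, and the BED12 layout, the function names, and the way the label is prefixed to the read name are all assumptions made for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_blocks(pos0, cigar):
    """Split an alignment into reference blocks at N (skip) operations."""
    blocks, block_start, ref = [], pos0, pos0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":      # consumes reference, stays in the current block
            ref += length
        elif op == "N":       # close the block and jump over the gap
            blocks.append((block_start, ref))
            ref += length
            block_start = ref
        # I, S, H, P do not consume reference positions
    blocks.append((block_start, ref))
    return blocks


def sam_to_bed12(sam_line, label="virus"):
    """Convert one SAM record to a BED12 line (label handling is hypothetical)."""
    f = sam_line.rstrip("\n").split("\t")
    name, flag, chrom, pos, cigar = f[0], int(f[1]), f[2], int(f[3]), f[5]
    strand = "-" if flag & 16 else "+"          # 0x10 = reverse strand
    blocks = cigar_to_blocks(pos - 1, cigar)    # SAM POS is 1-based
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, f"{label}|{name}", 0, strand,
              start, end, "0,0,0", len(blocks), sizes, starts]
    return "\t".join(map(str, fields))


# Hypothetical SAM record, for illustration only
demo = "read1\t0\tref\t11\t60\t5S50M100N30M\t*\t0\t0\t*\t*"
print(sam_to_bed12(demo))
```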
The scripts were developed and tested in the following environment:
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.0.1
attached packages:
[1] cowplot_1.2.0 data.table_1.17.8 dplyr_1.1.4 ggplot2_4.0.1
For questions or issues, please open an issue in this repository or contact Dr. TAN Lu.
Note: Ensure all input files are properly formatted and paths are correctly specified in the scripts before running.