This repository contains the full pipeline for processing and quality-controlling Damage-seq data. It includes trimming, alignment, filtering, deduplication, k-mer QC, and dipyrimidine filtering steps.
damage-seq-pipeline/
├── src/
│ ├── 1_mycutadapt.py # Adapter trimming
│ ├── 2_humanbowtie2pairend.py # Alignment to genome using Bowtie2
│ ├── 3_samtobam_pair.py # Convert SAM to BAM, paired-end filtering
│ ├── 4_ParseSort.py # Parse BAM and sort BED
│ ├── 5_Deduplicate.py # Deduplicate reads
│ ├── 6_sortbamtobedpetobed.sh # Sort and convert BAM to BED
│ ├── 7_bedto10bed.sh # Convert BED to 10nt windows
│ ├── 8_Sort_bed_LL.py # Final sort of 10nt BED
│ ├── 9_fastatobed.py # Convert filtered FASTA to BED
│ ├── 10_fa2kmerAbundanceTable.py # Generate k-mer frequency table
│ ├── 11_dipyrimidine_filter.py # Filter dipyrimidine-containing reads
│ └── 12_fasta2bed.py # Convert filtered FASTA to BED
├── qc/
│ ├── dipyrimidine_qc_plot.R # R script for dipyrimidine bar plots
│ ├── sequence.py # Dinucleotide sequence QC
│ └── fasta.py # K-mer frequency analysis
├── archive/
│ └── (legacy or unused scripts)
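Two of the steps named in the tree above, k-mer QC and dipyrimidine filtering, can be illustrated with a minimal Python sketch. This is not the code in src/10_fa2kmerAbundanceTable.py or src/11_dipyrimidine_filter.py. The function names, the toy sequences, and the 0-based offset `pos` at which a dipyrimidine is expected are all assumptions made for illustration.

```python
from collections import Counter

PYRIMIDINES = {"C", "T"}


def kmer_counts(seqs, k=2):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def has_dipyrimidine(seq, pos):
    """Return True if the two bases starting at 0-based index pos are both C or T."""
    pair = seq.upper()[pos:pos + 2]
    return len(pair) == 2 and all(base in PYRIMIDINES for base in pair)


def filter_dipyrimidine(seqs, pos):
    """Keep only sequences with a dipyrimidine at the given position."""
    return [seq for seq in seqs if has_dipyrimidine(seq, pos)]


if __name__ == "__main__":
    # Toy 10-nt sequences; pos=4 is a hypothetical damage-site offset.
    reads = ["AAGTTCAGGA", "GGACCTGAAC", "ACGTAGGTAC"]
    kept = filter_dipyrimidine(reads, pos=4)
    print(kept)
    print(kmer_counts(kept, k=2).most_common(3))
```

The position is left as a parameter because the offset used by the actual scripts is not stated here.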
- Python 3
- R with ggplot2, wesanderson, ggthemes, tidyr, dplyr
- Samtools
- Bedtools
- Cutadapt
- Bowtie2
- Adapter trimming: src/1_mycutadapt.py
- Align with Bowtie2: src/2_humanbowtie2pairend.py
- Convert to BAM and filter pairs: src/3_samtobam_pair.py
- Sort BAM to BED: src/4_ParseSort.py and src/6_sortbamtobedpetobed.sh
- Deduplicate: src/5_Deduplicate.py
- Create 10nt fragments: src/7_bedto10bed.sh
- Sort 10nt BEDs: src/8_Sort_bed_LL.py
- Get FASTA and filter dipyrimidines: src/9_fastatobed.py, src/11_dipyrimidine_filter.py
- Generate QC plots: qc/dipyrimidine_qc_plot.R
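Because the steps above run in sequence, one way to submit them is as a chain of dependent SLURM jobs. The sketch below is an assumption about how that could be done, not how this repository does it: `sbatch_command` and `submit_chain` are hypothetical helpers, and whether the scripts need arguments is not stated here. It relies only on the standard sbatch options `--parsable` (print just the job ID) and `--dependency=afterok:<jobid>` (start only after that job succeeds).

```python
import subprocess


def sbatch_command(script, after_job=None):
    """Build the sbatch command for one script, optionally waiting on a prior job."""
    cmd = ["sbatch", "--parsable"]
    if after_job is not None:
        cmd.append(f"--dependency=afterok:{after_job}")
    cmd.append(script)
    return cmd


def submit_chain(scripts, submit=None):
    """Submit scripts in order so each starts only after the previous one succeeds.

    submit takes a command list and returns sbatch's stdout; by default it
    really calls sbatch, but a stand-in can be passed for a dry run.
    """
    if submit is None:
        def submit(cmd):
            result = subprocess.run(cmd, check=True, capture_output=True, text=True)
            return result.stdout

    job_ids = []
    previous = None
    for script in scripts:
        stdout = submit(sbatch_command(script, previous))
        # --parsable prints "jobid" or "jobid;cluster"
        previous = stdout.strip().split(";")[0]
        job_ids.append(previous)
    return job_ids


if __name__ == "__main__":
    # Dry run with made-up job IDs: prints the commands without contacting SLURM.
    fake_ids = iter(["101", "102", "103"])

    def dry_run(cmd):
        print(" ".join(cmd))
        return next(fake_ids)

    submit_chain(
        [
            "src/1_mycutadapt.py",
            "src/2_humanbowtie2pairend.py",
            "src/3_samtobam_pair.py",
        ],
        submit=dry_run,
    )
```

Passing the submit function as a parameter keeps the chaining logic testable on a machine without SLURM.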
- Ensure all dependencies are loaded before running (e.g., via modules or conda).
- Input data should be gzipped paired-end FASTQ files.
- All scripts are configured for SLURM submission with sbatch.
- The filtered.bed files from two biological replicates can be merged and placed in a new folder, "merged_bed".
- For XR-seq around histone markers: stranded XR-seq merged BED files are generated by strandedmergedbedfiles.sh.
- Simulations are applied to merged.bed.
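The replicate-merging note above can be sketched in Python as follows. This assumes "merged" means concatenating the two filtered.bed files and re-sorting by chromosome, start, and end; it does not collapse overlapping intervals the way `bedtools merge` does. The helper name `merge_bed` and the example paths are hypothetical.

```python
from pathlib import Path


def merge_bed(inputs, output):
    """Concatenate tab-separated BED files and sort by chromosome, start, end.

    Returns the number of records written.
    """
    rows = []
    for path in inputs:
        with open(path) as handle:
            for line in handle:
                # Skip blank lines and optional BED header lines.
                if not line.strip() or line.startswith(("#", "track", "browser")):
                    continue
                rows.append(line.rstrip("\n").split("\t"))

    # Chromosomes sort lexicographically; coordinates sort numerically.
    rows.sort(key=lambda fields: (fields[0], int(fields[1]), int(fields[2])))

    output = Path(output)
    output.parent.mkdir(parents=True, exist_ok=True)  # e.g. creates merged_bed/
    with open(output, "w") as handle:
        for fields in rows:
            handle.write("\t".join(fields) + "\n")
    return len(rows)
```

With two replicate folders named rep1 and rep2 (placeholder names), a call would look like `merge_bed(["rep1/filtered.bed", "rep2/filtered.bed"], "merged_bed/merged.bed")`.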
Cansu Kose, 2025 – UNC Chapel Hill | Sancar Lab