Damage-Seq Pipeline

This repository contains the full pipeline for processing and quality-controlling Damage-seq data. It includes trimming, alignment, filtering, deduplication, k-mer QC, and dipyrimidine filtering steps.

Directory Structure

damage-seq-pipeline/
├── src/
│   ├── 1_mycutadapt.py               # Adapter trimming
│   ├── 2_humanbowtie2pairend.py      # Alignment to genome using Bowtie2
│   ├── 3_samtobam_pair.py            # Convert SAM to BAM, paired-end filtering
│   ├── 4_ParseSort.py                # Parse BAM and sort BED
│   ├── 5_Deduplicate.py              # Deduplicate reads
│   ├── 6_sortbamtobedpetobed.sh      # Sort and convert BAM to BED
│   ├── 7_bedto10bed.sh               # Convert BED to 10nt windows
│   ├── 8_Sort_bed_LL.py              # Final sort of 10nt BED
│   ├── 9_fastatobed.py               # Convert filtered FASTA to BED
│   ├── 10_fa2kmerAbundanceTable.py   # Generate k-mer frequency table
│   ├── 11_dipyrimidine_filter.py     # Filter dipyrimidine-containing reads
│   └── 12_fasta2bed.py               # Convert filtered FASTA to BED
├── qc/
│   ├── dipyrimidine_qc_plot.R        # R script for dipyrimidine bar plots
│   ├── sequence.py                   # Dinucleotide sequence QC
│   └── fasta.py                      # K-mer frequency analysis
└── archive/
    └── (legacy or unused scripts)

Requirements

Python 3
R with ggplot2, wesanderson, ggthemes, tidyr, dplyr
Samtools
Bedtools
Cutadapt
Bowtie2
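
Before submitting jobs, it can help to confirm that the command-line tools listed above are actually on the PATH. The following is a minimal sketch and not part of the repository: the function name `missing_tools` is hypothetical, and the executable names (`cutadapt`, `bowtie2`, `samtools`, `bedtools`, `Rscript`) are assumed to be the standard ones installed by each package.

```python
import shutil

# Executable names assumed from the Requirements list; adjust if your
# cluster installs them under different names.
REQUIRED_TOOLS = ["cutadapt", "bowtie2", "samtools", "bedtools", "Rscript"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` after loading modules or activating a conda environment returns an empty list when everything is available. R package availability (ggplot2 and the others) is not checked by this sketch.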

Run Instructions

Adapter trimming: src/1_mycutadapt.py
Align with Bowtie2: src/2_humanbowtie2pairend.py
Convert to BAM and filter pairs: src/3_samtobam_pair.py
Sort BAM to BED: src/4_ParseSort.py and src/6_sortbamtobedpetobed.sh
Deduplicate: src/5_Deduplicate.py
Create 10nt fragments: src/7_bedto10bed.sh
Sort 10nt BEDs: src/8_Sort_bed_LL.py
Get FASTA and filter dipyrimidines: src/9_fastatobed.py, src/11_dipyrimidine_filter.py
Generate QC plots: qc/dipyrimidine_qc_plot.R
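
To illustrate the dipyrimidine filtering step, here is a minimal sketch of how such a filter could work. It is not the code in src/11_dipyrimidine_filter.py: the function names are hypothetical, and the position that the real script checks is not stated in this README, so the sketch takes it as a parameter and, by default, accepts a dipyrimidine anywhere in the sequence.

```python
# The four dipyrimidine dinucleotides (pyrimidines are C and T).
DIPYRIMIDINES = {"CC", "CT", "TC", "TT"}


def has_dipyrimidine(seq, pos=None):
    """Check a sequence for a dipyrimidine.

    pos=None: look for a dipyrimidine anywhere in the sequence.
    pos=i:    require one at 0-based positions i and i+1 (an assumption;
              the position used by the actual pipeline is not given here).
    """
    seq = seq.upper()
    if pos is None:
        return any(seq[i:i + 2] in DIPYRIMIDINES for i in range(len(seq) - 1))
    return seq[pos:pos + 2] in DIPYRIMIDINES


def filter_records(records, pos=None):
    """Keep only (name, sequence) pairs whose sequence passes the check."""
    return [(name, seq) for name, seq in records if has_dipyrimidine(seq, pos)]
```

For example, `filter_records([("r1", "AAAATTAAAA"), ("r2", "AAGAAGAGAA")])` keeps only `r1`, because `r2` contains no two adjacent pyrimidines.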

Notes

Ensure all dependencies are loaded before running (e.g., via modules or conda).
Input data should be gzipped paired-end FASTQ files.
All scripts are configured for SLURM submission with sbatch.
The filtered.bed files from two biological replicates can be merged and placed in a new folder, "merged_bed".
For XR-seq analysis around histone markers, stranded XR-seq merged BED files are generated by strandedmergedbedfiles.sh.
Simulations are applied to merged.bed.
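
The merge of replicate BED files mentioned above could be sketched as follows. This README does not say how the merge is done, so the sketch assumes the simplest approach: concatenate the files and sort by chromosome, start, and end. The function names and the example file paths are hypothetical.

```python
from pathlib import Path


def parse_bed_line(line):
    """Split a tab-delimited BED line into (chrom, start, end, all_fields)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields


def merge_bed_files(inputs, output):
    """Concatenate BED files, sort by (chrom, start, end), and write the result.

    Chromosome names are sorted lexicographically. Returns the number of
    records written. The output folder is created if it does not exist.
    """
    rows = []
    for path in inputs:
        with open(path) as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    rows.append(parse_bed_line(line))
    rows.sort(key=lambda row: row[:3])

    out = Path(output)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w") as handle:
        for *_, fields in rows:
            handle.write("\t".join(fields) + "\n")
    return len(rows)
```

A call such as `merge_bed_files(["rep1/filtered.bed", "rep2/filtered.bed"], "merged_bed/merged.bed")` would write the combined, sorted records to the new folder. Duplicate intervals across replicates are kept as separate records in this sketch.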

Author

Cansu Kose, 2025 – UNC Chapel Hill | Sancar Lab
