This document introduces a complete workflow for cohort CSV (Complex Structural Variation) analysis using the HitSV tool. The workflow targets Long Read Sequencing (LRS) data and runs from single-sample variant detection through population CSV region detection, frequency-based screening, precise Force Calling, contig filtering, and clustering analysis.
The workflow mainly includes the following core functions:
- Preliminary variant detection at the single-sample level
- CSV region identification based on population data
- Screening of medium- and high-frequency CSV regions
- High-precision Force Calling variant detection
- Filtering and storage of Contig sequences
- Contig clustering analysis based on RM-TRF annotation
This workflow is designed for structural variation analysis of large-scale population samples and helps researchers efficiently identify CSV regions of biological significance in the population.
- Workflow Overview
- Data Preparation
- Single-Sample Variant Detection
- Single-Sample CSV Detection and Population CSV Region Detection
- Screening Medium- and High-Frequency Population CSV
- Recall Local Contig Sequences for Each Sample Based on BED File
- Contig Filtering
- Store Variants in Corresponding Regions
- Contig Clustering Analysis
- Single-Sample Variant Detection: Perform preliminary variant detection for each sample, generating one VCF file per sample
- Population CSV Region Detection: Analyze the VCF files of all samples to identify candidate CSV regions
- Screening Medium- and High-Frequency Population CSV: Retain CSV regions with sufficient sample support, based on allele frequency
- Force Calling: Re-detect variants at high precision in each sample within the screened CSV regions, generating contig sequences
- Contig Filtering: Remove contigs with insufficient read support
- Variant Region Storage: Classify and store the filtered contigs by region
- Contig Clustering Analysis: Annotate contigs with RepeatMasker and TRF (RM-TRF) and perform further clustering analysis
- BAM.list: A file listing, for every sample, the storage path of its BAM result and the sample name.
- PRESET: The type of LRS sequencing (ONT_Q26, ASM, ERR_PRONE, etc.).
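The exact layout of BAM.list is not shown in this document; the awk commands used throughout imply two whitespace-separated columns, BAM path then sample name. A minimal sketch of such a file and of how a per-sample job list is generated from it (the paths and sample names below are invented):

```shell
DEMO=$(mktemp -d)   # scratch directory for the toy example
cd "$DEMO"
# Hypothetical BAM.list: column 1 = BAM path, column 2 = sample name
printf '/data/bams/S01.bam\tS01\n/data/bams/S02.bam\tS02\n' > BAM.list
# Same pattern the workflow uses to turn BAM.list into a batch of jobs
awk '{printf "sbatch %s ./RST/%s ERR_PRONE\n",$1,$2}' BAM.list > run_all_SV_calling.sh
cat run_all_SV_calling.sh
```

Each row of BAM.list becomes one submission line, so the whole cohort is processed with a single `bash run_all_SV_calling.sh`.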
```bash
#!/bin/bash
# Per-sample parameters
INPUT=$1       # BAM file of the sample
WORK_DIR=$2    # output directory
PRESET=$3      # LRS preset (e.g. ERR_PRONE)
# Fixed parameters
GCSV_TOOL="HitSVL call "
REF=GRCh38.fa
mkdir -p ${WORK_DIR}
cd ${WORK_DIR}
BAM_ORI=${INPUT}
VCF=${WORK_DIR}/D.vcf
LOG=${WORK_DIR}/D.log
${GCSV_TOOL} -p ${PRESET} -l ${BAM_ORI} -r ${REF} -o ${VCF} 2> ${LOG}
```

Generate and submit one SV-calling job per sample from BAM.list:

```bash
cat BAM.list | awk '{printf "sbatch %s ./RST/%s ERR_PRONE \n",$1,$2}' > run_all_SV_calling.sh
bash run_all_SV_calling.sh
```

Per-sample CSV region analysis script (submitted below as `csv_region_500w_ana.sh`):

```bash
#!/bin/bash
# Per-sample parameters
SAMPLE_NAME=$1
IN_VCF=./${SAMPLE_NAME}/D.vcf
WORK_DIR=./${SAMPLE_NAME}/
# Fixed parameters
GCSV_TOOL="HitSV "
REF=GRCh38.fa
mkdir -p ${WORK_DIR}
cd ${WORK_DIR}
LOG=${WORK_DIR}/D.log
# Detect candidate CSV regions from the single-sample VCF
${GCSV_TOOL} tools csv_region_500w ${IN_VCF} 500 50 > ${SAMPLE_NAME}.main.bed 2> ${LOG}
# Keep the 500W records of each haplotype, deduplicated
cat ${SAMPLE_NAME}.main.bed | grep "HAP1" | grep "500W" | sort | uniq > ${SAMPLE_NAME}.hap1.bed
cat ${SAMPLE_NAME}.main.bed | grep "HAP2" | grep "500W" | sort | uniq > ${SAMPLE_NAME}.hap2.bed
```

Create the directory for the population-level analysis results:

```bash
mkdir ./RST_ANA_CSV/
```
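The per-sample haplotype-BED extraction above and the population screening performed in the following steps can be sketched end-to-end on toy data. The column layout of `main.bed` is an assumption (the real layout comes from `HitSV tools csv_region_500w`), and the 567-sample cohort size used to interpret the threshold is inferred from the ONT_567 paths later in this document:

```shell
DEMO=$(mktemp -d)
cd "$DEMO"
# Toy stand-in for one sample's main.bed; the real columns may differ
printf 'chr1\t100000\t160000\tHAP1\t500W\n'  > S01.main.bed
printf 'chr1\t100000\t160000\tHAP1\t500W\n' >> S01.main.bed
printf 'chr2\t500000\t560000\tHAP2\t500W\n' >> S01.main.bed
printf 'chr3\t700000\t710000\tHAP1\tOTHER\n' >> S01.main.bed
# Per-sample step: keep 500W records of each haplotype, deduplicated
grep "HAP1" S01.main.bed | grep "500W" | sort | uniq > S01.hap1.bed
grep "HAP2" S01.main.bed | grep "500W" | sort | uniq > S01.hap2.bed
# Population step: pretend chr1 was reported on 60 haplotypes, chr2 on 3
for i in $(seq 60); do printf 'chr1\t100000\t160000\n'; done >  ALL_single_hap.bed
for i in $(seq 3);  do printf 'chr2\t500000\t560000\n'; done >> ALL_single_hap.bed
awk '{printf "%s:%s-%s\n",$1,$2,$3}' ALL_single_hap.bed | sort | uniq -c > ALL_single_hap_summary.txt
# Threshold: >= 56 supporting haplotypes; with 567 samples
# (2 x 567 = 1134 haplotypes), 56/1134 is an allele frequency of about 0.05
awk '{split($2,A,":"); split(A[2],B,"-"); if($1 >= 56) printf "%s\t%s\t%s\t%s\n",A[1],B[1],B[2],$1}' \
    ALL_single_hap_summary.txt > AF_0.05_CSV_AF.bed
cat AF_0.05_CSV_AF.bed   # only the chr1 region passes
```

The `sort | uniq` pass collapses duplicate records within a haplotype, while `uniq -c` on the pooled list turns each `chr:start-end` string into a haplotype count that the threshold is applied to.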
Submit one region-analysis job per sample:

```bash
cat BAM.list | awk '{printf "sbatch csv_region_500w_ana.sh %s \n",$2}' > run_all_csv_region_500w_ana.sh
bash run_all_csv_region_500w_ana.sh > run_all_csv_region_500w_ana.bash.log &
```

Pool the per-sample haplotype BEDs and count the supporting haplotypes of each region:

```bash
cd ./RST_ANA_CSV/
cat */*hap*bed > ALL_single_hap.bed
cat ALL_single_hap.bed | awk '{printf "%s:%s-%s\n",$1,$2,$3;}' | sort | uniq -c > ALL_single_hap_summary.txt
```

Keep regions supported by at least 56 haplotypes. Assuming the 567-sample cohort implied by the ONT_567 paths later in the workflow (2 × 567 = 1134 haplotypes), this corresponds to an allele frequency of roughly 56/1134 ≈ 0.05:

```bash
cat ALL_single_hap_summary.txt | awk '{split($2, A, ":"); split(A[2], B, "-"); if($1 >= 56) printf "%s\t%s\t%s\t%s\n", A[1],B[1],B[2],$1}' > AF_0.05_CSV_AF.bed
```

Sort and merge the screened regions:

```bash
cd ./RST_ANA_CSV/
bedtools sort -i AF_0.05_CSV_AF.bed > AF_0.05_CSV_AF.sort.bed
bedtools merge -i AF_0.05_CSV_AF.sort.bed > AF_0.05_CSV_AF.sort.merge.bed
```

Per-sample Force Calling script (submitted below as `single_sample.sh`):

```bash
#!/bin/bash
# Parameters
BAM_FILE=$1
OUT_DIR=$2
BED_FILE=$3
PRESET=$4
# Fixed parameters
GCSV_TOOL="HitSVL call "
REF=GRCh38.fa
mkdir -p ${OUT_DIR}
cd ${OUT_DIR}
${GCSV_TOOL} -p ${PRESET} -b ${BED_FILE} -l ${BAM_FILE} -r ${REF} -o ./HitSV_FC_CSV_RST.txt 2> ./HitSV_FC_CSV_RST.log
```

Submit one Force Calling job per sample:

```bash
cd ./
cat BAM.list | awk '{printf "sbatch ./single_sample.sh %s ./RST/%s/ ./AF_0.05_CSV_AF.sort.merge.bed ERR_PRONE \n",$1,$2;}' > ./single_FC_HitSV_run_ALL.sh
bash ./single_FC_HitSV_run_ALL.sh > ./single_FC_HitSV_run_ALL.log &
```
Filter as needed; here, contigs supported by fewer than 2 reads are removed.
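The filter below relies on column 4 of `HitSV_FC_CSV_RST.txt` encoding the read support as something like `KEY=<reads>:<extra>`; that layout, and the `SUPPORT` key name, are inferences from the awk expression, and the toy records here are invented:

```shell
DEMO=$(mktemp -d)
cd "$DEMO"
# Invented records; only the shape of column 4 (KEY=<reads>:<extra>) matters
printf 'chr1\t100000\tcontigA\tSUPPORT=3:0.9\n'  > HitSV_FC_CSV_RST.txt
printf 'chr1\t100000\tcontigB\tSUPPORT=1:0.5\n' >> HitSV_FC_CSV_RST.txt
printf 'chr2\t500000\tcontigC\tSUPPORT=2:0.8\n' >> HitSV_FC_CSV_RST.txt
# Same filter as the workflow: split on "=", then ":", keep support > 1
awk '{split($4,A,"="); split(A[2],B,":"); if(B[1] > 1) print $0}' \
    HitSV_FC_CSV_RST.txt > HitSV_FC_CSV_RST.FILTER.txt
cat HitSV_FC_CSV_RST.FILTER.txt
```

On this toy input, contigB (support 1) is dropped while contigA and contigC survive.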
```bash
#!/bin/bash
# Parameters
IN_DIR=$1
cd ${IN_DIR}
# Drop contigs whose read support equals 1
cat HitSV_FC_CSV_RST.txt | awk '{split($4,A,"="); split(A[2],B,":"); if(B[1] > 1) print $0}' > HitSV_FC_CSV_RST.FILTER.txt
```

Submit one filtering job per sample:

```bash
cd ./
cat BAM.list | awk '{printf "sbatch ./single_sample_FILTER.sh ./RST/%s/\n",$2;}' > ./single_sample_FILTER_run_ALL.sh
bash ./single_sample_FILTER_run_ALL.sh > ./single_sample_FILTER_run_ALL.log
```

List every sample's filtered result, then split the contigs into per-region files:

```bash
cd ./
cat BAM.list | awk '{printf "./RST/%s/HitSV_FC_CSV_RST.FILTER.txt\n",$2}' > ont_contig_string_data_ALL_FL.txt
HitSVL tools contig_file_split ont_contig_string_data_ALL_FL.txt ./CONTIG_IN_REGION/ GRCh38.fa
```

Per-region RM-TRF annotation and clustering script (submitted below as `run_single_ANNO.sh`):

```bash
#!/bin/bash
# Parameters
PREFIX=$1    # region name, e.g. chr1_100000_160000
WORK_DIR=$2
FC_DIR=$3
# Fixed parameters
GCSV_TOOL="HitSVL "
REF_38=GRCh38.fa
RM_TOOL=RepeatMasker
TRF_TOOL=trf
cd ${WORK_DIR}
mkdir -p ${PREFIX}
cd ${PREFIX}
cp ${FC_DIR}/${PREFIX} ./${PREFIX}.fa
# Execute RepeatMasker (RM) annotation
${RM_TOOL} ${PREFIX}.fa > rm_log.txt 2> rm_log2.txt
# Execute Tandem Repeats Finder (TRF) annotation
${TRF_TOOL} ${PREFIX}.fa 2 7 7 80 10 10 500 -h -d -ngs 1> ${PREFIX}.fa.trf.out 2> trf_log2.txt
# Execute the population CSV clustering analysis on the RM and TRF outputs
${GCSV_TOOL} tools fc_joint_ana ${PREFIX}.fa ${PREFIX}.fa.out ${PREFIX}.fa.trf.out 1
```

Submit one annotation job per merged CSV region (the merged BED has three columns, so the region name is built from fields $1, $2, and $3):

```bash
cd ./RST/
cat AF_0.05_CSV_AF.sort.merge.bed | awk '{printf "sbatch run_single_ANNO.sh %s_%s_%s ONT_567/RST/ ./CONTIG_IN_REGION/ \n",$1,$2,$3;}' > ONT_567/run_all.sh
bash ONT_567/run_all.sh > ONT_567/run_all.log &
```
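Each merged region becomes a `chr_start_end` prefix, assumed here to match the per-region contig file names written by `contig_file_split` under `./CONTIG_IN_REGION/`. A sketch of how the annotation job list is derived from the merged BED (file contents are invented):

```shell
DEMO=$(mktemp -d)
cd "$DEMO"
mkdir -p ONT_567
# Toy merged BED: three columns, chrom / start / end
printf 'chr1\t100000\t160000\n'  > AF_0.05_CSV_AF.sort.merge.bed
printf 'chr2\t500000\t560000\n' >> AF_0.05_CSV_AF.sort.merge.bed
# One annotation job per merged CSV region; the chr_start_end prefix is
# assumed to name the matching per-region file in ./CONTIG_IN_REGION/
awk '{printf "sbatch run_single_ANNO.sh %s_%s_%s ONT_567/RST/ ./CONTIG_IN_REGION/\n",$1,$2,$3}' \
    AF_0.05_CSV_AF.sort.merge.bed > ONT_567/run_all.sh
cat ONT_567/run_all.sh
```

A two-region BED therefore yields two independent annotation jobs, which is what lets the RM-TRF step scale across the cluster.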