Create_HC_Subset
The Create_HC_Subset handler creates a single variant call format (VCF) file that contains only the high-confidence (HC) sites for your samples. Filtering is performed in multiple steps using several user-defined parameters, and percentile tables are generated before and after filtering. The steps are as follows:
1. The genomic part VCF files from Genotype_GVCFs are gzipped, while preserving the original files.
2. The gzipped files are merged using VCFtools into a single VCF file.
3. If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. Otherwise, this step is skipped.
4. Insertions and deletions (indels) are filtered out using VCFtools.
5. Percentile tables of DP per sample and GQ are generated for the "raw" VCF file.
6. A custom python3 script is run to filter out low-confidence sites based on the user parameters defined in the configuration file.
7. Percentile tables of DP per sample and GQ are generated for the filtered VCF file.
8. If the organism is barley, the parts positions are converted into pseudomolecular positions using a custom python3 script. Otherwise, this step is skipped.
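The percentile tables mentioned in the steps above summarize how DP and GQ values are distributed across samples. The following is a minimal sketch of how such a table could be computed, not the handler's actual helper script. The function name `percentile_table`, the chosen percentile levels, and the nearest-rank method are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence


def percentile_table(
    values: Iterable[int],
    percentiles: Sequence[int] = (1, 5, 10, 25, 50, 75, 90, 95, 99),
) -> Dict[int, int]:
    """Return {percentile: value} using the nearest-rank method (an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("at least one value is required")
    n = len(ordered)
    table = {}
    for p in percentiles:
        # Nearest rank = ceil(p * n / 100), computed with integers; at least 1.
        rank = max(1, (p * n + 99) // 100)
        table[p] = ordered[rank - 1]
    return table


# Hypothetical GQ values pooled across samples
gq_values = [30, 10, 20, 40]
gq_table = percentile_table(gq_values, percentiles=(10, 50, 90))
```

Comparing the table for the "raw" VCF file against the table for the filtered VCF file shows how much the filtering shifted the DP and GQ distributions.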
In step six, the filtering script does the following, based on the user parameters defined in the configuration file:
- Filters out indels and sites with more than two alleles
- Filters out the site if its quality score (QUAL) is missing or below CHS_QUAL_CUTOFF
- Filters out the site if more than CHS_MAX_HET samples are heterozygous
- Filters out the site if more than CHS_MAX_BAD samples are "bad" (missing, GQ below CHS_GQ_CUTOFF, or DP below CHS_DP_PER_SAMPLE_CUTOFF)
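The four checks listed above can be sketched as a single per-site decision function. This is an illustrative approximation, not the custom filtering script itself: the names `keep_site`, `is_het`, and `is_bad`, and the simplified representation of a sample call as a `(genotype, DP, GQ)` tuple, are assumptions. Treating a missing DP or GQ value as "bad" is also an assumption.

```python
from typing import Optional, Sequence, Tuple

# One sample's call at a site: (genotype string, DP, GQ). DP and GQ may be missing.
SampleCall = Tuple[str, Optional[int], Optional[int]]


def is_het(gt: str) -> bool:
    """True for a called diploid genotype with two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def is_bad(call: SampleCall, dp_cutoff: int, gq_cutoff: int) -> bool:
    """True if the genotype is missing, or DP or GQ is missing or below its cutoff."""
    gt, dp, gq = call
    if "." in gt:
        return True
    if dp is None or dp < dp_cutoff:
        return True
    return gq is None or gq < gq_cutoff


def keep_site(ref: str, alts: Sequence[str], qual: Optional[float],
              calls: Sequence[SampleCall], *, qual_cutoff: float,
              max_het: int, max_bad: int, dp_cutoff: int, gq_cutoff: int) -> bool:
    # 1. Drop indels and sites with more than two alleles (REF plus one ALT).
    if len(alts) != 1 or len(ref) != 1 or len(alts[0]) != 1:
        return False
    # 2. Drop sites whose QUAL is missing or below the cutoff.
    if qual is None or qual < qual_cutoff:
        return False
    # 3. Drop sites with too many heterozygous samples.
    if sum(is_het(call[0]) for call in calls) > max_het:
        return False
    # 4. Drop sites with too many "bad" samples.
    return sum(is_bad(c, dp_cutoff, gq_cutoff) for c in calls) <= max_bad


# Hypothetical site with four samples
calls = [("0/0", 20, 60), ("0/1", 15, 50), ("1/1", 3, 40), ("./.", None, None)]
params = dict(qual_cutoff=40, max_het=1, max_bad=2, dp_cutoff=5, gq_cutoff=30)
kept = keep_site("A", ["G"], 55.0, calls, **params)
```

In this example the site has one heterozygous sample and two "bad" samples (one with low depth, one missing), so it sits exactly at both limits and is kept.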
To run Create_HC_Subset, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Create_HC_Subset can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

    ./sequence_handling Create_HC_Subset Config

Where Config is the full file path to the configuration file.
The following is a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.
| Variable | Function |
|---|---|
| CHS_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| CHS_VCF_LIST | A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| CAPTURE_REGIONS | The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA". |
| CHS_DP_PER_SAMPLE_CUTOFF | The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
| CHS_GQ_CUTOFF | The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times. |
| CHS_MAX_BAD | The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
| CHS_MAX_HET | The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
| CHS_QUAL_CUTOFF | The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
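As a worked example of the recommended values above, the sample-count-dependent thresholds can be derived from the total number of samples. This is only a sketch: the function name `recommended_thresholds` is hypothetical, and rounding exact halves upward is an assumption, since the recommendation only says "rounded to the nearest whole number". CHS_GQ_CUTOFF is left out because it depends on the raw GQ percentile table rather than on the sample count.

```python
from typing import Dict


def recommended_thresholds(n_samples: int) -> Dict[str, int]:
    """Starting values for the numeric CHS_* variables, given the sample count."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return {
        "CHS_DP_PER_SAMPLE_CUTOFF": 5,
        "CHS_QUAL_CUTOFF": 40,
        # n * 0.2 and n * 0.9, rounded to the nearest whole number using
        # integer arithmetic to avoid floating-point surprises (halves round up).
        "CHS_MAX_BAD": (2 * n_samples + 5) // 10,
        "CHS_MAX_HET": (9 * n_samples + 5) // 10,
    }


# Hypothetical project with 96 samples
thresholds = recommended_thresholds(96)
```

For 96 samples this gives CHS_MAX_BAD of 19 (from 19.2) and CHS_MAX_HET of 86 (from 86.4). These are starting points, and they may need adjusting after inspecting the percentile tables.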
Create_HC_Subset generates a VCF file containing only the high-confidence variants. The VCF file can be found at

    ${OUT_DIR}/Create_HC_Subset/${PROJECT}_high_confidence_subset.vcf

Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file. Create_HC_Subset also calls on helper scripts that require R, GNU Parallel, and Python3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
Next: Variant_Recalibrator