Create_HC_Subset
The Create_HC_Subset handler creates a single variant call format (VCF) file that contains only the high-confidence sites for your samples. Filtering proceeds in several steps controlled by user-defined parameters, and percentile tables are generated before and after filtering. The steps are as follows:
1. The genomic part VCF files from Genotype_GVCFs are gzipped, while preserving the original files.
2. The gzipped files are merged using VCFtools into a single VCF file.
3. Insertions and deletions (indels) are filtered out using VCFtools.
4. If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. Otherwise, this step is skipped.
5. Percentile tables of DP per sample and GQ are generated for the "raw" VCF file.
6. A custom python3 script is run to filter out low-confidence sites based on the user parameters defined in the configuration file.
7. Percentile tables of DP per sample and GQ are generated for the filtered VCF file.
8. If the organism is barley, the parts positions are converted into pseudomolecular positions using a custom python3 script. Otherwise, this step is skipped.
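The percentile tables of DP per sample and GQ can be pictured with a minimal Python sketch. This is not the helper script that sequence_handling actually runs; the function names `parse_format_values` and `percentile_table`, the chosen percentiles, and the nearest-rank method are all assumptions made for illustration.

```python
import math


def parse_format_values(vcf_lines, key):
    """Collect the integer values of one FORMAT field (e.g. "DP" or "GQ")
    over every sample at every site. Missing values (".") are skipped."""
    values = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")
        if key not in fmt:
            continue
        idx = fmt.index(key)
        for sample in fields[9:]:
            parts = sample.split(":")
            if idx < len(parts) and parts[idx].isdigit():
                values.append(int(parts[idx]))
    return values


def percentile_table(values, percentiles=(10, 25, 50, 75, 90)):
    """Return {percentile: value} using the nearest-rank method
    (an assumed convention; other tools may interpolate)."""
    ordered = sorted(values)
    if not ordered:
        return {}
    table = {}
    for p in percentiles:
        rank = max(1, math.ceil(p * len(ordered) / 100))
        table[p] = ordered[rank - 1]
    return table
```

Running `percentile_table(parse_format_values(lines, "GQ"))` on the raw and on the filtered VCF gives the kind of before-and-after comparison described in the steps above.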
In step six, the filtering script does the following, based on the user parameters defined in the configuration file:
- Filters out indels and sites with more than two alleles
- If the quality score is missing or the site quality score is too low, filters out the site
- If too many samples are heterozygous, filters out the site
- If too many samples are "bad" (missing, low quality, or low depth), filters out the site
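The four filtering rules above can be sketched as a single per-site decision. This is a hedged illustration, not the filtering script shipped with sequence_handling: the function name `keep_site`, the order of the checks, and the way heterozygous and "bad" samples are counted are assumptions. The parameter names mirror the configuration variables QUAL_CUTOFF, GQ_CUTOFF, DP_PER_SAMPLE_CUTOFF, MAX_HET, and MAX_BAD.

```python
def keep_site(line, qual_cutoff, gq_cutoff, dp_cutoff, max_het, max_bad):
    """Return True if a tab-separated VCF data line passes all four rules."""
    fields = line.rstrip("\n").split("\t")
    ref, alt, qual = fields[3], fields[4], fields[5]

    # Rule 1: drop indels and sites with more than two alleles.
    if "," in alt or len(ref) != 1 or len(alt) != 1:
        return False

    # Rule 2: drop sites with a missing or low site quality score.
    if qual == "." or float(qual) < qual_cutoff:
        return False

    fmt = fields[8].split(":")
    het = 0
    bad = 0
    for sample in fields[9:]:
        call = dict(zip(fmt, sample.split(":")))
        alleles = call.get("GT", ".").replace("|", "/").split("/")
        if "." in alleles:
            bad += 1  # missing genotype counts as a "bad" sample
            continue
        if len(set(alleles)) > 1:
            het += 1
        dp = call.get("DP", ".")
        gq = call.get("GQ", ".")
        low_dp = not dp.isdigit() or int(dp) < dp_cutoff
        low_gq = not gq.isdigit() or int(gq) < gq_cutoff
        if low_dp or low_gq:
            bad += 1

    # Rules 3 and 4: too many heterozygous or "bad" samples.
    return het <= max_het and bad <= max_bad
```

In this sketch a sample that is both low depth and low quality is counted once as "bad"; the real script may count differently.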
To run Create_HC_Subset, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Create_HC_Subset can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):
    ./sequence_handling Create_HC_Subset Config

Where Config is the full file path to the configuration file.
The following is a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.
| Variable | Function |
|---|---|
| CHS_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| CHS_VCF_LIST | A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| CAPTURE_REGIONS | The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA". |
| DP_PER_SAMPLE_CUTOFF | The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
| GQ_CUTOFF | The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times. |
| MAX_BAD | The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
| MAX_HET | The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
| QUAL_CUTOFF | The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
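The recommended values for MAX_BAD and MAX_HET depend only on the number of samples, so they are easy to compute ahead of time. The sketch below is a small convenience calculation and not part of sequence_handling; the function name `recommended_thresholds` is hypothetical, and rounding exact halves upward is an assumption, since the table only says "rounded to the nearest whole number".

```python
def recommended_thresholds(n_samples):
    """Return the recommended MAX_BAD (20% of samples) and MAX_HET
    (90% of samples), rounded to the nearest whole number.

    Integer arithmetic is used so that ties round up and no
    floating-point error creeps in (assumed tie-breaking rule).
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    max_bad = (n_samples * 2 + 5) // 10  # n * 0.2, half-up
    max_het = (n_samples * 9 + 5) // 10  # n * 0.9, half-up
    return {"MAX_BAD": max_bad, "MAX_HET": max_het}
```

For example, calling `recommended_thresholds(96)` gives the values to enter for a 96-sample dataset. GQ_CUTOFF is not covered here because it has to be read from the raw GQ percentile table after a first run.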
Create_HC_Subset generates a VCF file containing only the high-confidence variants. The VCF file can be found at

    ${OUT_DIR}/Create_HC_Subset

Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file. Create_HC_Subset also calls on helper scripts that require R and Python3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
Next: Variant_Recalibrator