
Create_HC_Subset

Fernanda edited this page May 29, 2018 · 16 revisions

Basic Usage

The Create_HC_Subset handler creates a single variant call format (VCF) file that contains only the high-confidence (HC) sites for your samples. Filtering is performed in multiple steps using several user-defined parameters, and percentile tables are generated before and after filtering. The steps are as follows:

  1. The genomic part VCF files from Genotype_GVCFs are gzipped, while preserving the original files.
  2. The gzipped files are merged using VCFtools into a single VCF file.
  3. If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. Otherwise, this step is skipped.
  4. Insertions and deletions (indels) are filtered out using VCFtools.
  5. Percentile tables of DP per sample and GQ are generated for the "raw" VCF file.
  6. A custom Python 3 script is run to filter out low-confidence sites based on the user parameters defined in the configuration file.
  7. Percentile tables of DP per sample and GQ are generated for the filtered VCF file.
  8. If the organism is barley, the parts positions are converted into pseudomolecular positions using a custom Python 3 script. Otherwise, this step is skipped.
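
As a rough illustration of the percentile tables in steps five and seven, the sketch below computes a table from a list of per-sample values such as GQ or DP. It is a minimal example using the nearest-rank method. The handler's own helper scripts may define percentiles differently, and the function name percentile_table and the sample values are hypothetical.

```python
import math


def percentile_table(values, percentiles=(10, 25, 50, 75, 90)):
    """Return {percentile: value} using the nearest-rank method (illustrative only)."""
    ordered = sorted(values)
    table = {}
    for p in percentiles:
        # Nearest rank: the smallest value with at least p% of the data at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        table[p] = ordered[rank - 1]
    return table


# Hypothetical GQ values pulled from a raw VCF file.
gq_values = [5, 12, 20, 20, 33, 45, 60, 70, 88, 99]
gq_table = percentile_table(gq_values)
for p, value in gq_table.items():
    print(f"{p}th percentile: {value}")
```

Under these assumptions, the 10th-percentile row of the raw GQ table is the value that would be tried first for CHS_GQ_CUTOFF.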

In step six, the filtering script does the following, based on the user parameters defined in the configuration file:

  • Filters out indels and sites with more than two alleles
  • Filters out sites whose site quality score (QUAL) is missing or below CHS_QUAL_CUTOFF
  • Filters out sites where more than CHS_MAX_HET samples are heterozygous
  • Filters out sites where more than CHS_MAX_BAD samples are "bad" (missing genotype, GQ below CHS_GQ_CUTOFF, or DP below CHS_DP_PER_SAMPLE_CUTOFF)
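
The four checks above can be sketched in Python roughly as follows. This is an illustration of the logic, not the handler's actual filtering script. The function name keep_site, its argument names, and the decision to count heterozygosity only among samples with a called genotype are assumptions made for the example.

```python
def keep_site(line, qual_cutoff, gq_cutoff, dp_cutoff, max_bad, max_het):
    """Return True if one VCF data line passes all four checks (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    ref, alt, qual = fields[3], fields[4], fields[5]
    fmt, samples = fields[8].split(":"), fields[9:]

    # Check 1: drop indels and sites with more than two alleles.
    if "," in alt or len(ref) != 1 or len(alt) != 1:
        return False

    # Check 2: drop sites with a missing or low site quality score.
    if qual == "." or float(qual) < qual_cutoff:
        return False

    n_het = n_bad = 0
    for sample in samples:
        values = dict(zip(fmt, sample.split(":")))
        gt = values.get("GT", ".").replace("|", "/").split("/")
        if "." in gt:
            # A missing genotype counts as a "bad" sample.
            n_bad += 1
            continue
        if len(set(gt)) > 1:
            n_het += 1
        dp, gq = values.get("DP", "."), values.get("GQ", ".")
        if dp == "." or gq == "." or int(dp) < dp_cutoff or int(gq) < gq_cutoff:
            # Low or missing depth or genotype quality also counts as "bad".
            n_bad += 1

    # Checks 3 and 4: too many heterozygous or too many "bad" samples.
    return n_het <= max_het and n_bad <= max_bad


# Hypothetical site with four samples: one heterozygous, one missing, one low depth.
site = "chr1H\t100\t.\tA\tG\t50\t.\t.\tGT:DP:GQ\t0/0:10:30\t0/1:8:25\t./.:.:.\t1/1:2:40"
print(keep_site(site, qual_cutoff=40, gq_cutoff=20, dp_cutoff=5, max_bad=2, max_het=3))
```

With these example thresholds the site has two "bad" samples, so it is kept when max_bad is 2 and dropped when max_bad is 1.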

To run Create_HC_Subset, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Create_HC_Subset can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Create_HC_Subset Config

Where Config is the full file path to the configuration file.

Handler-Specific Variables

The following is a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

| Variable | Function |
|---|---|
| CHS_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| CHS_VCF_LIST | A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| CAPTURE_REGIONS | The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA". |
| CHS_DP_PER_SAMPLE_CUTOFF | The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
| CHS_GQ_CUTOFF | The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times. |
| CHS_MAX_BAD | The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
| CHS_MAX_HET | The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
| CHS_QUAL_CUTOFF | The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
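
The recommended values for CHS_MAX_BAD and CHS_MAX_HET depend on the number of samples, so a short calculation can help when filling in Config. The sketch below is only a convenience example. The function names are hypothetical, and rounding halves upward is an assumption, since the recommendation only says "rounded to the nearest whole number".

```python
import math


def round_half_up(x):
    """Round to the nearest whole number, with halves rounded up (an assumption)."""
    return math.floor(x + 0.5)


def recommended_thresholds(n_samples):
    """Apply the recommended multipliers (0.2 and 0.9) to the sample count."""
    return {
        "CHS_MAX_BAD": round_half_up(n_samples * 0.2),
        "CHS_MAX_HET": round_half_up(n_samples * 0.9),
    }


# Example for a hypothetical project with 96 samples.
print(recommended_thresholds(96))
# → {'CHS_MAX_BAD': 19, 'CHS_MAX_HET': 86}
```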

Output

Create_HC_Subset generates a VCF file containing only the high-confidence variants. The VCF file can be found at

${OUT_DIR}/Create_HC_Subset/${PROJECT}_high_confidence_subset.vcf

Dependencies

Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file. Create_HC_Subset also calls helper scripts that require R, GNU Parallel, and Python 3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
