Cristina Gonzalez, 12/04/2012 04:19 pm
VCF statistics

HPG Variant VCF Tools

Biologists receive so much data that they spend a lot of time cleaning it up before they can reach the subset they are actually interested in. HPG VCF Tools is a set of tools for preprocessing, filtering and manipulating VCF files. It aims to reduce the time spent on tedious preprocessing tasks.

Supported input formats

Splitting a VCF file

A VCF file can be split into a set of smaller VCF files according to a criterion. Each output file is itself a fully valid VCF file.

The most basic command-line for invoking this tool is:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Currently available criteria are:

  • By chromosome: Each output file will be named chromosome_N_your_vcf_file.vcf (N being the chromosome name) and will contain the entries from a single chromosome.
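
The grouping described above can be illustrated with a short Python sketch. This is not the tool's implementation; the function name `split_by_chromosome` is hypothetical, and repeating the whole header in every output is an assumption made so that each piece stays a valid VCF file.

```python
from collections import defaultdict

def split_by_chromosome(vcf_lines, vcf_name):
    """Group VCF lines by chromosome; return {output_name: [lines]}."""
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)              # meta-information and column header
        elif line.strip():
            chrom = line.split("\t", 1)[0]   # CHROM is the first column
            records[chrom].append(line)
    # Assumption: every output repeats the full header of the input file.
    return {
        f"chromosome_{chrom}_{vcf_name}": header + body
        for chrom, body in records.items()
    }

lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "2\t200\t.\tC\tT\t60\tPASS\t.",
    "1\t300\t.\tG\tA\t70\tPASS\t.",
]
outputs = split_by_chromosome(lines, "your_vcf_file.vcf")
for name in sorted(outputs):
    print(name, len(outputs[name]))
```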

Filtering entries from a VCF file

If you are only interested in the entries of a VCF file that satisfy certain criteria, you can apply a collection of filters to the input. Currently available filters are:

  • By region (--region, --region-file)
  • By correspondence to a SNP (--snp)
  • By quality (--quality)
  • By coverage (--coverage)
  • By minor allele frequency (MAF) (--maf)
  • By number of alleles (--alleles)

The most basic command-line for invoking this tool is:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40

Several filters can be applied at the same time:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --maf 0.02 --snp include

By default, only the entries that pass the filters are written to an output file, named your_vcf_file.vcf.filtered.
If you also want to save the entries that failed the tests, add the --save-rejected flag to your command-line. They will be written to a file named your_vcf_file.vcf.rejected.
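
As a rough illustration of how passing and rejected entries end up in separate files, the Python sketch below partitions records with a list of filter functions. It assumes that an entry must satisfy every filter to pass and that the quality threshold is inclusive; neither detail is stated on this page. The helpers `partition` and `output_names` are hypothetical.

```python
def partition(records, filters):
    """Split records into (passed, rejected); a record passes only if all filters accept it."""
    passed, rejected = [], []
    for record in records:
        if all(accepts(record) for accepts in filters):
            passed.append(record)
        else:
            rejected.append(record)
    return passed, rejected

def output_names(vcf_name, save_rejected=False):
    """Output file names as described in the text above."""
    names = [vcf_name + ".filtered"]
    if save_rejected:
        names.append(vcf_name + ".rejected")
    return names

# Toy records; the field names are illustrative only.
records = [
    {"id": "a", "qual": 50.0, "maf": 0.01},
    {"id": "b", "qual": 30.0, "maf": 0.01},
    {"id": "c", "qual": 60.0, "maf": 0.20},
]
filters = [
    lambda r: r["qual"] >= 40,   # assumption: threshold is inclusive
    lambda r: r["maf"] < 0.02,
]
passed, rejected = partition(records, filters)
print([r["id"] for r in passed], [r["id"] for r in rejected])
print(output_names("your_vcf_file.vcf", save_rejected=True))
```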

By region

Regions can be specified directly via the command-line argument --region, followed by a comma-separated list of regions. Each region must be described as chromosome:start-end, such as 1:12345-67890.
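
To make the region syntax concrete, here is a small Python sketch that parses a comma-separated list of chromosome:start-end regions and tests whether a position falls inside one of them. The function names are hypothetical, and treating both bounds as inclusive is an assumption rather than documented behaviour of the tool.

```python
def parse_regions(spec):
    """Parse 'chrom:start-end,chrom:start-end' into (chrom, start, end) tuples."""
    regions = []
    for item in spec.split(","):
        chrom, _, span = item.strip().rpartition(":")
        start, _, end = span.partition("-")
        regions.append((chrom, int(start), int(end)))
    return regions

def in_any_region(chrom, pos, regions):
    # Assumption: start and end are both inclusive.
    return any(c == chrom and s <= pos <= e for c, s, e in regions)

regions = parse_regions("1:12345-67890,X:100-200")
print(regions)
print(in_any_region("1", 20000, regions))
print(in_any_region("2", 20000, regions))
```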

If the list of regions is stored in a GFF file, it can be referenced using the --region-file option, followed by the name of the file, such as:

hpg-var-vcf filter -v your_vcf_file.vcf --region-file your_regions_file.gff

By correspondence to a SNP

To include (or exclude) SNPs from the input file, add the --snp include (or --snp exclude) command-line option, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --snp exclude

By quality / coverage

The minimum accepted quality and coverage can be specified via the --quality and --coverage options, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --coverage 30

By minor allele frequency (MAF)

The maximum accepted MAF can be specified via the --maf option, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --maf 0.02

This way, only positions with a MAF < 0.02 will pass the filter.
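
For readers unfamiliar with the measure, the following Python sketch computes the MAF of a bi-allelic site from genotype strings and applies the strict "MAF < threshold" test described above. It is only an illustration under stated assumptions: the site is bi-allelic, missing alleles (".") are ignored, and the function names are hypothetical. How the tool itself handles multi-allelic sites or missing data is not covered here.

```python
import re

def minor_allele_frequency(genotypes):
    """MAF of a bi-allelic site from genotype strings such as '0/1' or '1|1'."""
    alleles = [a for gt in genotypes for a in re.split(r"[/|]", gt) if a != "."]
    if not alleles:
        return 0.0
    alt_freq = alleles.count("1") / len(alleles)
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)

def passes_maf(genotypes, max_maf):
    # Strict comparison, mirroring "MAF < 0.02" in the text above.
    return minor_allele_frequency(genotypes) < max_maf

samples = ["0/0"] * 49 + ["0/1"]   # 1 alternate allele out of 100
print(minor_allele_frequency(samples))
print(passes_maf(samples, 0.02))
```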

By number of alleles

Variants may be bi-allelic or multi-allelic. To select only the variants with a given number of alleles, use a command line such as:

hpg-var-vcf filter -v your_vcf_file.vcf --alleles 3
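
One plausible way to count the alleles of a variant is to add the reference allele to the comma-separated entries of the ALT column, so that a bi-allelic site has 2 alleles. The Python sketch below follows that convention; whether --alleles counts the reference allele in the same way is an assumption, and `allele_count` is a hypothetical helper.

```python
def allele_count(ref, alt):
    """Number of alleles at a site: the reference plus each ALT entry."""
    # ref is part of the signature for readability; it always contributes one allele.
    alts = [a for a in alt.split(",") if a and a != "."]
    return 1 + len(alts)

print(allele_count("A", "G"))      # bi-allelic site
print(allele_count("A", "G,T"))    # multi-allelic site
```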

Getting statistics from a VCF file

This tool reports statistics for each variant and each sample in the VCF file, as well as statistics for the file as a whole. A command line for invoking this tool is:

hpg-var-vcf stats -v your_vcf_file.vcf --variants --samples

If you just want to retrieve statistics per sample, the command-line will be:

hpg-var-vcf stats -v your_vcf_file.vcf --samples

If neither flag is provided, statistics about variants are retrieved by default. If those are all you need, you can omit the flags:

hpg-var-vcf stats -v your_vcf_file.vcf

Global statistics

  • Number of variants
  • Number of samples
  • Number of bi-allelic sites
  • Number of multi-allelic sites
  • Number of SNPs
  • Number of indels
  • Number of transitions
  • Number of transversions
  • Ti/Tv ratio
  • Percentage of PASS
  • Average quality in the VCF
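
The transition and transversion counts in the list above can be illustrated with a short Python sketch. A transition swaps two purines (A, G) or two pyrimidines (C, T); any other single-base substitution is a transversion. The code assumes every input pair is a single-base substitution with different reference and alternate bases, and the function names are hypothetical rather than part of the tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def ti_tv_ratio(snps):
    """Return (transitions, transversions, ratio) for (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    ratio = ti / tv if tv else None   # ratio is undefined without transversions
    return ti, tv, ratio

snps = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(snps))
```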

Statistics per variant

  • Allelic counts and frequencies
  • Genotypic counts and frequencies
  • Number of missing alleles and genotypes
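
As an illustration of the per-variant figures listed above, the Python sketch below derives allelic and genotypic counts, allele frequencies and missing-data counts from a list of genotype strings. It is a simplified sketch, not the tool's code: it assumes a genotype counts as missing when any of its alleles is ".", it treats phased and unphased genotypes alike, and `variant_stats` is a hypothetical name.

```python
import re
from collections import Counter

def variant_stats(genotypes):
    """Allele/genotype counts and missing-data counts for one variant."""
    allele_counts = Counter()
    genotype_counts = Counter()
    missing_alleles = 0
    missing_genotypes = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        n_missing = alleles.count(".")
        missing_alleles += n_missing
        if n_missing:
            # Assumption: any missing allele makes the genotype missing.
            missing_genotypes += 1
        else:
            # Order-independent key, so 0/1 and 1|0 are counted together.
            genotype_counts["/".join(sorted(alleles, key=int))] += 1
        allele_counts.update(a for a in alleles if a != ".")
    total = sum(allele_counts.values())
    allele_freqs = {a: n / total for a, n in allele_counts.items()} if total else {}
    return {
        "allele_counts": dict(allele_counts),
        "allele_freqs": allele_freqs,
        "genotype_counts": dict(genotype_counts),
        "missing_alleles": missing_alleles,
        "missing_genotypes": missing_genotypes,
    }

stats = variant_stats(["0/0", "0/1", "1|0", "./.", "1/1"])
print(stats)
```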

Statistics per sample

  • Missing genotypes

Merging multiple VCF files

Other tool suites