Cristina Gonzalez, 12/04/2012 11:01 am
Introduction to filtering

HPG Variant VCF Tools

Biologists receive so much data that they spend a lot of time cleaning it up just to isolate the part they are interested in. HPG VCF Tools is a set of tools for preprocessing, filtering and manipulating VCF files. It aims to reduce the time spent on tedious preprocessing tasks.

Supported input formats

Splitting a VCF file

A VCF file can be split into a set of smaller files according to a criterion. Each output file is itself a fully valid VCF file.

The most basic command line for invoking this tool is:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Currently available criteria are:

  • By chromosome: Each output file will be named chromosome_N_your_vcf_file.vcf (N being the chromosome name) and will contain the entries from a single chromosome.
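To illustrate what splitting by chromosome means, here is a minimal Python sketch. It is not how hpg-var-vcf is implemented; the function names `split_by_chromosome` and `output_name` are hypothetical, and the sketch assumes the input is a list of tab-separated VCF lines. Header lines are copied into every group so that each output stays a valid VCF file.

```python
from collections import defaultdict


def split_by_chromosome(vcf_lines):
    """Group VCF lines by chromosome (hypothetical helper, not the tool's code).

    Header lines (starting with '#') are repeated in every group so that
    each group is a complete VCF on its own.
    """
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            # CHROM is the first tab-separated column of a VCF record
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + body for chrom, body in records.items()}


def output_name(chrom, vcf_name):
    """Build the output file name following the naming scheme described above."""
    return f"chromosome_{chrom}_{vcf_name}"
```

For example, `output_name("1", "your_vcf_file.vcf")` gives `chromosome_1_your_vcf_file.vcf`.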

Filtering entries from a VCF file

If you are only interested in the entries of a VCF file that satisfy certain criteria, you can apply a collection of filters to the input. Currently available filters are:

  • By region
  • By region (from a GFF file)
  • By correspondence to a SNP
  • By minimum quality
  • By minimum coverage
  • By minor allele frequency (MAF)
  • By number of alleles

The most basic command line for invoking this tool is:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40

Several filters can be applied at the same time:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --maf 0.02 --snp include

By default, only the entries that pass all the filters are written to an output file, named your_vcf_file.vcf.filtered.
If you also want to save the entries that failed any of the filters, add the --save-rejected flag to your command line. They will be written to a file named your_vcf_file.vcf.rejected.
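The passed/rejected behaviour can be pictured with a short Python sketch of a minimum-quality filter. This is only an illustration under stated assumptions, not the tool's actual code: the function name `filter_by_quality` is hypothetical, and treating a missing QUAL value (".") as a failure is an assumption that may differ from what hpg-var-vcf does.

```python
def filter_by_quality(vcf_lines, min_quality):
    """Separate VCF lines into passed and rejected lists by QUAL (illustrative only).

    Header lines are kept in both outputs so each one remains a valid VCF.
    """
    passed, rejected = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            passed.append(line)
            rejected.append(line)
            continue
        if not line.strip():
            continue
        # QUAL is the sixth tab-separated column of a VCF record
        qual = line.split("\t")[5]
        # Assumption: entries with a missing quality (".") are rejected
        if qual != "." and float(qual) >= min_quality:
            passed.append(line)
        else:
            rejected.append(line)
    return passed, rejected
```

In this picture, `passed` corresponds to the contents of your_vcf_file.vcf.filtered and `rejected` to the contents of your_vcf_file.vcf.rejected.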

Getting statistics from a VCF file

General stats

  • Number of variants
  • Number of samples
  • Number of bi-allelic sites
  • Number of multi-allelic sites
  • Number of SNPs
  • Number of indels
  • Number of transitions
  • Number of transversions
  • Ti/Tv ratio (transitions divided by transversions)
  • Percentage of entries marked PASS
  • Average quality of the entries in the VCF file
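As a reminder of how the transition/transversion figures relate, here is a small Python sketch. A transition swaps two purines (A and G) or two pyrimidines (C and T); any other single-base substitution is a transversion. The function names are hypothetical, and the input is assumed to be a list of (REF, ALT) pairs already restricted to single-base SNPs; how the tool itself handles multi-allelic sites is not covered here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True if the substitution stays within purines or within pyrimidines."""
    ref, alt = ref.upper(), alt.upper()
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Compute the Ti/Tv ratio for a list of (ref, alt) single-base SNPs.

    Returns None when there are no transversions, since the ratio is undefined.
    """
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return None
    return transitions / transversions
```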

Statistics per variant

  • Allelic and genotypic counts and frequencies per variant
  • Number of missing alleles and genotypes
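The per-variant figures can be sketched in Python as follows. This is an illustration only, with a hypothetical function name `variant_stats`; it assumes the genotypes of one variant are given as GT strings such as "0/1", "1|1" or "./.", and it counts genotypes without regard to phasing or allele order, which may not match the tool's conventions.

```python
import re
from collections import Counter


def variant_stats(genotypes):
    """Count alleles and genotypes for one variant (illustrative sketch).

    genotypes: list of GT strings, with '.' standing for a missing allele.
    """
    allele_counts = Counter()
    genotype_counts = Counter()
    missing_alleles = 0
    missing_genotypes = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if "." in alleles:
            # A genotype with at least one missing allele counts as missing
            missing_genotypes += 1
            missing_alleles += alleles.count(".")
        else:
            # Sort so that 0/1 and 1|0 are counted as the same genotype
            genotype_counts["/".join(sorted(alleles, key=int))] += 1
        allele_counts.update(a for a in alleles if a != ".")
    total = sum(allele_counts.values())
    allele_freqs = {a: n / total for a, n in allele_counts.items()} if total else {}
    return {
        "allele_counts": dict(allele_counts),
        "allele_freqs": allele_freqs,
        "genotype_counts": dict(genotype_counts),
        "missing_alleles": missing_alleles,
        "missing_genotypes": missing_genotypes,
    }
```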

Statistics per sample

Merging multiple VCF files

Feature plan

Split, filter and statistics tools will be enriched with more options. For more information, see the detailed feature plan.

Other tool suites