Cristina Gonzalez, 12/04/2012 04:12 pm
Filtering of VCF files


HPG Variant VCF Tools

Biologists often receive so much data that they spend a large amount of time cleaning it up to extract the subset they are interested in. HPG Variant VCF Tools is a set of tools for preprocessing, filtering and manipulating VCF files. It aims to reduce the time spent on tedious preprocessing tasks.

Supported input formats

Splitting a VCF file

A VCF file can be split into a set of smaller files according to a criterion. Each output file is a fully valid VCF file.

The most basic command line for invoking this tool is:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Currently available criteria are:

  • By chromosome: Each output file will be named chromosome_N_your_vcf_file.vcf (N being the chromosome name) and will contain the entries from a single chromosome.
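To illustrate the naming scheme and why each output remains a valid VCF file, here is a minimal Python sketch of the same idea. It is not the tool's implementation: the function name split_by_chromosome is hypothetical, and copying the full header into every output is an assumption made so that each piece stands on its own.

```python
from collections import defaultdict


def split_by_chromosome(vcf_lines, vcf_name):
    """Group VCF lines by chromosome and return {output_name: lines}.

    Header lines (starting with '#') are copied into every output so that
    each one is a self-contained VCF file (assumption, see lead-in).
    """
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            # CHROM is the first tab-separated column of a VCF record
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {
        f"chromosome_{chrom}_{vcf_name}": header + lines
        for chrom, lines in records.items()
    }
```

Calling split_by_chromosome(lines, "your_vcf_file.vcf") on a file with entries from chromosomes 1 and 2 yields the keys chromosome_1_your_vcf_file.vcf and chromosome_2_your_vcf_file.vcf.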

Filtering entries from a VCF file

If you are only interested in the entries of a VCF file that satisfy certain criteria, you can apply a collection of filters to the input. Currently available filters are:

  • By region (--region, --region-file)
  • By correspondence to a SNP (--snp)
  • By quality (--quality)
  • By coverage (--coverage)
  • By minor allele frequency (MAF) (--maf)
  • By number of alleles (--alleles)

The most basic command line for invoking this tool is:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40

Several filters can be applied at the same time:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --maf 0.02 --snp include

By default, only the entries that pass the filters are written to an output file, named your_vcf_file.vcf.filtered.
If you also want to save the entries that failed the tests, add the --save-rejected flag to your command line. They will be written to a file named your_vcf_file.vcf.rejected.
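The split between accepted and rejected entries can be pictured with a small Python sketch that uses a quality threshold as the only filter. This is an illustration under stated assumptions, not the tool's code: the function name partition_by_quality is hypothetical, the comparison is assumed to be "greater than or equal to", and records with a missing QUAL value (".") are assumed to fail.

```python
def partition_by_quality(records, min_quality):
    """Return (passed, rejected) lists of VCF record lines.

    QUAL is the sixth tab-separated column of a VCF record.
    """
    passed, rejected = [], []
    for line in records:
        qual = line.split("\t")[5]
        # Assumption: a missing quality value never passes the filter
        ok = qual != "." and float(qual) >= min_quality
        (passed if ok else rejected).append(line)
    return passed, rejected
```

In this picture, the first list corresponds to the contents of your_vcf_file.vcf.filtered and the second to your_vcf_file.vcf.rejected.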

By region

Regions can be specified directly via the command-line argument --region, followed by a comma-separated list of regions. Each region must be described as chromosome:start-end, such as 1:12345-67890.
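As a rough illustration of the chromosome:start-end format, the following Python sketch parses a comma-separated region list and checks whether a position falls inside any region. The function names are hypothetical, and treating both ends of a region as inclusive is an assumption, not documented behaviour of the tool.

```python
def parse_regions(spec):
    """Parse '1:12345-67890,X:1-100' into [(chrom, start, end), ...]."""
    regions = []
    for item in spec.split(","):
        chrom, _, span = item.strip().partition(":")
        start, _, end = span.partition("-")
        regions.append((chrom, int(start), int(end)))
    return regions


def in_any_region(chrom, pos, regions):
    """True if (chrom, pos) lies inside at least one region (ends inclusive)."""
    return any(c == chrom and s <= pos <= e for c, s, e in regions)
```

For example, parse_regions("1:12345-67890") returns a single tuple with chromosome "1", start 12345 and end 67890.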

If the list of regions is stored in a GFF file, it can be referenced using the --region-file option, followed by the name of the file, such as:

hpg-var-vcf filter -v your_vcf_file.vcf --region-file your_regions_file.gff

By correspondence to a SNP

To include or exclude SNPs from the input file, add the --snp include or --snp exclude command-line option, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --snp exclude

By quality / coverage

Minimum accepted quality and coverage can be specified via the --quality and --coverage options, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --coverage 30

By minor allele frequency (MAF)

Maximum accepted MAF can be specified via the --maf option, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --maf 0.02

This way, only positions with a MAF < 0.02 will pass the filter.
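To make the threshold concrete, here is a minimal Python sketch of one way to compute a MAF from the genotypes of a single variant and compare it with 0.02. It is not the tool's implementation: the function names are hypothetical, and MAF is assumed here to be the frequency of the least frequent allele observed, with missing alleles (".") ignored.

```python
import re
from collections import Counter


def minor_allele_frequency(sample_fields):
    """Compute the MAF from per-sample fields such as '0/1' or '0|1:35'."""
    counts = Counter()
    for field in sample_fields:
        genotype = field.split(":")[0]  # GT is the first sub-field
        for allele in re.split(r"[/|]", genotype):
            if allele != ".":  # skip missing alleles
                counts[allele] += 1
    total = sum(counts.values())
    # A site with fewer than two observed alleles has no minor allele
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


def passes_maf(sample_fields, max_maf):
    """Strict comparison, matching 'MAF < 0.02' in the text above."""
    return minor_allele_frequency(sample_fields) < max_maf
```

With 50 samples genotyped 0/0 and one sample genotyped 0/1, the minor allele is seen once among 102 alleles, so the variant passes a 0.02 threshold.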

By number of alleles

Variants may be bi-allelic or multi-allelic. To select only the variants with a certain number of alleles, use the --alleles option, as in:

hpg-var-vcf filter -v your_vcf_file.vcf --alleles 3
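As an illustration of what "number of alleles" means for a VCF record, the following Python sketch counts the reference allele plus each alternate allele listed in the ALT column. The function name allele_count is hypothetical, and whether the tool counts alleles exactly this way is an assumption.

```python
def allele_count(record_line):
    """Return the number of alleles of a VCF record (REF plus ALT alleles)."""
    alt = record_line.split("\t")[4]  # ALT is the fifth column
    if alt == ".":
        # No alternate allele recorded: only the reference allele
        return 1
    return 1 + len(alt.split(","))
```

Under this counting, a record with REF "A" and ALT "G,T" has three alleles, while a record with a single ALT allele is bi-allelic.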

Getting statistics from a VCF file

General stats

  • Number of variants
  • Number of samples
  • Number of bi-allelic sites
  • Number of multi-allelic sites
  • Number of SNPs
  • Number of indels
  • Number of transitions
  • Number of transversions
  • Ti/Tv ratio
  • Percentage of PASS
  • Average quality in the VCF
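Several of the statistics above follow from simple definitions. As an example, this Python sketch computes the transition/transversion ratio from a list of (REF, ALT) pairs. It assumes the input contains only SNPs with REF different from ALT; the function names are hypothetical and the code is not taken from the tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snps):
    """Return transitions / transversions for a list of (ref, alt) SNPs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    # Avoid division by zero when no transversions are present
    return transitions / transversions if transversions else float("inf")
```

For the three SNPs A>G, C>T and A>C, there are two transitions and one transversion, giving a ratio of 2.0.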

Statistics per variant

  • Allelic and genotypic counts and frequencies per variant
  • Number of missing alleles and genotypes

Statistics per sample

Merging multiple VCF files

Feature plan

Split, filter and statistics tools will be enriched with more options. For more information, see the detailed feature plan.

Other tool suites