HPG Variant VCF Tools

Biologists receive so much biological data that they have to spend a lot of time cleaning it up in order to get just the data they are interested in. HPG VCF Tools is a set of tools for preprocessing, filtering and manipulating VCF files. It aims to avoid excessive time consumption in tedious preprocessing tasks.

Supported input formats

Splitting a VCF file

A set of VCF files can be created by splitting one by a criterion. Each one of the output files is a fully valid VCF file.

The most basic command-line for invoking this tool is:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Currently available criteria are:

  • By chromosome: Each output file will be named chromosome_N_your_vcf_file.vcf (N being the chromosome name) and will contain the entries from a single chromosome.

Filtering entries from a VCF file

If you are only interested in the entries of a VCF file that satisfy certain criteria, you can apply a collection of filters to the input. Currently available filters are:

  • By region (--region, --region-file)
  • By correspondence to a SNP (--snp)
  • By quality (--quality)
  • By coverage (--coverage)
  • By minimum allele frequency (MAF) (--maf)
  • By number of alleles (--alleles)

The most basic command-line for invoking this tool is:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40

Several filters can be applied at the same time:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --maf 0.02 --snp include

By default, only the entries that pass the filters are written to an output file, named your_vcf_file.vcf.filtered.
If you also want to save the entries that failed the tests, add the --save-rejected flag to your command-line. They will be written to a file named your_vcf_file.vcf.rejected.

By region

Regions can be specified directly via the command-line argument --region, followed by a comma-separated list of regions. Each region must be described as chromosome:start-end, such as 1:12345-67890.

If the list of regions is stored in a GFF file, it can be referenced using the --region-file option, followed by the name of the file, such as:

hpg-var-vcf filter -v your_vcf_file.vcf --region-file your_regions_file.gff

By correspondence to a SNP

Whether you want to include (or exclude) SNP from the input file, you must add the --snp include (or --snp exclude) command-line option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --snp exclude

By quality / coverage

Minimum accepted quality and coverage can be specified via the --quality and --coverage options, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --coverage 30

By minimum-allele frequency (MAF)

Maximum accepted MAF can be specified via the --maf option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --maf 0.02

This way, only positions with a MAF < 0.02 will pass the filter.

By number of alleles

Variants may be bi-allelic or multi-allelic. In case you want to choose the variants with a certain number of alleles, the command-line would be something like:

hpg-var-vcf filter -v your_vcf_file.vcf --alleles 3

Getting statistics from a VCF file

This tool reports statistics for each variant and sample in the VCF file, as well as statistics referred to the whole file. A command-line for invoking this tool is:

hpg-var-vcf stats -v your_vcf_file.vcf --variants --samples

If you just want to retrieve statistics per sample, the command-line will be:

hpg-var-vcf stats -v your_vcf_file.vcf --samples

If no input arguments are provided, statistics about variants will be retrieved. If you just want to report those, you can omit the specific flags:

hpg-var-vcf stats -v your_vcf_file.vcf

Global statistics

  • Number of variants
  • Number of samples
  • Number of bi-allelic sites
  • Number of multi-allelic sites
  • Number of SNP
  • Number of indels
  • Number of transitions
  • Number of transversions
  • Ti/TV ratio
  • Percentage of PASS
  • Average quality in the VCF

Statistics per variant

  • Allelic counts and frequencies
  • Genotypic counts and frequencies
  • Number of missing alleles and genotypes

Statistics per sample

  • Missing genotypes

Merging multiple VCF files

This tool allows to merge multiple VCF files, for example, containing different samples of the same experiment. The simplest command-line you may use is:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf

A new file named merge_from_N_files.vcf (N being the number of files merged) will be created in the output directory. The headers from all files are written to the new one.

Regarding variants (the most interesting part of the file), multi-sample and multi-allelic variants are accepted. In order to preserve coherence, several operations must be performed. For a better understanding of these operations, see the VCF specification.

  • When a multi-allelic variant is found, the different alleles are merged and the numerical indices that identify them recalculated.
  • The FILTER column is the union of the filters applied to a position.
  • Many of the standard INFO fields (such as AF, AN and DP) can be recalculated based on the merged data.
  • The FORMAT column is the union of all the FORMATs from the source files.
  • Samples are the most complex part. Their fields are structured according to the new FORMAT. Since the FILTER and INFO columns from the corresponding source file will be lost when recalculated, they can be copied to the samples on user demand.

Filling missing data

As stated before, sample fields in the output file are structured according to the union of all FORMATs. If one of these fields was absent in the source file, it is filled with the missing symbol.

If a sample is absent in all files for a certain position (e.g., does not have a variant in that position), the user can choose whether to mark them as equal to the reference genome or missing. This can be done with the --missing-mode option, which can get the values "missing" or "reference". The "missing" value will fill them with ./., whereas "reference" will be the same as 0/0. By default, "missing" is marked.

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --missing-mode reference

Recalculating the INFO field

Many of the standard INFO fields can be recalculated based on the merged data. For this purpose, the --info-fields options must be used:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --info-fields DP,H2,NS

At the moment, the following fields can be recalculated:

  • AC: Allele count in genotypes, for each ALT allele, in the same order as listed
  • AF: Allele frequency for each ALT allele in the same order as listed
  • AN: Total number of alleles in called genotypes
  • DB: dbSNP membership
  • DP: Combined depth across samples
  • H2: Membership in HapMap2
  • H3: Membership in HapMap3
  • MQ: RMS mapping quality
  • MQ0: Number of MAPQ == 0 reads covering this record
  • NS: Number of samples with data
  • QD: Quality by depth (based on GATK)
  • SOMATIC: Somatic mutation, for cancer genomics
  • VALIDATED: Validated by follow-up experiment

Copying old FILTER and INFO fields

Since the FILTER and INFO columns from the corresponding source file will be lost when recalculated, they can be copied to the samples on user demand. In order to do that, the user may use one or both options, like illustrated in the following example:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --copy-filter --copy-info

Warning: The size of the output file will grow exponentially when old FILTER and INFO values are inserted!

Redirecting the output

By default, HPG Variant saves its output in the /tmp/variant/ folder. If you want to use a different one, you can set the --outdir flag, or edit the global/outdir property of the hpg-variant.conf file stored in your $(HOME)/.hpg-variant folder. This file will be created the first time you run HPG Variant, so you can run it once with the --outdir flag set, then edit the global/outdir entry of the configuration file.

Warning: If you run HPG Variant several times in a row, don't forget to save the results in different folders! The contents of the output folder are overwritten with each execution.

Advanced configuration

There are some other options that may improve the performance of HPG Variant but are not recommended to tweak without basic knowledge of the machine architecture where the application runs. These are:

max-batches: The maximum number of groups of lines that can be loaded at the same time in main memory.
batch-lines: Maximum number of lines each group will contained.
num-threads: Number of threads that will be running the statistical tests at the same time. It should be the number of cores in your machine.
mmap-vcf: If set to false, the FILE* C API will be used for file reading. Otherwise, the file will be memory-mapped, increasing performance.

Other tools suites