HPG Variant GWAS

This tool includes parallel implementations of several genomic analysis. During the past years, the application of these analysis in DNA microarray-based studies has been determinant in the identification of causal variants. Unfortunately, new sequencing technologies are producing such amount of data that it cannot be easily analyzed using traditional programs.

HPG Variant GWAS tool takes as a reference some tests available in PLINK, improving their performance both in execution time and memory usage, as well as storing genotypical information following the VCF format instead of PED format. The implemented features include case-control association analysis such as chi-square and Fisher's exact test, and family-based analysis such as transmission disequilibrium test (TDT). Mendelian errors are automatically removed during the analysis.

Supported input formats

If you have both genotypical and phenotypical information stored in PED & MAP files, you can transform them to VCF using the following script: ped2vcf.sh
The first argument of the script is the name of the PED & MAP files (without file extension), and the second one is the name of the output VCF file. Make sure that PLINK and PlinkSeq are installed in your system and can be accessed via the plink and pseq commands!

Basic usage

Currently, two different kinds of analysis can be conducted: case-control and family-based ones. In both cases, a VCF file and PED file must be provided as input. The VCF file contains the genotypical information, whereas the PED file contains the phenotypes and familial relationships (the first 6 columns).

Note: The VCF file can also be compressed with GZIP in tar.gz format.

Case-control association can be based on a chi-square or a Fisher's exact test. In both cases, the command-line is very similar:

hpg-var-gwas assoc --chisq -v your_vcf_file.vcf -p your_ped_file.ped
hpg-var-gwas assoc --fisher -v your_vcf_file.vcf -p your_ped_file.ped

In case you want to run a transmission-disequilibrium test (TDT), the command-line would be:

@hpg-var-gwas tdt --vcf-file your_vcf_file.vcf --ped-file your_ped_file.ped

First, the PED file will be read, then the VCF file will be processed in sets of lines. Every time a request for one of these sets is analyzed, a pack of lines similar to the following will appear. The number of lines will differ depending on how you configured Variant GWAS (more info in the Advanced configuration section below), but the default will probably be 4.

Mon Dec  3 18:27:54 2012        INFO    src/gwas/assoc/assoc_runner.c [158] in run_association_test(): Batch 0 reached by thread 2 - 196/200 records 
Mon Dec 3 18:27:54 2012 INFO src/gwas/assoc/assoc_runner.c [158] in run_association_test(): Batch 0 reached by thread 3 - 200/200 records
Mon Dec 3 18:27:54 2012 INFO src/gwas/assoc/assoc_runner.c [158] in run_association_test(): Batch 0 reached by thread 1 - 200/200 records
Mon Dec 3 18:27:54 2012 INFO src/gwas/assoc/assoc_runner.c [158] in run_association_test(): Batch 0 reached by thread 0 - 200/200 records
The output of the statistical tests will be named, depending on the test you've run:
  • hpg_variant.chisq
  • hpg_variant.fisher
  • hpg_variant.tdt

Chi-square test output

#CHR         POS       A1      C_A1    C_U1         F_A1            F_U1       A2      C_A2    C_U2         F_A2            F_U2              OR           CHISQ         P-VALUE
1 742429 C 15 28 0.156250 0.144330 T 81 166 0.843750 0.855670 1.097884 0.072255 0.788082
1 767376 A 0 1 0.000000 0.005208 G 98 191 1.000000 0.994792 0.000000 0.512183 0.474195
1 769185 G 15 24 0.153061 0.122449 A 83 172 0.846939 0.877551 1.295181 0.532127 0.465714
1 775852 A 18 30 0.187500 0.156250 G 78 162 0.812500 0.843750 1.246154 0.450000 0.502335

The first two columns are the chromosome and position where a mutation occurs.

Each allele is then described by its bases, the times it appeared in affected (A) or unaffected (U) samples, and the frequency. In case of position 1:742429, allele C was found 15 times in affected samples and 28 times in unaffected ones; allele T was found in 81 and 166, respectively. The 15 affected samples where A was found are a 15.6% of the sum of these values.

Given these counts and frequencies, the odds-ratio (OR), chi-square statistical value (CHISQ) and p-value are calculated.

Fisher's exact test output

#CHR         POS       A1      C_A1    C_U1         F_A1            F_U1       A2      C_A2    C_U2         F_A2            F_U2              OR         P-VALUE
1 742429 C 15 28 0.156250 0.144330 T 81 166 0.843750 0.855670 1.097884 0.860884
1 767376 A 0 1 0.000000 0.005208 G 98 191 1.000000 0.994792 0.000000 1.000000
1 769185 G 15 24 0.153061 0.122449 A 83 172 0.846939 0.877551 1.295181 0.470438
1 775852 A 18 30 0.187500 0.156250 G 78 162 0.812500 0.843750 1.246154 0.506363

The first two columns are the chromosome and position where a mutation occurs.

Each allele is then described by its bases, the times it appeared in affected (A) or unaffected (U) samples, and the frequency. In case of position 1:742429, allele C was found 15 times in affected samples and 28 times in unaffected ones; allele T was found in 81 and 166, respectively. The 15 affected samples where A was found are a 15.6% of the sum of these values.

Given these counts and frequencies, the odds-ratio (OR) and p-value are calculated.

Transmission-disequilibrium test output

#CHR         POS       A1      A2         T       U           OR           CHISQ         P-VALUE
1 742429 C T 11 9 1.222222 0.200000 0.654721
1 3502376 A G 3 11 0.272727 4.571429 0.032509
1 4396519 A G 12 18 0.666667 1.200000 0.273322
1 4396566 G A 2 1 2.000000 0.333333 0.563703
1 4964065 G C 24 28 0.857143 0.307692 0.579100

The first two columns are the chromosome and position where a mutation occurs.

Both alleles are then shown, as well as the times the first allele was transmitted (T) or untransmitted (U). In case of position 1:742429, allele C was transmitted 15 times and not transmitted 9 times.

Given these counts, the odds-ratio (OR), chi-square statistical value (CHISQ) and p-value are calculated.

Setting the species to analyze

Although Variant GWAS is by default configured to work with the human genome, you can change this setting to your preferred species. There are two ways to do this:

  • If you just want to run the program once for that species, the easiest way is to set the --species flag. The list of species you can analyze is available in http://ws.bioinfo.cipf.es/cellbase/rest/latest/. The value of the flag must be the first column of the table (hsa, mmu, ssc...).
  • If you always analyze one specific species, it is faster to edit the hpg-variant.conf file stored in your $(HOME)/.hpg-variant folder. This file will be created the first time you run HPG Variant, so you can run it once with the --species flag set, then edit the global/species entry of the configuration file.

Filtering the input

Sometimes, the dataset can be populated with irrelevant or low-quality data. These data can be removed at the same time as the file is read by applying some filters. The list of implemented filters is available at the HPG Variant VCF Tools wiki page.

In addition to the output files described above, when a filter is applied, two more are also created. One of them contains the list of variants that passed the filter and the other contains the ones that didn't. By default, they are named after the input file, plus a .filtered or .rejected suffix. If you want to change the name of these files, please set the --out flag in your command-line.

Redirecting the output

By default, HPG Variant saves its output in the /tmp/variant/ folder. If you want to use a different one, you can set the --outdir flag, or edit the global/outdir property of the hpg-variant.conf file stored in your $(HOME)/.hpg-variant folder. This file will be created the first time you run HPG Variant, so you can run it once with the --outdir flag set, then edit the global/outdir entry of the configuration file.

Warning: If you run HPG Variant several times in a row, don't forget to save the results in different folders! The contents of the output folder are overwritten with each execution.

Advanced configuration

There are some other options that may improve the performance of HPG Variant but are not recommended to tweak without basic knowledge of the machine architecture where the application runs. These are:

max-batches: The maximum number of groups of lines that can be loaded at the same time in main memory.
batch-lines: Maximum number of lines each group will contained.
num-threads: Number of threads that will be running the statistical tests at the same time. It should be the number of cores in your machine.
entries-per-thread: Number of variants a thread will ask about in the same database query. Must not be greater than batch-lines.
mmap-vcf: If set to false, the FILE* C API will be used for file reading. Otherwise, the file will be memory-mapped, increasing performance.

Quality control (to be implemented)

Per individual QC

  • Identification of individuals with discordant sex information
  • Missing genotype
  • Duplicates and related individuals (Identity by state IBS)
  • Divergent ancestry

Per variant QC

  • Missing genotype
  • Deviation from HWE (in control samples)
  • Genotype rates difference in controls and cases
  • MAF

Bibliography

2011-NatureProtocols-Clarke-Basic statistical analysis in genetic case-control studies

ped2vcf.sh - Script for PED&MAP to VCF conversion (403 Bytes) Cristina Gonzalez, 12/03/2012 06:18 pm