Single enrichment analysis: Two-step functional analysis
The functional interpretation of genomic data is usually performed by studying the enrichment of some type of biologically relevant annotation among the genes or proteins selected by the experiment, relative to the distribution of that annotation in the background, typically the rest of the genes or proteins in the genome.
Single enrichment analysis is less sensitive than gene set analysis and is recommended in situations in which the genes are selected in the experiment in a categorical way (for example, because they are present in amplified or deleted regions, or because they are targets of regulatory factors).
In many cases this selection of genes is performed by multiple individual, gene-wise tests. This testing strategy is quite conservative and, in the end, reduces the testing power of the whole procedure, because a large number of false negatives are accepted in order to keep the proportion of false positives low.
- FatiGO takes two lists of genes: ideally a group of interest and the rest of the genes in the experiment, although any two groups, formed in any way, can be tested against each other.
- These two lists are converted into two lists of functional terms using the corresponding gene or protein - term annotation table.
- Then a Fisher's exact test for 2×2 contingency tables is used to check for significant over-representation of functional terms in one of the lists with respect to the other one.
- Multiple testing correction is applied to account for the multiple hypotheses tested (one for each functional term). FatiGO uses the Benjamini and Hochberg (B&H) FDR method.
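The first three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not FatiGO's actual implementation: the function names `fisher_exact_two_sided` and `term_pvalues`, the gene identifiers, the term names and the dictionary-based annotation table are all hypothetical. The two-sided p-value is computed by summing the probabilities of every table that is no more likely than the observed one, which is one common convention for Fisher's exact test.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: genes in list 1 annotated with the term, b: genes in list 1 without it,
    c: genes in list 2 annotated with the term, d: genes in list 2 without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x annotated genes in list 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as likely as the observed one
    # (a small relative tolerance guards against floating-point ties).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def term_pvalues(list1, list2, annotation):
    """Return {term: raw p-value} comparing term frequencies in two gene lists.

    `annotation` maps a gene identifier to the set of terms annotated to it.
    """
    set1, set2 = set(list1), set(list2)
    terms = set()
    for gene in set1 | set2:
        terms.update(annotation.get(gene, ()))
    results = {}
    for term in sorted(terms):
        a = sum(term in annotation.get(g, ()) for g in set1)
        c = sum(term in annotation.get(g, ()) for g in set2)
        results[term] = fisher_exact_two_sided(a, len(set1) - a, c, len(set2) - c)
    return results


# Hypothetical toy data: four genes of interest versus four background genes.
annotation = {
    "g1": {"T1"}, "g2": {"T1"}, "g3": {"T1"}, "g4": {"T2"},
    "g5": {"T1"}, "g6": {"T2"}, "g7": {"T2"}, "g8": {"T2"},
}
interest = ["g1", "g2", "g3", "g4"]
rest = ["g5", "g6", "g7", "g8"]
pvalues = term_pvalues(interest, rest, annotation)
```

The values in `pvalues` are raw, per-term p-values; they still need the multiple testing correction described in the last step before any term is called significant.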
The functionality of the old modules FatiWise and TransFat (Al-Shahrour et al., 2005) and FatiGO+ (Al-Shahrour et al., 2007) has been completely included here and, consequently, these modules have been discontinued.
Multiple testing problem
Great caution is needed when dealing with large data sets because of the high occurrence of spurious associations (Ge et al., 2003).
Addressing multiple testing properly is a rather complex problem. Many of the conventional correction methods (e.g. Bonferroni or Sidak) protect against obtaining even a single false positive. The Bonferroni correction, for instance, divides a reasonable significance threshold (e.g. p < 0.05) by the number of tests performed to obtain a new, stricter threshold, which is equivalent to multiplying each p-value by the number of tests. When many thousands of tests are performed, this approach risks being too conservative.
The multiple testing problem in functional assignment does not require protection against even a single false positive, and the drastic loss of power involved in such protection is unjustified. It is more appropriate to control the proportion of errors among the identified functional terms, that is, among the terms whose differences between groups of genes or proteins are declared not attributable to chance. The expectation of this proportion is the False Discovery Rate (FDR). Different procedures offer strong control of the FDR under independence and some specific types of positive dependence of the test statistics (Benjamini and Hochberg, 1995), or under arbitrary dependency of the test statistics (Westfall and Young, 1993).
FatiGO returns adjusted p-values based on the FDR method of accounting for multiple testing (Benjamini and Hochberg, 1995).
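To make the difference between the two kinds of correction concrete, the sketch below computes Bonferroni-adjusted and Benjamini and Hochberg adjusted p-values for a small, made-up list of raw p-values. The function names and the input values are illustrative assumptions. This is not the code FatiGO runs, only the standard step-up adjustment associated with Benjamini and Hochberg (1995).

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini and Hochberg FDR adjustment, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per functional term
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

With these inputs, Bonferroni multiplies every p-value by four, whereas the Benjamini and Hochberg procedure scales each one by the number of tests divided by its rank. As a result, all four terms stay below 0.05 after the FDR adjustment, but only two do after the Bonferroni adjustment.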
- Al-Shahrour, F., Minguez, P., Tárraga, J., Medina, I., Alloza, E., Montaner, D. & Dopazo, J. (2007). FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Research 35 (Web Server issue): W91-W96.
- Al-Shahrour, F., Minguez, P., Tárraga, J., Montaner, D., Alloza, E., Vaquerizas, J.M., Conde, L., Blaschke, C., Vera, J. & Dopazo, J. (2006). BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic Acids Research 34 (Web Server issue): W472-W476.
- Al-Shahrour, F., Minguez, P., Vaquerizas, J.M., Conde, L. & Dopazo, J. (2005). BABELOMICS: a suite of web-tools for functional annotation and analysis of group of genes in high-throughput experiments. Nucleic Acids Research 33 (Web Server issue): W460-W464.
- Al-Shahrour, F., Díaz-Uriarte, R. & Dopazo, J. (2004). FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20: 578-580.