Functional SNP prioritization

Non-synonymous SNPs

These SNPs are likely to have a phenotypic effect. The effect of non-synonymous coding SNPs can be analyzed in terms of the physico-chemical properties of the affected proteins. The program SNPeffect (Reumers J. et al., 2005) uses computational tools to predict the effect of mutations on the physico-chemical and biological properties of the affected proteins. Rather than focusing on the degree of conservation of the mutated amino acid, SNPeffect tries to attribute the effect of a mutation to a specific structural or physico-chemical property, ranging from protein aggregation to the disruption of protein-protein interactions, or from changes in protein turnover rate to subcellular (mis)localisation.

An estimate of the selective constraints acting at the codon level for non-synonymous coding SNPs can also be obtained by selecting "Pathological mut. predicted by selective constraints (w=dN/dS)" and choosing the omega interval within which SNPs are shown. The stronger the purifying selection, the lower w will be. When w=0, no non-synonymous changes are inferred at the codon, that is, the encoded amino acid is identical in all the sequences compared. With w values between 0 and 1 (the default), the association of a SNP with mutations observed at high frequency in human disease can be predicted (Arbiza L. et al., in press). Other ranges may be selected. The highest recommended upper bound is w<=0.2 because, as reported in Arbiza L. et al., codons with a 95% probability of being under purifying selection were not observed beyond this value. The maximum value is set to w=1 because the statistical tests needed to support the inference of positive selection (w>1) are not included in this version. Estimates of selective pressure at the codon level are obtained through two different methods.

  • For w-bay values, a maximum likelihood fit of the parameters of models that consider different classes of sites (under purifying selection, neutral evolution, or positive selection) is used (Yang Z. et al., 2002). Two models implemented in PAML, M2 and M8, are used for the computation; they assume different distributions for the omega classes. In both cases the codon w-bay values are estimated with the Bayes Empirical Bayes approach. The w-bay estimate for the codon, within the user-chosen range of w values, is taken from either the M2 or the M8 model, whichever fits better as determined by the log-likelihood values.
  • The w-slr value shown is obtained with the Slr program by Massingham T. et al. In this case a similar but modified approach is used to estimate the actual site-wise w values. In addition, a site-by-site analysis with a statistical definition of positive or negative selection is carried out by means of a site-wise likelihood ratio test, evaluating p-values adjusted for multiple testing.

In all cases, the maximum number of orthologous sequences available from the Ensembl vertebrates set (human, chimp, mouse, rat, dog, opossum, chicken, pufferfish, tetraodon, zebrafish, and frog) is used for the multiple sequence alignment. Ortholog annotations are obtained from the corresponding version of the Ensembl-Compara DB.
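
The omega-interval filter described above can be sketched as follows. This is only an illustration, not the code used by the tool: the record layout, the field names and the example values are hypothetical, and "better fit" is assumed to mean the higher log-likelihood.

```python
from dataclasses import dataclass


@dataclass
class CodonEstimate:
    """Per-codon omega estimates for one SNP (hypothetical record layout)."""
    snp_id: str
    w_bay_m2: float  # Bayes Empirical Bayes estimate under model M2
    w_bay_m8: float  # Bayes Empirical Bayes estimate under model M8
    lnl_m2: float    # log-likelihood of the M2 fit
    lnl_m8: float    # log-likelihood of the M8 fit
    w_slr: float     # site-wise estimate from Slr


def best_w_bay(est):
    """Return w-bay from the model with the higher log-likelihood (assumption)."""
    return est.w_bay_m8 if est.lnl_m8 > est.lnl_m2 else est.w_bay_m2


def filter_by_omega(estimates, w_min=0.0, w_max=1.0, use_slr=False):
    """Keep SNP ids whose codon omega lies in the closed interval [w_min, w_max]."""
    kept = []
    for est in estimates:
        w = est.w_slr if use_slr else best_w_bay(est)
        if w_min <= w <= w_max:
            kept.append(est.snp_id)
    return kept


# Made-up values, for illustration only.
examples = [
    CodonEstimate("snp_a", 0.05, 0.08, -1200.0, -1195.5, 0.04),
    CodonEstimate("snp_b", 0.60, 0.55, -980.2, -985.0, 0.70),
    CodonEstimate("snp_c", 0.15, 0.30, -450.0, -449.0, 0.10),
]
strict = filter_by_omega(examples, 0.0, 0.2)   # recommended upper bound
default = filter_by_omega(examples)            # default interval 0 to 1
```

With these invented numbers only snp_a passes the w<=0.2 filter on w-bay, while all three SNPs pass the default interval.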

TFBS

In the search for SNPs with a potential phenotypic effect, the region 5Kb upstream of each gene in the list (taken as its promoter region) is scanned for possible transcription factor binding sites (TFBSs). The Match™ program (Kel et al., 2003) from the Transfac™ database (Wingender et al., 2000) was used for this purpose. SNPs located within these motifs are considered to have a putative phenotypic effect on the expression of the gene.

Because the presence of a TFBS is only a prediction, users can make the prediction more reliable by restricting the filter to SNPs located in putative TFBSs within regions conserved (or highly conserved) in mouse.

The parameters used in the TFBS analysis were:

  • Vertebrate matrices
  • Use high quality matrices
  • Minimize false positives

A complementary approach for TFBS identification has been included, which uses the position weight matrices (PWMs) deposited in JASPAR. JASPAR is an open-access database of annotated, high-quality, matrix-based transcription factor binding site profiles for multicellular eukaryotes. Its models comprise 111 profiles derived exclusively from published collections of experimentally defined TFBSs for multicellular eukaryotes. We use the vertebrate matrices to search for TFBSs in the 5Kb upstream region of all human genes. To this end we use MatScan, a program that searches for binding sites in genomic sequences. Since MatScan does not provide a cutoff to minimize false positives, we also use the Meta program to filter the results by looking for coincident TFBSs in the orthologous mouse genes.

Exonic Splicing Enhancer (ESE)

Mutations that inactivate ESE sequences may result in exon skipping, malformation, etc. ESEs also appear to be important in exons that normally undergo alternative splicing. Different classes of ESE consensus motifs have been described, but they are not always easily identified. We have developed a script that scans exon sequences to identify putative ESEs responsive to the human SR proteins SF2/ASF, SC35, SRp40 and SRp55, using the nucleotide-frequency matrices available for them (Cartegni et al., 2003). Each site receives a score related to the likelihood that it is a real ESE. Only ESE sites with scores over the threshold (Cartegni et al., 2003) are taken into account in the analysis. If a SNP disrupts one of these sequences, the score of the mutated sequence is also calculated. A large difference between the two scores suggests a more drastic effect of the SNP.
Again, to make the prediction more reliable, the search can be run on all regions, only on regions conserved in mouse, or only on regions highly conserved in mouse.
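
The comparison between the reference and the mutated score can be sketched as below. The matrix and threshold are toy values chosen for the example, not the published SF2/ASF, SC35, SRp40 or SRp55 matrices or thresholds, and the function names are hypothetical.

```python
# Toy score matrix for a 4-nt motif (one dict per motif position).
# Hypothetical values, NOT one of the published SR protein matrices.
TOY_MATRIX = [
    {"A": 1.0, "C": -0.5, "G": 0.25, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -0.5, "T": 0.0},
    {"A": 0.5, "C": -1.0, "G": 1.0, "T": -0.5},
    {"A": 1.0, "C": 0.0, "G": -1.0, "T": -0.5},
]
TOY_THRESHOLD = 2.0  # hypothetical cutoff


def score(site, matrix):
    """Sum the matrix entries for each base of the site."""
    return sum(column[base] for column, base in zip(matrix, site))


def ese_effect(exon, snp_pos, alt, matrix, threshold):
    """Score every window overlapping the SNP with both alleles.

    Only windows whose reference score is over the threshold are reported,
    as (start, ref_score, alt_score, difference).
    """
    width = len(matrix)
    mutated = exon[:snp_pos] + alt + exon[snp_pos + 1:]
    first = max(0, snp_pos - width + 1)
    last = min(snp_pos, len(exon) - width)
    results = []
    for start in range(first, last + 1):
        ref_score = score(exon[start:start + width], matrix)
        if ref_score <= threshold:
            continue  # not a predicted ESE with the reference allele
        alt_score = score(mutated[start:start + width], matrix)
        results.append((start, ref_score, alt_score, alt_score - ref_score))
    return results


# Made-up exon with a G>C SNP at offset 4.
effects = ese_effect("TTACGATT", 4, "C", TOY_MATRIX, TOY_THRESHOLD)
```

Here the only predicted site is ACGA at offset 2; the C allele lowers its score from 4.0 to 2.0, which no longer exceeds the toy threshold.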

Exonic Splicing Silencer (ESS)

Exonic splicing silencers are cis-regulatory elements that inhibit the use of adjacent splice sites, often contributing to alternative splicing. Wang Z. et al. described a list of 103 hexamers (the FAS-hex-3 set) identified as ESS candidates by genetic selection; we scanned the exon sequences of the human genes to identify putative ESSs from Wang's set. SNPs located in these motifs were recorded.
To make the prediction more reliable, the search can be run on all regions, only on regions conserved in mouse, or only on regions highly conserved in mouse.
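
A hexamer scan of this kind can be sketched in a few lines. The two hexamers below are placeholders standing in for the 103-hexamer FAS-hex-3 set, which is not reproduced here; offsets are 0-based within the exon, and all names are hypothetical.

```python
# Placeholder hexamers for illustration; the real input would be the
# 103 hexamers of the FAS-hex-3 set.
ESS_HEXAMERS = {"TTTGGG", "GGGTAG"}


def find_ess_sites(exon, hexamers):
    """Return the start offsets of every hexamer occurrence (overlaps allowed)."""
    return [i for i in range(len(exon) - 5) if exon[i:i + 6] in hexamers]


def snps_in_ess(snp_offsets, sites):
    """Map each SNP offset to the start offsets of the ESS sites covering it."""
    covered = {}
    for snp in snp_offsets:
        overlapping = [s for s in sites if s <= snp < s + 6]
        if overlapping:
            covered[snp] = overlapping
    return covered


exon = "ACTTTGGGTAGCA"                       # made-up exon sequence
sites = find_ess_sites(exon, ESS_HEXAMERS)
recorded = snps_in_ess([0, 6, 9], sites)
```

In this example the SNP at offset 6 lies in both overlapping hexamers, the SNP at offset 9 in one, and the SNP at offset 0 in none.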

Triplex

DNA triplexes (Pauling and Corey, 1953; Felsenfeld et al., 1957) have been suggested as regulatory regions controlling gene expression (Goñi et al., 2004). Putative triplex-forming sequences are runs of more than 10 purines or pyrimidines; SNPs located in the middle of such sequences can affect triplex formation and hence disturb the normal regulation of a particular gene.
To detect these possibly functional SNPs, we scan sequences from 5Kb upstream to the 3' end of the genes (taken from Ensembl) looking for runs of more than 10 purines or pyrimidines (putative DNA triplexes); SNPs located in those regions are then selected as putative triplex-disrupting SNPs.
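
The search for polypurine or polypyrimidine runs can be sketched with a regular expression. This assumes that "more than 10" means at least 11 consecutive bases of one class; the sequence, coordinates and function names are made up for the example.

```python
import re

# At least 11 consecutive purines (A/G) or 11 consecutive pyrimidines (C/T).
TRACT = re.compile(r"[AG]{11,}|[CT]{11,}")


def find_tracts(sequence):
    """Return (start, end) pairs, 0-based half-open, of putative triplex tracts."""
    return [(m.start(), m.end()) for m in TRACT.finditer(sequence.upper())]


def snps_in_tracts(snp_positions, tracts):
    """Return the SNP positions located inside any tract."""
    return [p for p in snp_positions
            if any(start <= p < end for start, end in tracts)]


# Made-up sequence: a 12-nt purine run and a 12-nt pyrimidine run.
sequence = "TC" + "AGGAGAAGGAGA" + "CT" + "GA" + "CTTCTCCTTCTC" + "G"
tracts = find_tracts(sequence)
disrupting = snps_in_tracts([1, 8, 16, 29], tracts)
```

A run of exactly 10 bases, such as AGAGAGAGAG flanked by pyrimidines, is not reported under this reading of the length criterion.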

Splice sites

The intron/exon structure of the genes and the corresponding sequences were extracted in order to find all the SNPs altering the two conserved nucleotides at each side of the splicing point that constitute the splicing signal.
We also use GeneID to find possible new splice sites. GeneID is a program that predicts genes in genomic sequences; splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). We use this program to scan the whole genome for new splice sites and to map SNPs that could disrupt these important sites.
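
Locating SNPs in the conserved splice-signal dinucleotides of annotated introns can be sketched as follows. The sketch assumes that the signal consists of the first two and the last two bases of each intron, that the gene lies on the forward strand, and that exons are given as sorted 0-based half-open (start, end) pairs; the coordinates and names are hypothetical. It does not cover the GeneID predictions.

```python
def splice_signal_positions(exons):
    """Map each splice-signal position to 'donor' or 'acceptor'.

    exons: sorted list of (start, end) pairs, 0-based half-open,
    for a forward-strand transcript (assumption).
    """
    positions = {}
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # The intron spans [prev_end, next_start).
        if next_start - prev_end < 4:
            continue  # too short to hold both dinucleotides
        positions[prev_end] = "donor"
        positions[prev_end + 1] = "donor"
        positions[next_start - 2] = "acceptor"
        positions[next_start - 1] = "acceptor"
    return positions


def snps_in_splice_signals(snp_positions, exons):
    """Return {snp_position: signal_type} for SNPs hitting a splice signal."""
    signals = splice_signal_positions(exons)
    return {p: signals[p] for p in snp_positions if p in signals}


exons = [(0, 10), (20, 30), (45, 60)]        # made-up gene structure
altered = snps_in_splice_signals([5, 11, 19, 32, 43], exons)
```

For a gene on the reverse strand the donor and acceptor labels would have to be swapped.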

miRNAs and their targets

MicroRNAs (miRNAs) act as repressors of protein-coding genes by binding to target sites in the 3' UTR of mRNAs. In this release, we scan the genome to find SNPs located in miRNAs. In addition, we use miRanda, an algorithm for the detection of potential microRNA target sites in genomic sequences, to locate all the SNPs situated in target sites within the 3' UTR. Both SNPs in miRNAs and SNPs in their target sequences could affect the normal function of these regulatory elements. This effect is measured by the difference in score between the alleles of the SNP.

Mouse/Human conserved regions

Conserved regions are obtained directly from Ensembl. The human/mouse whole-genome comparison is performed with the BLASTz program to obtain the conserved regions. The BLASTz dataset is further processed to produce a subset of highly conserved regions by rescoring the initial alignments with the so-called 'tight' nucleotide scoring matrix (using the 'subsetAxt' program from Jim Kent, UCSC) with a gap open penalty of 2000 and a gap extension penalty of 50, as described in the Ensembl multicontigview help page.
