- MarmiteScan tutorial
MarmiteScan applies a threshold-free method (FatiScan), which extracts blocks of related genes from a list of genes ordered by an associated value, to the Marmite tool, which finds differential distributions of bioentities extracted from PubMed between two groups of genes.
Human genes are associated with a set of bioentities by a score that indicates the importance of each co-occurrence in the literature. MarmiteScan takes a list of genes ordered by an associated value and applies a threshold-free method (FatiScan) that produces a series of partitions, each dividing the list into two groups. For each partition, MarmiteScan tests whether the distribution of the scores, that is, the importance of the association with a bioentity, differs between the two groups.
MarmiteScan tells you whether there is an enrichment of any bioentity in your list of sorted genes.
Bioentity and gene co-occurrences¶
Starting with a set of documents (e.g. the documents where a certain gene or disease appears), we can define keywords as those words that are significantly overrepresented compared to a standard set or background. Words that appear with much higher frequencies than one would expect from chance alone can be considered the content words that capture the main features of this set of documents. In addition to single words, bi-grams (two adjacent words) were taken into account, because in many cases they carry more information than single words (e.g. “cell cycle” vs. “cell”, “cycle”). We refer to words and bi-grams as terms in the following. All words were stemmed before further treatment to increase the statistical significance of words.
For each term $i$, we count the number of documents where $i$ appears in the whole collection of documents ($X_i$ out of $N_{doc}$, our background) and in a specific document set $a$ ($X_{ia}$ out of $N_a$). Then, based on the hypergeometric distribution, the probability of finding $X_{ia}$ documents in a set of size $N_a$ is computed for each term. The more unlikely this event is, the more specific term $i$ is for the document set.
- $N_a$: number of documents in the set $a$
- $N_{doc}$: number of documents in the entire collection
- $X_i$: number of documents where term $i$ appears in $N_{doc}$
- $X_{ia}$: number of documents where term $i$ appears in $N_a$
Formula for calculating keyword relevance:
Mean value for term $i$ in collection $N_a$:
$$M_{ia} = N_a \cdot \frac{X_i}{N_{doc}}$$
The standard deviation of the distribution:
$$\sigma_{ia} = \sqrt{M_{ia} \left(1 - \frac{X_i}{N_{doc}}\right) \left(1 - \frac{N_a}{N_{doc}}\right)}$$
The Z-score for each term $i$ in $a$; the higher the score, the more relevant the term is for the document set:
$$Z_{ia} = \frac{X_{ia} - M_{ia}}{\sigma_{ia}}$$
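As a quick illustration, the keyword relevance formulas above can be written as a small Python function. This is a minimal sketch, not MarmiteScan's own code: the function name and the document counts in the example are hypothetical and chosen only to show the arithmetic.

```python
import math

def keyword_z_score(x_ia, x_i, n_a, n_doc):
    """Z-score of term i for document set a, following the formulas above.

    x_ia  : documents in set a that contain the term
    x_i   : documents in the whole collection that contain the term
    n_a   : size of document set a
    n_doc : size of the whole collection
    """
    background = x_i / n_doc                      # fraction of all documents containing the term
    mean = n_a * background                       # M_ia, expected count in set a
    sd = math.sqrt(mean * (1 - background) * (1 - n_a / n_doc))  # sigma_ia
    return (x_ia - mean) / sd

# Hypothetical counts, for illustration only: the term appears in 30 of the
# 200 documents in set a, but in only 1000 of 100000 documents overall.
z = keyword_z_score(x_ia=30, x_i=1000, n_a=200, n_doc=100000)
print(round(z, 2))
```

A term that appears in the set exactly as often as the background predicts gets a Z-score of 0; the sketch does not guard against a zero standard deviation (for example, a term that never appears in the collection).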
MarmiteScan applies a serial partitioning process to the gene list according to the values associated with the genes. The size of each window depends on those values.
For each partition, MarmiteScan evaluates the differences between the gene-bioentity co-occurrence values (scores) of the two groups of genes (top genes and bottom genes). We apply a Kolmogorov-Smirnov test to each pair of distributions formed by the co-occurrence scores between a bioentity and the genes in the list. Null values are excluded from the distributions; that is, only genes with a score indicating co-occurrence with the bioentity are included.
MarmiteScan only evaluates bioentities associated with a minimum number of genes in both groups (the default and lowest allowed value is 5; users can set a higher one).
We first apply a one-sided test of whether the top genes distribution is greater than the bottom genes distribution. If the p-value of this test is greater than 0.5, we test the opposite hypothesis, that the bottom genes distribution is greater than the top genes distribution. Finally, we report the more probable hypothesis, that is, the one with the smaller p-value.
MarmiteScan takes the multiple testing problem into account and adjusts p-values using FDR.
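The text does not spell out how the partition points are chosen, so the following Python sketch makes an explicit assumption: cut points are evenly spaced in value (not in rank) between the largest and smallest values. Under that assumption the size of each group depends on how the values are distributed. The function name `serial_partitions` and the parameter `n_partitions` are hypothetical.

```python
def serial_partitions(ranked, n_partitions=5):
    """Yield (threshold, top_genes, bottom_genes) for a series of value thresholds.

    `ranked` is a list of (gene, value) pairs. Evenly spaced thresholds are an
    assumption of this sketch, not the tool's documented scheme.
    """
    values = [value for _, value in ranked]
    lowest, highest = min(values), max(values)
    step = (highest - lowest) / (n_partitions + 1)
    for k in range(1, n_partitions + 1):
        threshold = highest - k * step
        top = [gene for gene, value in ranked if value >= threshold]
        bottom = [gene for gene, value in ranked if value < threshold]
        yield threshold, top, bottom

# Made-up genes and values, for illustration only
ranked = [("g1", 2.0), ("g2", 1.5), ("g3", 1.4), ("g4", 0.2), ("g5", -1.0)]
partitions = list(serial_partitions(ranked, n_partitions=2))
for threshold, top, bottom in partitions:
    print(threshold, top, bottom)
```

Because the thresholds live in value space, three genes with similar high values fall into the top group together at the first cut, even though the list has only five genes.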
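The two-step procedure above can be sketched in plain Python. Several things here are assumptions rather than MarmiteScan's actual implementation: the direction of each one-sided test is expressed in terms of empirical cumulative distribution functions (ECDFs), the p-value uses the asymptotic approximation exp(-2 · nm/(n+m) · D²) instead of an exact computation, and the names `one_sided_ks` and `evaluate_partition` are hypothetical. Note that a group whose ECDF lies above the other's tends to have lower scores; how this maps onto the "greater" wording in the text is not specified.

```python
import math
from bisect import bisect_right

def one_sided_ks(a, b):
    """Return (D, p) for the alternative 'the ECDF of a lies above the ECDF of b'.

    The p-value is the asymptotic approximation exp(-2 * n*m/(n+m) * D^2),
    which is rough for small samples.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The largest ECDF difference is attained at one of the observed values
    for x in a_sorted + b_sorted:
        diff = bisect_right(a_sorted, x) / n - bisect_right(b_sorted, x) / m
        d = max(d, diff)
    p = math.exp(-2.0 * n * m / (n + m) * d * d)
    return d, p

def evaluate_partition(top_scores, bottom_scores, min_genes=5):
    """Test one direction first; if p > 0.5, test the other and keep the smaller p."""
    # Assumption: an entity needs at least `min_genes` scored genes in each group
    if len(top_scores) < min_genes or len(bottom_scores) < min_genes:
        return None
    d_top, p_top = one_sided_ks(top_scores, bottom_scores)
    if p_top <= 0.5:
        return "top_cdf_above", d_top, p_top
    d_bottom, p_bottom = one_sided_ks(bottom_scores, top_scores)
    if p_top <= p_bottom:
        return "top_cdf_above", d_top, p_top
    return "bottom_cdf_above", d_bottom, p_bottom

# Made-up scores for one bioentity in one partition, for illustration only
result = evaluate_partition([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(result)
```

Genes without a score for the bioentity are assumed to have been dropped before the score lists are passed in, as described in the text.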
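The text says only that p-values are adjusted "using FDR". A common choice is the Benjamini-Hochberg procedure, so the sketch below assumes that method; the tool may use a different FDR variant. The function name `bh_adjust` and the p-values in the example are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, one per bioentity, for illustration only
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.5])
print(adjusted)
```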
Options to select¶
- Type of entity - Users can evaluate their genes using three categories of bioentities (disease associated words, chemical products, word roots).
- Filtering entities to test - Select the minimum number of genes with a score for an entity. Entities with fewer than this number in both lists will be excluded from the analysis. The default and minimum is 5.
- Number of entities to present in results - Select the number of bioentities presented on the results page. Entities with significant p-values are always shown, so this restriction never hides relevant information. Setting it to 0 means only significant bioentities are shown.
- Submit gene lists - Check this box if your lists contain only gene names [HGNC ids, HUGO ids, common names]. Annotations use HUGO ids, so MarmiteScan converts any gene id to a HUGO id through an Ensembl id; if you provide gene names, the conversion is skipped. Note that if you provide gene names and do not check the box, some genes may be excluded from the analysis because they match two Ensembl ids, or because the Ensembl id matches two HUGO names.
- Do you want us to sort genes/values for you? Indicate direction - Indicate whether your gene list is already ordered or whether you want us to order it for you. This option can also be used to change the hypothesis being tested.
We only provide data for human.
To submit your list of genes, make sure you provide a column of gene identifiers followed by a column of values (separated by a tab), with a new line at the end of each gene/value pair. Something like:
ENSG00000195449	2.05
ENSG00000191414	2.02
ENSG00000195603	1.95
ENSG00000191766	1.83
ENSG00000192778	1.56
ENSG00000192318	1.23
ENSG00000195909	1.22
ENSG00000195044	1.10
ENSG00000191421	0.85
ENSG00000190549	0.84
ENSG00000194579	0.79
ENSG00000193697	0.53
ENSG00000192817	0.41
ENSG00000189656	0.12
ENSG00000189674	0.01
ENSG00000190567	-0.03
ENSG00000195016	-0.12
...
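Before submitting, it can help to check that a file follows this layout. The Python sketch below parses tab-separated gene/value lines and checks whether they are already sorted in descending order. The function names are hypothetical, and the sample text reuses the first three pairs from the example above.

```python
def parse_ranked_list(text):
    """Parse 'gene<TAB>value' lines into a list of (gene, value) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline
        gene, value = line.split("\t")  # raises ValueError if the line is not two tab-separated fields
        pairs.append((gene, float(value)))
    return pairs

def is_sorted_descending(pairs):
    """True if the values never increase from one line to the next."""
    values = [value for _, value in pairs]
    return all(a >= b for a, b in zip(values, values[1:]))

sample = "ENSG00000195449\t2.05\nENSG00000191414\t2.02\nENSG00000195603\t1.95\n"
pairs = parse_ranked_list(sample)
print(pairs)
print(is_sorted_descending(pairs))
```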
MarmiteScan returns a table containing the entities found after applying the Kolmogorov-Smirnov test to the input genes. For each entity, the table gives the corresponding test statistic, p-value and adjusted p-value (see image below):