
Predictors tutorial

Introduction and purpose

In recent years, the use of microarrays as predictors of clinical outcome (van 't Veer et al., 2002), despite not being free of criticism (Simon, 2005), has fuelled the adoption of the methodology because of its practical implications in biomedicine. Many other fields, such as agriculture and toxicology, now use this methodology for prognostic and diagnostic purposes.

Predictors are used to assign new data (a microarray experiment in this case) to a specific class (e.g. diseased case or healthy control) based on a rule constructed from a previous dataset containing the classes among which we aim to discriminate. This dataset is usually known as the training set. The rationale behind this strategy is the following: if the differences between the classes (our macroscopic observations, e.g. cancer versus healthy cases) are a consequence of certain differences at the gene level, and these differences can be measured as differences in the level of gene expression, then it is (in theory) possible to find these gene expression differences and use them to assign the class membership of a new array. This is not always easy, but it can be attempted. There are different mathematical methods and operative strategies that can be used for this purpose.

Class prediction is a web interface that helps in the process of building a “good predictor”. We have implemented several widely accepted strategies so that Class prediction can build simple, yet powerful predictors, along with a carefully designed cross-validation of the whole process (in order to avoid the widespread problem of “selection bias”).

Class prediction allows combining several classification algorithms with different methods for gene selection.

How to build a predictor

A predictor is a mathematical tool that is able to use a dataset composed of different classes of objects (here microarrays) and “learn” to distinguish between these classes. There are different methods that can do this (see below). The most important aspect of this learning is the evaluation of the performance of the classifier. This is usually carried out by means of a procedure called cross-validation. The figure illustrates the way in which cross-validation works. The original dataset is randomly divided into several parts (in this case three, which corresponds to three-fold cross-validation). Each part must contain a fair representation of the classes to be learned. Then, one of the parts is set aside (the test set) and the remaining parts (the training set) are used to train the classifier. The efficiency of the classifier is then checked using the corresponding test set, which has not been used for training. This process is repeated as many times as the number of partitions and, finally, an average of the classification efficiency is obtained.

Figure 1 - Cross validation
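
The following is a minimal sketch (not part of Class prediction itself) of the three-fold cross-validation depicted in the figure, written in Python with scikit-learn on random toy data; the classifier, the data sizes and the stratified splitting are assumptions chosen only for illustration.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 100))   # toy data: 30 arrays x 100 genes
    y = np.repeat([0, 1], 15)        # two classes, e.g. disease vs control

    # StratifiedKFold keeps a fair representation of both classes in every part
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

    accuracies = []
    for train_idx, test_idx in cv.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=3)
        clf.fit(X[train_idx], y[train_idx])                      # train on two of the parts
        accuracies.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out part

    print("average cross-validated accuracy:", np.mean(accuracies))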

Methods

We have included in the program several methods that have been shown to perform very well with microarray data (Dudoit et al., 2002; Romualdi et al., 2003; Wessels et al., 2005). These are support vector machines (SVM), k-nearest neighbours (KNN), diagonal linear discriminant analysis (DLDA), self-organizing maps (SOM) and shrunken centroids (PAM); an illustrative sketch of how such classifiers can be applied is given after the list below.

Classification methods

  • Diagonal Linear Discriminant Analysis (DLDA): DLDA is the simplest of the classifiers: it assigns each sample to the class whose average profile it is closest to. In spite of its simplicity and its somewhat unrealistic assumptions (independent multivariate normal class densities), this method has been found to work very well (Dudoit et al., 2002).
  • Nearest neighbour (KNN): KNN is a non-parametric classification method that predicts the class of a test case as the majority vote among the k nearest neighbours of the test case (Ripley, 1996; Hastie et al., 2001). The number of neighbours used (k) is often chosen by cross-validation.
  • Support Vector Machines (SVM): SVMs (Vapnik, 1999) have gained popularity as classifiers for microarrays (Furey et al., 2000; Lee & Lee, 2003; Ramaswamy et al., 2001). The SVM tries to find the hyperplane that best separates the data belonging to two different classes. It maximizes the margin, which is the minimal distance between the hyperplane and each data class. When the data are not separable, the SVM still tries to maximize the margin but allows some classification errors, subject to the constraint that the total error (the distance from the hyperplane to the wrongly classified samples) is below a threshold.
  • PAM or shrunken centroids: The PAM method (Tibshirani et al., 2002) is very similar to DLDA, but it uses shrunken centroids instead of the class means. It is expected to be more efficient.
  • SOM method (Kohonen et al., 1984): The SOM addresses difficult high-dimensional and nonlinear problems such as feature extraction and classification. We use our own implementation with some improvements.
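
As a rough illustration (again in Python with scikit-learn, not the actual implementations used by Class prediction), the sketch below cross-validates assumed analogues of some of the classifiers listed above: NearestCentroid approximates the centroid-based DLDA/PAM idea (with shrink_threshold it mimics shrunken centroids), KNeighborsClassifier stands in for KNN and SVC for the SVM; the SOM has no standard scikit-learn counterpart and is omitted.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 200))   # toy expression matrix: 40 arrays x 200 genes
    y = np.repeat([0, 1], 20)

    classifiers = {
        "centroid (DLDA/PAM-like)": NearestCentroid(),
        "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
        "SVM (linear)": SVC(kernel="linear", C=1.0),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=cv)
        print(f"{name}: mean cross-validated accuracy = {scores.mean():.2f}")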

Variable selection: finding the "important genes"

Some methods require a previous selection of genes for the learning process. In this case one might want to preselect the genes which will potentially provide more accuracy to the predictor. This step of gene selection is called, in the machine learning literature, the “filter approach”, because we first “filter” the predictor variables (in our case genes), keep only a subset, and then build the predictor. We have implemented two ways of ranking genes, which can be used in combination with any of the above class-prediction algorithms; an illustrative sketch of both criteria is given after the list below.

  • F-ratio, the ratio of between-class to within-class sums of squares (Dudoit et al., 2002).
  • Wilcoxon statistic, a non-parametric test for differences between two classes.
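
A sketch of how the two ranking criteria can be computed, using assumed stand-ins (scikit-learn's f_classif for the F-ratio and SciPy's rank-sum test for the Wilcoxon statistic) on random toy data:

    import numpy as np
    from scipy.stats import ranksums
    from sklearn.feature_selection import f_classif

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 500))   # toy data: 30 arrays x 500 genes
    y = np.repeat([0, 1], 15)        # two classes

    # F-ratio (between-class to within-class sums of squares) for every gene
    f_scores, _ = f_classif(X, y)

    # Wilcoxon rank-sum statistic for every gene (two-class case)
    w_scores = np.array([abs(ranksums(X[y == 0, g], X[y == 1, g])[0])
                         for g in range(X.shape[1])])

    # Rank genes from most to least discriminant and keep, e.g., the top 20
    top_by_f = np.argsort(f_scores)[::-1][:20]
    top_by_wilcoxon = np.argsort(w_scores)[::-1][:20]
    print("top genes by F-ratio:  ", top_by_f[:5])
    print("top genes by Wilcoxon: ", top_by_wilcoxon[:5])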

After ranking the genes, we examine the performance of the class prediction algorithm using different numbers of the best-ranked genes and select the best-performing predictor. In the current version of Class prediction we build the predictor using the best 2, 5, 10, 20, 35, 50, 75 and 100 genes. You can, however, choose other combinations of numbers. Note that most of the methods require at least 2 genes to start with.

Potential sources of errors

Selection bias

If the gene selection process is not taken into account in the cross-validation, the error estimates will be artificially optimistic. This is the problem of selection bias, which has been discussed several times in the microarray literature (Ambroise & McLachlan, 2002; Simon et al., 2003). Essentially, the problem is that we use all the arrays to do the filtering and then cross-validate only the classifier, with an already selected set of genes. This cannot properly account for the effect of pre-selecting the genes. As just said, this can lead to severe underestimates of the prediction error (and the references given provide several alarming examples). In addition, it is very easy to obtain (apparently) very efficient predictors with completely random data if we do not account for the pre-selection, as the sketch below illustrates.
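
The toy sketch below (illustrative only, with random data and assumed scikit-learn stand-ins) shows the effect: selecting genes on the full dataset before cross-validation gives an optimistic accuracy even though the data contain no signal, whereas re-selecting the genes inside every fold gives an estimate close to chance.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 5000))   # completely random "expression" data, no real signal
    y = np.repeat([0, 1], 20)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
    clf = KNeighborsClassifier(n_neighbors=3)

    # Biased: the 20 "best" genes are selected using ALL arrays, then cross-validated
    X_filtered = SelectKBest(f_classif, k=20).fit_transform(X, y)
    biased = cross_val_score(clf, X_filtered, y, cv=cv).mean()

    # Unbiased: the filter is part of the pipeline, so it is re-fitted on the
    # training part of every fold
    pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
    unbiased = cross_val_score(pipe, X, y, cv=cv).mean()

    # With random data the biased estimate is typically well above the 0.5
    # expected by chance, while the unbiased one stays close to 0.5
    print("biased estimate:  ", round(biased, 2))
    print("unbiased estimate:", round(unbiased, 2))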

Finding the best subset among many trials

The optimal number of genes to be included in a predictor is not known beforehand; it depends on the predictor itself and on the particular dataset. A rather intuitive way of approaching this is to build predictors for different numbers of genes (see above). Unfortunately, by doing this we are again falling into a situation of selection bias, because we are estimating the error rate of the predictor without taking into account that we are choosing the best among several trials (8 in this case).

Thus, another layer of cross-validation has been added. We need to evaluate the error rate of a predictor that is built by selecting, among a set of rules, the one with the smallest error.

The cross-validation strategy implemented here returns the cross-validated error rate of the complete process. The cross-validation affects the complete process of building several predictors and then choosing the one with the smallest error rate.
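
A sketch of this extra layer, using nested cross-validation built from assumed scikit-learn components (an inner grid search choosing the number of genes, an outer loop estimating the error of the whole selection process); it illustrates the idea rather than the exact procedure implemented in Class prediction.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 1000))   # toy data: 40 arrays x 1000 genes
    y = np.repeat([0, 1], 20)

    pipe = Pipeline([("filter", SelectKBest(f_classif)),
                     ("clf", KNeighborsClassifier(n_neighbors=3))])
    grid = {"filter__k": [2, 5, 10, 20, 35, 50, 75, 100]}   # default gene numbers

    # Inner loop: choose the best number of genes; outer loop: estimate the
    # error of that complete selection process
    inner = GridSearchCV(pipe, grid, cv=StratifiedKFold(3, shuffle=True, random_state=4))
    outer = StratifiedKFold(5, shuffle=True, random_state=4)
    score = cross_val_score(inner, X, y, cv=outer).mean()
    print("nested cross-validated accuracy:", round(score, 2))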

The complete strategy for building a predictor

There are two main factors that we consider in the strategy for building the optimal predictor: which prediction method is the best, and what is the optimal (minimal) number of genes needed to render a good prediction.

The strategy for building an unbiased predictor is as follows:

1. Class prediction produces as many leave-one-out (LOO) samples as there are arrays in the complete dataset. For each LOO sample, Class prediction ranks the genes with the chosen ranking method (Wilcoxon or F-ratio).

2. Then, for each LOO sample, different subsets (2, 5, 10, 20, 35, 50, 75 and 100 by default, but other combinations can be provided by the user) of the most discriminant genes are selected from the top of the ranking, and each of the selected prediction methods is trained with each of these subsets of genes.

3. For each combination of gene subset and method, the corresponding left-out sample is predicted in order to compute the cross-validation error.

4. The optimal number of discriminant genes, along with the best classification method, can be decided from these results and saved for further use.
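
The sketch below mirrors steps 1 to 4 with explicit loops (illustrative analogues only, using an F-ratio ranking and three scikit-learn classifiers on random toy data; it is not the actual Class prediction code).

    import numpy as np
    from sklearn.feature_selection import f_classif
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    X = rng.normal(size=(30, 300))                    # toy data: 30 arrays x 300 genes
    y = np.repeat([0, 1], 15)
    gene_numbers = [2, 5, 10, 20, 35, 50, 75, 100]    # default subsets of top genes
    methods = {"KNN": KNeighborsClassifier(n_neighbors=3),
               "SVM": SVC(kernel="linear"),
               "centroid": NearestCentroid()}

    errors = {(m, k): 0 for m in methods for k in gene_numbers}
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Step 1: rank the genes using only the training part of this LOO sample
        f_scores, _ = f_classif(X[train_idx], y[train_idx])
        ranking = np.argsort(f_scores)[::-1]
        for k in gene_numbers:                        # step 2: top-k gene subsets
            genes = ranking[:k]
            for name, clf in methods.items():
                clf.fit(X[train_idx][:, genes], y[train_idx])
                # Step 3: predict the left-out array to accumulate the CV error
                if clf.predict(X[test_idx][:, genes])[0] != y[test_idx][0]:
                    errors[(name, k)] += 1

    # Step 4: the combination with the fewest errors is the candidate predictor
    best = min(errors, key=errors.get)
    print("best (method, n_genes):", best, "with", errors[best], "errors out of", len(y))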