Most analyses in current bioinformatics consist of running thousands, millions or even billions of statistical tests in parallel. This applies to microarray or RNA-Seq analysis (e.g. selecting differentially expressed genes), motif detection (e.g. discovering motifs in the promoters of co-expressed genes), sequence similarity searches (comparing a query sequence against the millions of peptide sequences currently available in UniProt), genome-wide association studies (GWAS), and many other applications. A standard similarity search with BLAST against the whole UniProt database amounts to evaluating billions of possible alignments.
In a previous practical (Selecting differentially expressed genes), we applied Welch’s \(t\) test to select differentially expressed genes from a microarray series containing 190 ALL samples. By doing this, we successively tested, for each of the 22,283 probesets, the equality of the mean expression values between two classes of ALL. For each probeset, we computed a nominal p-value, which indicates the probability of obtaining by chance a difference at least as large as the one observed in the data. This p-value can be interpreted as an estimate of the False positive risk (FPR): the probability of declaring an observation significant whereas it is not. However, we did not take an important factor into account: since the same test was successively applied to 22,283 probesets, the risk of a false positive was repeated for each probeset. This situation is classically referred to as multiple testing. For example, if we accept an individual risk of 1%, we expect to observe \(1\% \cdot 22,283 \approx 223\) false positives when the same risk is taken for each probeset of the microarray series.
We will thus need to be more restrictive if we want to control the false positives. Several methods have been proposed to control the risk of false positives in situations of multiple testing. In this practical, we will investigate the practical consequences of multiple testing and explore some of the proposed solutions.
In order to get an intuition of the problems arising from multiple testing, we will generate three datasets where no difference is expected to be found between the mean expression values of two groups of samples.
Reload the normalized expression matrix from DenBoer, as described in the practical Selecting differentially expressed genes.
Generate an expression matrix of the same size, in a data frame named “rnorm.frame”, and fill it with random values sampled from a normal distribution of mean \(m=0\) and standard deviation \(sd=1\).
Create a second data frame of the same size, name it “denboer.permuted.values”, and fill it with the actual values of the DenBoer expression matrix, sampled in random order (NB: random sampling can be done with the R function sample()).
Create a vector named “denboer.permuted.subtypes” and fill it with a random sampling of the cancer subtypes (these values can be found in pheno$Sample_title). A sketch of the three control datasets is given below.
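The following R sketch illustrates one possible way to build these three negative controls. It assumes that the objects `expr.matrix` (the normalized DenBoer expression matrix, probesets in rows and samples in columns) and `pheno` (the phenotypic data) were created in the previous practical; your own object names may differ.

```r
## Assumed to be available from the previous practical:
##   expr.matrix : normalized DenBoer expression matrix (probesets x samples)
##   pheno       : phenotypic data, with cancer subtypes in pheno$Sample_title

n.probesets <- nrow(expr.matrix)
n.samples   <- ncol(expr.matrix)

## Control 1: matrix of the same size filled with N(0,1) random values
rnorm.frame <- as.data.frame(
  matrix(rnorm(n.probesets * n.samples, mean = 0, sd = 1),
         nrow = n.probesets, ncol = n.samples))

## Control 2: the actual DenBoer values, permuted in random order
denboer.permuted.values <- as.data.frame(
  matrix(sample(unlist(expr.matrix)),
         nrow = n.probesets, ncol = n.samples))

## Control 3: the original expression values, with randomly permuted subtype labels
denboer.permuted.subtypes <- sample(pheno$Sample_title)
```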
In the context of a Welch test, the p-value indicates the probability of obtaining by chance a \(t\) statistic greater than the one measured from the samples. Under the null hypothesis (i.e. if the data were sampled from populations with equal means), a p-value of 1% or lower is expected to be found at random once in 100 tests, a p-value of 5% or lower once in 20 tests, etc. The goal of this exercise is to test whether the p-values returned by our multiple Welch tests correspond to this expectation.
For each of the 3 control sets prepared in the previous section, and for the actual data set from DenBoer 2009, apply the following procedure, and compare the results.
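As an illustration of such a procedure (not necessarily the exact steps detailed in the exercise), the sketch below runs a Welch test on every probeset and plots the density histogram of nominal p-values. The logical vector `group1`, marking the sample columns of the first class, is an assumption introduced here for the example.

```r
## Welch t-test for each probeset (row); 'group1' is assumed to be a logical
## vector marking the samples of the first of the two compared classes.
welch.pvalues <- apply(expr.matrix, 1, function(x) {
  t.test(x[group1], x[!group1], var.equal = FALSE)$p.value
})

## Under the null hypothesis the p-values should be uniformly distributed,
## so the density histogram should be approximately flat around 1.
hist(welch.pvalues, breaks = 20, freq = FALSE,
     main = "Distribution of nominal p-values",
     xlab = "Welch test p-value")
abline(h = 1, col = "blue", lty = "dashed")
```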
As discussed in the solution of the previous section, the density histogram of p-values suggests that the DenBoer dataset contains a mixture of probesets corresponding to either of two categories: probesets whose p-values follow the uniform distribution expected under the null hypothesis (no real difference between the two classes), and probesets that are truly differentially expressed.
Storey and Tibshirani (2003) proposed an elegant way to estimate the number of probesets belonging to each of these two categories. The next exercise will lead you, step by step, to discover how Storey and Tibshirani estimate the numbers of truly null (\(m_0\)) and alternative (\(m_1\)) hypotheses.
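As a preview of that exercise, here is a minimal sketch of the idea: p-values above a threshold \(\lambda\) (e.g. \(\lambda = 0.5\)) are assumed to come almost exclusively from truly null probesets, which yields an estimate of the proportion \(\pi_0\) of truly null hypotheses, and hence of \(m_0\) and \(m_1\). The vector `welch.pvalues` from the previous sketch is reused here.

```r
## Storey & Tibshirani (2003) estimate of the proportion of truly null
## hypotheses, sketched with a single lambda value (the original paper
## fits a spline over a range of lambda values).
lambda <- 0.5
m      <- length(welch.pvalues)            # total number of tests
pi0    <- mean(welch.pvalues > lambda) / (1 - lambda)
m0     <- round(pi0 * m)                   # estimated number of truly null tests
m1     <- m - m0                           # estimated number of alternative tests
c(pi0 = pi0, m0 = m0, m1 = m1)
```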
After having estimated these three parameters for the DenBoer expression dataset, do the same for the three negative controls.
A classical approach in statistical tests is to define a priori a level of significance, i.e. a threshold on the p-value. We will reject the null hypothesis if a test returns a lower p-value than the threshold, and accept it otherwise. Rejecting the null hypothesis means that we consider the test as significant. In the current case, this means that rejected hypotheses will correspond to differentially expressed genes.
However, we are confronted with a problem: with the dataset from DenBoer (2009), we ran 22,283 tests in parallel. This means that, if we decide to use a classical p-value threshold of 1%, we accept, for each probeset, a risk of 1% of considering it significant whereas it follows the null hypothesis. Since this risk is repeated 22,283 times, we expect an average of about 223 false positives even in a dataset where no probeset is actually differentially expressed.

Count the numbers of significant probesets in the four datasets described above.
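One way to do this counting is sketched below for a single vector of nominal p-values (such as `welch.pvalues` computed above); the same command can be repeated for each of the four datasets, or applied with sapply() to a list gathering the four p-value vectors.

```r
## Number of probesets declared significant at a nominal p-value threshold of 1%
alpha <- 0.01
sum(welch.pvalues < alpha)
```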
The E-value is the expected number of false positives. This is the simplest and most intuitive correction for multiple testing: given a threshold on the nominal p-value and the number of tests performed, we can easily compute the expected number of false positives.
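In R this computation is a one-liner, sketched here with the `welch.pvalues` vector used above.

```r
## Expected number of false positives for a given threshold on the nominal p-value
alpha <- 0.01
m     <- length(welch.pvalues)   # number of tests (22,283 probesets in the DenBoer series)
alpha * m                        # expected number of false positives (~223)

## Per-probeset E-value: the nominal p-value multiplied by the number of tests
e.values <- welch.pvalues * m
```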
The q-value estimates the False discovery rate (FDR), i.e. the expected proportion of false positives among the cases declared positive. This makes an essential distinction with the False Positive Rate (FPR), which indicates the proportion of false positives among all the tests performed.
Under the null hypothesis, the FDR can be estimated with the procedure proposed by Benjamini and Hochberg (1995).
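Alongside the step-by-step computation developed in the exercise, a quick sketch is to use R's built-in p.adjust() function, which implements the Benjamini-Hochberg procedure.

```r
## Benjamini-Hochberg adjusted p-values: each nominal p-value is multiplied by
## the number of tests and divided by its rank, with monotonicity enforced, so
## that the adjusted values control the FDR at each threshold.
fdr.bh <- p.adjust(welch.pvalues, method = "BH")

## Number of probesets declared positive while controlling the FDR at 5%
sum(fdr.bh <= 0.05)
```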
Den Boer ML, van Slegtenhorst M, De Menezes RX, Cheok MH, Buijs-Gladdines JG, Peters ST, Van Zutven LJ, Beverloo HB, Van der Spek PJ, Escherich G et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 57(1): 289-300.
Storey JD, Tibshirani R. (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100(16): 9440-9445. PMID 12883005.