Introduction

Most analyses in current bioinformatics consist of performing thousands, millions or even billions of statistical tests in parallel. This applies to microarray or RNA-Seq analysis (e.g. selecting differentially expressed genes), motif detection (e.g. discovering motifs in the promoters of co-expressed genes), sequence similarity searches (comparing a query sequence against the millions of peptide sequences currently available in UniProt), genome-wide association studies (GWAS), and many other applications. A standard similarity search with BLAST against the whole UniProt database amounts to evaluating billions of possible alignments.

In a previous practical (Selecting differentially expressed genes), we applied a Welch’s \(t\) test to select differentially expressed genes from a microarray series containing 190 ALL samples. By doing this, we successively tested, for each of the 22,283 probesets, the equality of the mean expression values between two classes of ALL. For each gene, we computed a nominal p-value, which indicates the probability of obtaining by chance a difference at least as large as the one observed in the data. This p-value can be interpreted as an estimate of the False Positive Risk (FPR): the probability of considering an observation as significant whereas it is not. However, we did not take into account an important factor: since the same test was successively applied to 22,283 probesets, the risk of false positives was repeated for each probeset. This situation is classically denoted as multiple testing. For example, if we accept an individual risk of 1%, we expect to observe \(1\% \cdot 22,283 \approx 223\) false positives when the same risk is taken for each probeset of the microarray series.

We will thus need to be more restrictive if we want to control the false positives. Several methods have been proposed to control the risk of false positives in situations of multiple testing. In this practical, we will investigate the practical consequences of multiple testing and explore some of the proposed solutions.


Generating random control sets

In order to get an intuition of the problems arising from multiple testing, we will generate three datasets where no difference is expected to be found between the mean expression values of two groups of samples.

Exercise

  1. Reload the normalized expression matrix from DenBoer, as described in the practical Selecting differentially expressed genes.

  2. Generate an expression matrix of the same size, in a data frame named “rnorm.frame”, and fill it with random values sampled from a normal distribution of mean \(m=0\) and standard deviation \(sd=1\).

  3. Create a second data frame of the same size, name it “denboer.permuted.values”, and fill it with the actual values of the DenBoer expression matrix, sampled in random order (NB: random sampling can be done with the R function sample()).

  4. Create a vector named “denboer.permuted.subtypes” and fill it with a random sampling of the cancer subtypes (these values can be found in pheno$Sample_title). A code sketch covering steps 2 to 4 follows below.
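A minimal sketch of steps 2 to 4, assuming the normalized expression matrix has been loaded as a data frame named “denboer.expr” (a hypothetical name carried over from the previous practical) and the pheno table as “pheno”:

```r
## 2. Random values sampled from a standard normal distribution N(0, 1)
rnorm.frame <- as.data.frame(
  matrix(rnorm(nrow(denboer.expr) * ncol(denboer.expr), mean = 0, sd = 1),
         nrow = nrow(denboer.expr)))

## 3. Actual Den Boer expression values, permuted in random order
denboer.permuted.values <- as.data.frame(
  matrix(sample(unlist(denboer.expr)),
         nrow = nrow(denboer.expr)))

## 4. Random permutation of the cancer subtype labels
denboer.permuted.subtypes <- sample(pheno$Sample_title)
```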


Distribution of the p-values

In the context of a Welch test, the p-value indicates the probability of obtaining by chance a \(t\) statistic at least as extreme as the one measured from the samples. Under the null hypothesis (i.e. if the data were sampled from populations with equal means), a p-value of 1% is expected to be found at random once in 100 tests, a p-value of 5% once in 20 tests, etc. The goal of this exercise is to test whether the p-values returned by our multiple Welch tests correspond to this expectation.

Exercise

For each of the 3 control sets prepared in the previous section, and for the actual data set from DenBoer 2009, apply the following procedure, and compare the results.

  1. Run the Welch test on each probeset, using our custom function t.test.multi() (as we did in the practical on Selecting differentially expressed genes), to compare the two subtypes “hyperdiploid” and “TEL-AML1”.
  2. Draw a plot comparing mean expression of “hyperdiploid” (abscissa) and “TEL-AML1” (ordinate), and highlight the probesets declared significant with a p-value threshold of 1%. Count the number of significant genes and compare it to the random expectation.
  3. Draw a histogram of the p-value density, with bins of 5%, and draw a line representing the theoretical expectation for this distribution (a sketch of this step is given below).
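A minimal sketch of step 3, assuming the p-values returned by t.test.multi() have been stored in a vector named “p.values” (a hypothetical name):

```r
## Histogram of p-value density, with bins of 5%
hist(p.values, breaks = seq(0, 1, by = 0.05),
     freq = FALSE, col = "grey",
     main = "Distribution of p-values",
     xlab = "p-value", ylab = "Density")

## Under the null hypothesis, p-values are uniformly distributed on [0,1],
## so the expected density is 1 in every bin.
abline(h = 1, col = "blue", lwd = 2)
```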

Estimating the proportions of truly null and truly alternative hypotheses

As discussed in the solution of the previous section, the density histogram of p-values suggests that the Den Boer dataset contains a mixture of probesets falling into one of two categories:

  1. Genes whose mean expression value does not differ between the two considered cancer types (hyperdiploid and TEL_AML1). The corresponding probesets are called “truly null”, because they fit the null hypothesis: \[H_0: m_{hyper} = m_{TEL\_AML1}\] The number of truly null probesets will be denoted \(m_0\).
  2. Genes expressed differentially between the two considered cancer types. The corresponding probesets are called “truly alternative” because they fit the alternative hypothesis: \[H_1: m_{hyper} \neq m_{TEL\_AML1}\] The number of truly alternative probesets will be denoted \(m_1\).

Storey and Tibshirani (2003) proposed an elegant way to estimate the number of probesets belonging to these two categories. The next exercise will lead you, step by step, to discover how Storey and Tibshirani estimate the numbers of truly null (\(m_0\)) and truly alternative (\(m_1\)) hypotheses.

Exercise

  1. In the result table of the Welch test (Den Boer dataset), count the number of probesets having a p-value ≥ 0.6 (let us call this threshold p-value \(\lambda\) (“lambda”), and denote by \(n_\lambda\) the number of probesets whose p-value is at least \(\lambda\)).
  2. We can reasonably assume that the truly alternative probesets will tend to be concentrated in the low range of p-values on the density histogram. Consequently, we can suppose that the area of the histogram covering p-values from 0.6 to 1 is principally made of truly null probesets. On the basis of the number \(n_\lambda\) (with \(\lambda=0.6\)), try to estimate
    1. the total number \(m_0\) of truly null probesets.
    2. the total number \(m_1\) of truly alternative probesets.
    3. The proportion \(\pi_0\) (“pi zero”) of truly null among all probesets.

After having estimated these three parameters for the Den Boer expression dataset, do the same for the three negative controls. A sketch of the computation is given below.
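A minimal sketch of this estimation, again assuming the Welch p-values are in a hypothetical vector named “p.values”. The key idea is that truly null p-values are uniform on [0,1], so the probesets with p-value ≥ \(\lambda\) cover a fraction \(1 - \lambda\) of the truly null probesets:

```r
lambda <- 0.6
m <- length(p.values)                 # total number of probesets
n.lambda <- sum(p.values >= lambda)   # probesets with p-value >= lambda

## Truly null p-values are uniform, so n.lambda covers a fraction
## (1 - lambda) of the m0 truly null probesets.
m0 <- n.lambda / (1 - lambda)   # estimated number of truly null probesets
m1 <- m - m0                    # estimated number of truly alternative probesets
pi0 <- m0 / m                   # estimated proportion of truly null probesets
```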


Why do we need to correct for multiple testing?

A classical approach in statistical tests is to define a priori a level of significance, i.e. a threshold on the p-value. We will reject the null hypothesis if a test returns a lower p-value than the threshold, and accept it otherwise. Rejecting the null hypothesis means that we consider the test as significant. In the current case, this means that rejected hypotheses will correspond to differentially expressed genes.

However, we are confronted with a problem: with the dataset from Den Boer (2009), we ran 22,283 tests in parallel. This means that, if we decide to use a classical p-value threshold of 1%, we accept, for each probeset, a risk of 1% of considering it significant whereas it follows the null hypothesis. Since this risk is repeated 22,283 times, we expect an average of 223 false positives even in a dataset where no gene is actually differentially expressed.

Exercise

Count the numbers of probesets declared significant, with a threshold of 1% on the nominal p-value, in the four datasets described above (a sketch is given below).
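A minimal sketch for one dataset, assuming the t.test.multi() result table has been stored as “welch.denboer” (a hypothetical name) and contains a column named “P.value”:

```r
alpha <- 0.01                            # nominal p-value threshold
sum(welch.denboer$P.value < alpha)       # observed number of positives
alpha * nrow(welch.denboer)              # expected number of false positives
```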


Expected number of false positives (e-value)

Context

The E-value is the expected number of false positives. This is the simplest and most intuitive correction for multiple testing: given a threshold \(p\) on the nominal p-value and the number \(m\) of tests performed, the expected number of false positives is simply \(E = p \cdot m\).
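A minimal sketch of this computation, using the same hypothetical result-table name “welch.denboer” as above:

```r
m <- nrow(welch.denboer)                       # number of tests performed
welch.denboer$e.value <- welch.denboer$P.value * m   # E-value per probeset
```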

Exercise

  1. The result table of the function t.test.multi() contains a column indicating the p-value. For the 4 datasets (Den Boer + the three negative controls), use the column “P.value” of the t.test.multi() result to compute the E-value (expected number of false positives) associated with each probeset. Compare it to the E-value indicated in the t.test.multi() result.
  2. Count the number of probesets declared “positive” with a threshold of 1% on E-value.

Family-Wise Error Rate (FWER)

Context

The Family-Wise Error Rate (FWER) is the probability of obtaining at least one false positive by chance, \(P(FP \geq 1)\), for a given threshold on the p-value and taking into account the number of tests performed. Assuming the \(m\) tests are independent, it can be computed as \(FWER = 1 - (1 - p)^m\).
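A minimal sketch of this computation under the independence assumption, using the same hypothetical names as above:

```r
m <- nrow(welch.denboer)     # number of tests performed
## Probability of at least one false positive among m independent tests
welch.denboer$fwer <- 1 - (1 - welch.denboer$P.value)^m
```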

Exercise

  1. For each one of the datasets analyzed above, add to the t.test.multi() result table a column indicating the FWER.
    1. Count the number of probesets declared “positive” with a threshold of 1% on the FWER. Compare it with the numbers of positives obtained when the control was exerted at the level of the nominal p-value and of the E-value, respectively.
  2. Draw a plot comparing the E-value and the FWER to the nominal p-value, for all the probesets of the Den Boer dataset (see the sketch below).
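A minimal sketch of such a plot, assuming the columns “e.value” and “fwer” have been added to the hypothetical result table “welch.denboer” as above; logarithmic scales make the comparison readable over several orders of magnitude:

```r
plot(welch.denboer$P.value, welch.denboer$e.value,
     log = "xy", col = "blue", pch = 1, cex = 0.5,
     xlab = "Nominal p-value", ylab = "Corrected statistics")
points(welch.denboer$P.value, welch.denboer$fwer,
       col = "red", pch = 2, cex = 0.5)
abline(h = 0.01, lty = "dashed")   # 1% significance threshold
legend("topleft", legend = c("E-value", "FWER"),
       col = c("blue", "red"), pch = c(1, 2))
```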

Controlling the False Discovery Rate (FDR) with the q-value

Context

The q-value estimates the False Discovery Rate (FDR), i.e. the expected proportion of false positives among the cases declared positive. This makes an essential distinction with the False Positive Rate (FPR), which indicates the proportion of false positives among all the tests performed.

The FDR can be estimated in the following way (adapted from Benjamini and Hochberg, 1995):

  1. Sort features by increasing p-value (the most significant genes appear in the top of the sorted table).
  2. For each rank \(i\) of the sorted list, compute \(q(i) = \pi_0 \cdot p_{(i)} \cdot m / i\), where \(p_{(i)}\) is the \(i\)-th smallest p-value and \(m\) the total number of tests.
  3. Choose a given significance level \(q^{*}\) (in our case, let us choose \(q^{*} = 1\%\)). Let \(k\) be the largest \(i\) for which \(q(i) \leq q^{*}\). Reject all hypotheses \(H_{(i)}\) with \(i = 1, 2, \ldots, k\).
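A minimal sketch of this procedure, assuming the Welch p-values are in a hypothetical vector named “p.values” and that \(\pi_0\) has been estimated as in the sketch above:

```r
m <- length(p.values)
p.sorted <- sort(p.values)      # step 1: sort by increasing p-value
i <- 1:m                        # rank of each sorted p-value
q <- pi0 * p.sorted * m / i     # step 2: q(i)

## Enforce monotonicity, so that q never decreases as the p-value increases
q <- rev(cummin(rev(q)))

## Step 3: number of hypotheses rejected at q* = 1%
sum(q <= 0.01)
```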

Exercise

  1. Compute the q-value for the DenBoer dataset and for the three control tests described above.
  2. Generate a plot comparing the nominal p-value (abscissa) with the different corrections (E-value, FWER, q-value).
  3. Count the number of probesets declared significant at a level of 1%, and compare it to the numbers of positives detected above, when the control was performed at the level of the p-value, E-value or FWER.

Bibliographic references

  1. Den Boer ML, van Slegtenhorst M, De Menezes RX, Cheok MH, Buijs-Gladdines JG, Peters ST, Van Zutven LJ, Beverloo HB, Van der Spek PJ, Escherich G et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

  2. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 57(1): 289-300.

  3. Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100(16): 9440-9445. PMID 12883005.