## Introduction

Most analyses in current bioinformatics consist of performing thousands, millions or even billions of statistical tests in parallel. This applies to microarray or RNA-Seq analysis (e.g. detecting differentially expressed genes), motif detection (e.g. discovering motifs in promoters of co-expressed genes), sequence similarity searches (comparing a query sequence against the millions of peptide sequences currently available in UniProt), genome-wide association studies (GWAS), and many other applications. A standard BLAST similarity search against the whole UniProt database amounts to evaluating billions of possible alignments.

In a previous practical (Selecting differentially expressed genes), we applied Welch’s $$t$$ test to select differentially expressed genes from a microarray series containing 190 ALL samples. In doing so, we successively tested, for each of 22,283 probesets, the equality of the mean expression values between two classes of ALL. For each probeset, we computed a nominal p-value, which indicates the probability of obtaining by chance a difference at least as large as the one observed in the data. This p-value can be interpreted as an estimate of the false positive risk (FPR): the probability of considering an observation as significant when it is not. However, we did not take into account an important factor: since the same test was successively applied to 22,283 probesets, the risk of a false positive was incurred once for each probeset. This situation is classically denoted as multiple testing. For example, if we accept an individual risk of 1%, we expect to observe $$1\% \cdot 22{,}283 \approx 223$$ false positives when the same risk is taken for each probeset of the microarray series.
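The arithmetic above can be checked with a few lines of R. This is only a minimal sketch: the variable names are illustrative, and the simulation assumes that p-values are uniformly distributed between 0 and 1 when the null hypothesis is true, which holds for a test with a continuous statistic but is not verified here on the actual data.

```r
## Expected number of false positives when the same test is repeated
alpha <- 0.01      # nominal risk accepted for each individual test
n.tests <- 22283   # number of probesets tested

expected.fp <- alpha * n.tests
print(expected.fp) # 222.83, i.e. about 223

## Simulation under the null hypothesis (ASSUMPTION: uniform p-values)
set.seed(123)      # arbitrary seed, only for reproducibility
null.pvalues <- runif(n.tests)

## Number of tests declared significant although no effect exists
observed.fp <- sum(null.pvalues < alpha)
print(observed.fp)
```

The simulated count fluctuates around the expected value from one random draw to the next, which illustrates that several hundred probesets would pass the 1% threshold even if no gene were differentially expressed.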

We will thus need to be more restrictive if we want to control the false positives. Several methods have been proposed to control the risk of false positives in situations of multiple testing. In this practical, we will investigate the practical consequences of multiple testing and explore some of the proposed solutions.

## Generating random control sets

In order to get an intuition of the problems arising from multiple testing, we will generate three datasets where no difference is expected to be found between the mean expression values of two groups of samples.

### Exercise

1. Reload the normalized expression matrix from DenBoer, as described in the practical Selecting differentially expressed genes.

2. Generate an expression matrix of the same size, in a data frame named “rnorm.frame”, and fill it with random values sampled from a normal distribution of mean $$m=0$$ and standard deviation $$sd=1$$.

3. Create a second data frame of the same size, name it “denboer.permuted.values”, and fill it with the actual values of the DenBoer expression matrix, sampled in random order (NB: random sampling can be done with the R function sample()).

4. Create a vector named “denboer.permuted.subtypes” and fill it with a random sampling of the cancer subtypes (these values can be found in pheno\$Sample_title).
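Steps 2 to 4 could be sketched as follows. Since the DenBoer data are not loaded here, the sketch builds a small stand-in matrix and a stand-in phenotype table: the names `expr.matrix`, `subtypeA` and `subtypeB`, as well as the reduced number of rows, are assumptions made for illustration and should be replaced by the objects loaded in step 1.

```r
## Stand-in data (ASSUMPTION): replace by the real DenBoer objects
set.seed(42)          # arbitrary seed, only for reproducibility
n.probesets <- 1000   # hypothetical size; the real matrix has 22,283 rows
n.samples <- 190
expr.matrix <- matrix(rnorm(n.probesets * n.samples, mean = 7, sd = 2),
                      nrow = n.probesets, ncol = n.samples)
pheno <- data.frame(
  Sample_title = sample(c("subtypeA", "subtypeB"), n.samples, replace = TRUE),
  stringsAsFactors = FALSE)

## Step 2: data frame of the same size filled with random normal values
rnorm.frame <- as.data.frame(
  matrix(rnorm(nrow(expr.matrix) * ncol(expr.matrix), mean = 0, sd = 1),
         nrow = nrow(expr.matrix), ncol = ncol(expr.matrix)))

## Step 3: actual expression values, shuffled over all cells of the matrix
denboer.permuted.values <- as.data.frame(
  matrix(sample(as.vector(expr.matrix)),
         nrow = nrow(expr.matrix), ncol = ncol(expr.matrix)))

## Step 4: cancer subtypes in random order (sampling without replacement)
denboer.permuted.subtypes <- sample(pheno$Sample_title)
```

Calling sample() with a single vector argument returns a random permutation of its elements, so the permuted sets keep exactly the same values as the originals and only their assignment to probesets and samples is randomized.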
