Solution

```{r null_counts_per_sample, fig.path="figures/schurch2016_", fig.width=6, fig.height=8, fig.cap="**Percentage of null counts per sample.**"}
prop.null <- apply(count.table, 2, function(x) 100*mean(x==0))
print(head(prop.null))
barplot(prop.null, main="Percentage of null counts per sample",
        horiz=TRUE, cex.names=0.5, las=1,
        col=expDesign$color, ylab='Samples', xlab='% of null counts')

## Some genes were not detected at all in these samples. We will discard them.
count.table <- count.table[rowSums(count.table) > 0,]
```

# Selecting random samples

One of the questions that will drive the analysis is the impact of the number of biological samples on the results. The original study contained 48 replicates per genotype. What happens if we select a smaller number?

Each attendee of this course selects a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45...) and adapts the code below to run the analysis with that number of replicates per genotype. At the end, we will compare the results (number of genes, significance, ...).

```{r}
nb.replicates <- 10 ## Each attendee chooses a number (3,4,5,10,15 or 20)

## Random sampling of the WT replicates (columns 1 to 48)
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)

## Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)

selected.samples <- c(samples.WT, samples.Snf2)

# Don't forget to update colors
col.pheno.selected <- expDesign$color[selected.samples]
```

# Differential analysis with DESeq2

In this section we will search for genes whose expression is affected by the genetic invalidation. You will first need to install the **DESeq2** Bioconductor library, then load it.
```{r require_DESeq2}
## Install the library if needed then load it
if (!require("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
  BiocManager::install()
}

if(!require("lazyeval")){
  install.packages("lazyeval")
}

if(!require("DESeq2")){
  BiocManager::install("DESeq2")
}
library("DESeq2")
```

## Creating a DESeqDataSet dataset

We will then create a **DESeqDataSet** using the **DESeqDataSetFromMatrix()** function. Get some help about the **DESeqDataSet** class and have a look at some important methods: **counts**, **colData**, **design**, **estimateSizeFactors**, **sizeFactors**, **estimateDispersions** and **nbinomWaldTest**.

```{r create DESeqDataSet object}
## Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
dds0 <- DESeqDataSetFromMatrix(countData = count.table[, selected.samples],
                               colData = expDesign[selected.samples, ],
                               design = ~ strain)
print(dds0)

## What kind of object is it ?
is(dds0)
isS4(dds0)

## What does it contain ?
# The list of slot names
slotNames(dds0)

## Get some help about the "DESeqDataSet" class.
## NOT RUN
#?"DESeqDataSet-class"
```

## Normalization

The normalization procedure (RLE) is implemented through the **estimateSizeFactors** function.

### How is the scaling factor computed ?

Given a matrix with $p$ columns (samples) and $n$ rows (genes), this function estimates the size factors as follows. Each count is divided by the **geometric mean** of its row (gene). For each sample, the **median** (or, if requested, another location estimator) **of these ratios** (skipping the genes with a geometric mean of zero) is used as the size factor for this column.
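As a minimal worked example of this median-of-ratios procedure, the chunk below applies it by hand to a toy matrix. The values and the object names (`toy`, `geo.means`, `ratios`, `size.factors`, `toy.norm`) are hypothetical and chosen only for illustration: sample B is simply sample A sequenced twice as deep, so the two size factors should differ by a factor of 2.

```{r}
## Toy count matrix (hypothetical values): 3 genes x 2 samples,
## sample B has exactly twice the counts of sample A
toy <- matrix(c( 10,  20,
                100, 200,
                  5,  10),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("A", "B")))

## Geometric mean of each gene (row), computed on the log scale
geo.means <- exp(rowMeans(log(toy)))

## Ratio of each count to the geometric mean of its gene
ratios <- sweep(toy, 1, geo.means, FUN = "/")

## Size factor of a sample = median of its ratios over genes
size.factors <- apply(ratios, 2, median)
print(size.factors) # A = 1/sqrt(2) (about 0.707), B = sqrt(2) (about 1.414)

## Dividing each column by its size factor makes the two samples comparable
toy.norm <- sweep(toy, 2, size.factors, FUN = "/")
print(toy.norm)
```

After normalization both columns are identical, which is what we expect when the only difference between two samples is sequencing depth.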
The scaling factor for sample $j$ is thus obtained as:

$$sf_{j} = \underset{g}{median}\left(\frac{K_{g,j}}{(\prod_{i=1}^p K_{g,i})^{1/p}}\right)$$

where $K_{g,j}$ is the count of gene $g$ in sample $j$, and the median is taken over the genes.

```{r}
### Let's implement such a function
### cds is a DESeqDataSet
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)

    # Compute the geometric mean on the log scale.
    # (prod(x)^(1/length(x)) overflows to Inf with many samples and high counts.)
    # A row containing a zero still gets a geometric mean of 0.
    geomMean <- function(x) exp(mean(log(x)))

    # Compute the geometric mean over the line
    gm.mean <- apply(cts, 1, geomMean)

    # Zero values are set to NA (avoid subsequent division by 0)
    gm.mean[gm.mean == 0] <- NA

    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")

    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)

    # Return the scaling factor
    return(med)
}
```

Now, check that the results obtained with our function are the same as those produced by DESeq2. The method associated with normalization for the "DESeqDataSet" class is **estimateSizeFactors()**.

```{r before_vs_after_normalisation, fig.path="figures/schurch2016_", fig.width=8, fig.height=8, fig.cap="**Impact of the count normalization.**"}
## Normalizing using the method for an object of class "DESeqDataSet"
dds.norm <- estimateSizeFactors(dds0)
sizeFactors(dds.norm)

## Now get the scaling factor with our homemade function
head(estimSf(dds0))
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))

## Checking the normalization
par(mfrow=c(1,2),cex.lab=0.7)

boxplot(log2(counts(dds.norm)+epsilon), col=col.pheno.selected, cex.axis=0.7,
        las=1, xlab="log2(counts)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=col.pheno.selected, cex.axis=0.7,
        las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts")
```

```{r}
## Install patchwork if needed, then load it (it provides the `+` that combines ggplots)
if(!require("patchwork")){
  install.packages("patchwork")
}
library("patchwork")

p1 <- ggplot(data=count_melt, mapping=aes(x=value, color=variable)) +
  geom_density() + theme(legend.position = "none")

count_norm_melt <- melt(log2(counts(dds.norm, normalized=TRUE)+epsilon))
head(count_norm_melt)

p2 <- ggplot(data=count_norm_melt, mapping=aes(x=value, color=Var2)) +
  geom_density() + theme(legend.position = "none")

p1 + p2
```

## Count variance is related to mean

As you can see from the following plot, the relationship between variance and mean is not strictly linear: a linear regression of the variance on the mean gives a poor fit.
```{r warning=FALSE}
## Computing mean and variance
norm.counts <- counts(dds.norm, normalized=TRUE)
mean.counts <- rowMeans(norm.counts)
variance.counts <- apply(norm.counts, 1, var)

## sum(mean.counts==0) # Number of completely undetected genes

norm.counts.stats <- data.frame(
  min=apply(norm.counts, 2, min),
  mean=apply(norm.counts, 2, mean),
  median=apply(norm.counts, 2, median),
  max=apply(norm.counts, 2, max),
  zeros=apply(norm.counts==0, 2, sum),
  percent.zeros=100*apply(norm.counts==0, 2, sum)/nrow(norm.counts),
  perc05=apply(norm.counts, 2, quantile, 0.05),
  perc10=apply(norm.counts, 2, quantile, 0.10),
  perc90=apply(norm.counts, 2, quantile, 0.90),
  perc95=apply(norm.counts, 2, quantile, 0.95)
)

kable(norm.counts.stats)
```

```{r mean_variance_plot, fig.path="figures/schurch2016_", fig.width=7, fig.height=7, fig.cap="**Figure: variance/mean plot.** The brown line highlights $x=y$, which corresponds to the expected relationship between mean and variance for a Poisson distribution."}
## Mean and variance relationship
mean.var.col <- densCols(x=log2(mean.counts), y=log2(variance.counts))
plot(x=log2(mean.counts), y=log2(variance.counts), pch=16, cex=0.5,
     col=mean.var.col, main="Mean-variance relationship",
     xlab="log2(mean of normalized counts) per gene",
     ylab="log2(variance of normalized counts) per gene",
     panel.first = grid())
abline(a=0, b=1, col="brown")
```

## Modeling read counts

Let us imagine that we produce a lot of RNA-Seq experiments from the same samples (technical replicates). For each gene $g$, the measured read counts would be expected to vary only slightly around the expected mean, and would probably be well modeled by a Poisson distribution. However, when working with biological replicates, more variation is intrinsically expected.
Indeed, the measured expression values for each gene are expected to fluctuate more strongly, due to the combination of biological and technical factors: inter-individual variations in gene regulation, sample purity, cell-synchronization issues or responses to the environment (e.g. heat-shock).

The Poisson distribution has only one parameter, its expected mean $\lambda$. The variance of the distribution equals its mean $\lambda$. Thus in most cases, the Poisson distribution is not expected to fit the count distribution in biological replicates very well, since we expect some over-dispersion (greater variability) due to biological noise.

As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression calling rely on an alternative distribution: the *negative binomial* (note that this also holds true for other -Seq approaches, e.g. ChIP-Seq with replicates).

### What is the negative binomial ?

The negative binomial distribution is a discrete distribution that gives the probability of observing $x$ failures before a target number of successes $n$ is obtained. As we will see later, the negative binomial can also be used to model over-dispersed data (in this case, the over-dispersion is relative to the Poisson model).

#### The probability of $x$ failures before $n$ successes

First, given a Bernoulli trial with a probability $p$ of success, the **negative binomial** distribution describes the probability of observing $x$ failures before a target number of successes $n$ is reached. In this case the parameters of the distribution are thus $p$ and $n$ (in the **dnbinom()** function of R, $n$ and $p$ are denoted by the arguments **size** and **prob** respectively).
$$P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x $$

In this formula, $p^n$ denotes the probability to observe $n$ successes, $(1-p)^x$ the probability of $x$ failures, and the binomial coefficient $C^{x}_{x+n-1}$ indicates the number of possible ways to arrange $x$ failures among the $x+n-1$ trials that precede the last one (the problem statement imposes that the last trial is a success).

The negative binomial distribution has expected value $n\frac{q}{p}$ and variance $n\frac{q}{p^2}$, where $q = 1-p$ is the probability of failure. Some examples of using this distribution in R are provided below.

**Particular case**: when $n=1$ the negative binomial corresponds to the **geometric distribution**, which models the probability to observe the first success after $x$ failures: $P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x$.

```{r begbin_distrib, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Negative binomial distribution."}
par(mfrow=c(1,1))

## Some intuition about the negative binomial parametrized using n and p.
## The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1   # target for number of successful trials

# The density function
plot(0:10, dnbinom(0:10, n, p), type="h", col="blue", lwd=4)

# The probability of zero failure before one success,
# i.e. the probability of success
dnbinom(0, n , p)

## The probability of at most 5 failures before one success
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)

## The probability of at most 10 failures before one success
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)

## The probability to have more than 10 failures before one success
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)

## With two successes
## The probability of x failures before two successes (e.g. two sixes)
n <- 2
plot(0:30, dnbinom(0:30, n, p), type="h", col="blue", lwd=2,
     main="Negative binomial density",
     ylab="P(x; n,p)",
     xlab=paste("x = number of failures before", n, "successes"))

# Expected value
q <- 1-p
(ev <- n*q/p)
abline(v=ev, col="darkgreen", lwd=2)

# Variance
(v <- n*q/p^2)
arrows(x0=ev-sqrt(v), y0 = 0.04, x1=ev+sqrt(v), y1=0.04,
       col="brown", lwd=2, code=3, length=0.2, angle=20)
```

#### Using mean and dispersion

The second way of parametrizing the distribution is to use the mean value $m$ and the dispersion parameter $r$ (in the **dnbinom()** function of R, $m$ and $r$ are denoted by the arguments **mu** and **size** respectively). The variance of the distribution can then be computed as $m + m^2/r$.

Note that $m$ can be deduced from $n$ and $p$: $r$ plays the role of $n$, and $m = n\frac{q}{p}$.

```{r}
n <- 10
p <- 1/6
q <- 1-p
mu <- n*q/p

## The two parametrizations give the same densities.
## all.equal() is used rather than == because the two code paths
## can differ by floating-point rounding.
isTRUE(all.equal(dnbinom(0:100, mu=mu, size=n),
                 dnbinom(0:100, size=n, prob=p)))
```

### Modelling read counts through a negative binomial

To perform differential expression calling, DESeq2 assumes that, for each gene, the read counts are generated by a negative binomial distribution. One problem here is to estimate, for each gene, the two parameters of the negative binomial distribution: mean and dispersion.

* The mean will be estimated from the observed normalized counts in both conditions.

* For the dispersion, the first step will be to compute a gene-wise estimate. When the number of available samples is insufficient to obtain a reliable estimator of the variance for each gene, DESeq2 will apply a **shrinkage** strategy, which assumes that counts produced by genes with similar expression level (counts) have similar variance (note that this is a strong assumption). DESeq2 will regress the gene-wise dispersion onto the means of the normalized counts to obtain an estimate of the dispersion that will be subsequently used to build the negative binomial model for each gene.
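Before estimating dispersions on the real data, a small simulation can make the role of the dispersion parameter concrete. The chunk below is only a sketch on simulated values: the mean `m = 100` and the size `r = 5` are arbitrary choices, not estimates from the Snf2 dataset. It compares the empirical variance of negative binomial draws with the theoretical value $m + m^2/r$, and with Poisson draws of the same mean.

```{r}
set.seed(42)  # for reproducibility of the simulation

m <- 100      # mean (arbitrary illustrative value)
r <- 5        # "size" parameter of dnbinom()/rnbinom() (arbitrary illustrative value)

## Simulate counts with the same mean under both models
x.nb   <- rnbinom(1e5, mu = m, size = r)
x.pois <- rpois(1e5, lambda = m)

## Negative binomial: the variance is close to m + m^2/r = 2100
c(mean = mean(x.nb), var = var(x.nb), expected.var = m + m^2/r)

## Poisson: the variance is close to the mean (100)
c(mean = mean(x.pois), var = var(x.pois), expected.var = m)
```

The smaller the size $r$, the larger the excess of variance over the Poisson expectation. DESeq2 reports dispersion as $\alpha = 1/r$, so that the variance reads $m + \alpha m^2$.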
```{r estimate_dispersion}
## Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)

## A diagnostic plot which
## shows the mean of normalized counts (x axis)
## and dispersion estimate for each gene
plotDispEsts(dds.disp)
```

-------------------------------------------------------

## Performing differential expression call

Now that a negative binomial model has been fitted for each gene, the **nbinomWaldTest** can be used to test for differential expression. The **results()** function then returns a `DESeqResults` object (a kind of `DataFrame`), which contains nominal p-values as well as FDR values (correction for multiple tests computed with the Benjamini–Hochberg procedure).

```{r DESeq2_Pvalue_histogram, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Histogram of the p-values reported by DESeq2."}
alpha <- 0.0001

wald.test <- nbinomWaldTest(dds.disp)
res.DESeq2 <- results(wald.test, alpha=alpha, pAdjustMethod="BH")

## What is the object returned by results()
class(res.DESeq2)
is(res.DESeq2) # a DESeqResults object, which extends DataFrame
slotNames(res.DESeq2)
head(res.DESeq2)

## The column names of the result table
## Note the column padj
## contains FDR values (computed with the Benjamini–Hochberg procedure)
colnames(res.DESeq2)

## Order the table by increasing adjusted p-value
res.DESeq2 <- res.DESeq2[order(res.DESeq2$padj),]
head(res.DESeq2)

## Draw a histogram of the nominal p-values
hist(res.DESeq2$pvalue, breaks=20, col="grey",
     main="DESeq2 p-value distribution",
     xlab="DESeq2 P-value", ylab="Number of genes")
```

## Volcano plot

```{r DESeq2_volcano_plot, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Volcano plot of DESeq2 results. Abscissa: log2(fold-change). Ordinate: significance ($-log_{10}(adjusted\ P-value)$)."}
alpha <- 0.01 # Threshold on the adjusted p-value

## Density colors computed on the same coordinates as the plot
cols <- densCols(res.DESeq2$log2FoldChange, -log10(res.DESeq2$padj))
plot(res.DESeq2$log2FoldChange, -log10(res.DESeq2$padj), col=cols, panel.first=grid(),
     main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
     pch=20, cex=0.6)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")

## Label the genes with a strong and significant effect.
## which() drops the genes whose adjusted p-value is NA.
gn.selected <- which(abs(res.DESeq2$log2FoldChange) > 2 & res.DESeq2$padj < alpha)
text(res.DESeq2$log2FoldChange[gn.selected],
     -log10(res.DESeq2$padj)[gn.selected],
     labels=rownames(res.DESeq2)[gn.selected], cex=0.4)
```

## Check the expression levels of the most differentially expressed gene

It may be important to check the validity of our analysis by simply assessing the expression level of the most significantly differential gene.

```{r selected_gene_barplot, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Barplot of the counts per sample for a selected gene."}
gn.most.sign <- rownames(res.DESeq2)[1]
gn.most.diff.val <- counts(dds.norm, normalized=TRUE)[gn.most.sign,]
barplot(gn.most.diff.val, col=col.pheno.selected, main=gn.most.sign, las=2, cex.names=0.5)
```

## Looking at the results with a MA plot

One popular diagram in DNA chip analysis is the M versus A plot (MA plot) between two conditions $a$ and $b$. In this representation:

* M (Minus) is the log ratio of counts calculated for any gene.

$$M_g = log_2(\bar{x}_{g,a}) - log_2(\bar{x}_{g,b})$$

* A (Add) is the average log counts, which corresponds to an estimate of the gene expression level.

$$A_g = \frac{1}{2}(log_2(\bar{x}_{g,a}) + log_2(\bar{x}_{g,b}))$$

```{r DESEeq2_MA_plot, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="MA plot. The abscissa indicates the mean of normalized counts; the ordinate the log2(fold-change)."}
## Draw a MA plot.
## Genes with adjusted p-values below alpha (1%) are highlighted in red.
## alpha is passed explicitly: otherwise plotMA() reuses the threshold given to results().
plotMA(res.DESeq2, alpha=alpha, colNonSig = "blue", colSig = "red")
abline(h=c(-1:1), col="red")
```

****************************************************************

## Hierarchical clustering

To ensure that the selected genes distinguish well between the WT and Snf2 conditions, we will perform a hierarchical clustering using the **`heatmap.2()`** function from the gplots library.

```{r signif_genes_count_heatmap, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Heatmap of the genes declared significant with DESeq2. Rows correspond to genes, columns to samples."}
## We select gene names based on FDR (1%)
gene.kept <- rownames(res.DESeq2)[res.DESeq2$padj <= alpha & !is.na(res.DESeq2$padj)]

## We retrieve the log2-transformed counts (all samples) for the genes of interest
count.table.kept <- log2(count.table + epsilon)[gene.kept, ]
dim(count.table.kept)

## Install the gplots library if needed then load it
if(!require("gplots")){
  install.packages("gplots")
}
library("gplots")

## Perform the hierarchical clustering with
## a distance based on the Pearson correlation coefficient
## and average linkage clustering as agglomeration criterion
heatmap.2(as.matrix(count.table.kept),
          scale="row",
          hclustfun=function(x) hclust(x, method="average"),
          distfun=function(x) as.dist((1-cor(t(x)))/2),
          trace="none",
          density.info="none",
          labRow="",
          cexCol=0.7)
```

## Functional enrichment

We will now perform functional enrichment using the list of induced genes. This step will be performed using the gProfileR R library.
```{r functional_enrichment_gProfileR}
library(gProfileR)

res.DESeq2.df <- na.omit(data.frame(res.DESeq2))
induced.sign <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange >= 2 & res.DESeq2.df$padj < alpha]
# head(induced.sign)
# names(term.induced)

term.induced <- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
# term.induced$p.value

## format.args must be a list of arguments for format()
kable(term.induced[1:10,c("term.name",
                          "term.size",
                          "query.size",
                          "overlap.size",
                          "recall",
                          "precision",
                          "p.value",
                          "intersection")],
      format.args=list(scientific=TRUE, digits=3),
      caption="**Table: functional analysis with gProfileR.**")
```

And now using the list of repressed genes.

```{r}
res.DESeq2.df <- na.omit(data.frame(res.DESeq2))
repressed.sign <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange <= -2 & res.DESeq2.df$padj < alpha]
head(repressed.sign)

term.repressed <- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]

## Show the terms enriched among the repressed genes
kable(head(term.repressed[,c("p.value", "term.name","intersection")], 10))
```

## Assess the effect of sample number on differential expression call

Using a loop, randomly select 10 times 2, 5, 10, 15, ..., 45 samples from WT and Snf2 KO. Perform differential expression calls and draw a diagram showing the number of differentially expressed genes.

```{r save_results}
## Create a directory to store the results that will be obtained below
dir.results <- file.path(dir.snf2, "results")
dir.create(dir.results, showWarnings = FALSE, recursive = TRUE)

## Export the table with statistics per sample.
write.table(stats.per.sample, file=file.path(dir.results, "stats_per_sample.tsv"),
            quote=FALSE, sep="\t", col.names=NA, row.names=TRUE)

# Export the DESeq2 result table
DESeq2.table <- file.path(dir.results, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
write.table(as.data.frame(res.DESeq2), file=DESeq2.table,
            col.names=NA, row.names=TRUE, sep="\t", quote=FALSE)
```
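For the exercise on the effect of sample number stated above, one possible starting point is sketched below. It is not the reference solution of the course. The helper names `sample.replicates`, `count.signif.genes` and `sweep.replicates` are hypothetical, and the sketch assumes, as in the sampling chunk, that columns 1 to 48 of `count.table` are WT, that columns 49 to 96 are Snf2, and that `expDesign` has a `strain` column. It uses the `DESeq()` wrapper, which chains size factor estimation, dispersion estimation and the Wald test.

```{r}
## Hypothetical helper: draw n WT columns and n Snf2 columns at random
sample.replicates <- function(n, wt.cols = 1:48, ko.cols = 49:96) {
  c(sample(wt.cols, size = n), sample(ko.cols, size = n))
}

## Hypothetical helper: run DESeq2 on one random draw of n replicates per
## genotype and return the number of genes with adjusted p-value < alpha
count.signif.genes <- function(counts, design, n, alpha = 0.01) {
  selected <- sample.replicates(n)
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts[, selected],
                                        colData = design[selected, ],
                                        design = ~ strain)
  dds <- DESeq2::DESeq(dds, quiet = TRUE)
  res <- DESeq2::results(dds, alpha = alpha)
  sum(res$padj < alpha, na.rm = TRUE)
}

## Hypothetical helper: repeat the draw nb.repeats times for each number of
## replicates; returns a matrix with one column per number of replicates
sweep.replicates <- function(counts, design,
                             replicate.numbers = c(2, seq(5, 45, by = 5)),
                             nb.repeats = 10, alpha = 0.01) {
  nb.signif <- sapply(replicate.numbers, function(n) {
    replicate(nb.repeats, count.signif.genes(counts, design, n, alpha))
  })
  ## Force a matrix layout, even when nb.repeats is 1
  matrix(nb.signif, nrow = nb.repeats,
         dimnames = list(NULL, replicate.numbers))
}

## Usage with the objects of this tutorial (NOT RUN: about 100 DESeq2 analyses)
# nb.signif <- sweep.replicates(count.table, expDesign)
# boxplot(nb.signif, xlab = "Replicates per genotype",
#         ylab = "Number of genes with padj < alpha")
```

The calls are left commented out because the full sweep is slow. It may be worth testing first with a reduced `replicate.numbers` and a small `nb.repeats`.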