--- title: "RNA-Seq - differential expression using DESeq2" author: 'D. Puthier (adapted From Hugo Varet, Julie Auberta and J. van Helden) ' date: 'First version: 2016-12-10; Last update: `r Sys.Date()`' output: html_document: fig_caption: yes highlight: zenburn self_contained: yes theme: cerulean toc: yes toc_depth: 3 toc_float: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 4 fig_width: 6 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes ioslides_presentation: colortheme: dolphin fig_caption: yes fig_height: 4 fig_width: 6 fonttheme: structurebold highlight: tango incremental: no keep_md: no slide_level: 2 smaller: yes theme: cerulean toc: yes widescreen: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 slidy_presentation: fig_caption: yes fig_height: 4 fig_width: 6 highlight: tango incremental: no keep_md: no self_contained: yes slide_level: 2 smaller: yes theme: cerulean toc: yes widescreen: yes font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond transition: linear --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = TRUE, cache = TRUE, message = FALSE, warning = FALSE, comment = "") ## Options to display pretty numbers library(knitr) knit_hooks$set(inline = function(x) { prettyNum(x, big.mark=" ") }) options(scipen = 6, digits = 3) knitr::asis_output("\\footnotesize") # Load gProfileR (install first if not done yet) if (!require(gProfileR)) { install.packages("gProfileR") } library(gProfileR) ``` **************************************************************** # The Snf2 dataset The RNA-Seq dataset we will use in this practical has been produced by Gierliński *et al* ([@pmid26206307, @pmid27022035]). The dataset is composed of 48 WT yeast samples vs 48 Snf2 knock-out mutant cell line. The prepared RNA-Seq libraries (unstranded) were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000 resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing 7126 rows/genes. The primary objective of this study was to check whether the observed gene read counts distribution where consistent with theorical models (e.g. negative binomial). More information can be obtained in the original paper ([pdf](http://bioinformatics.oxfordjournals.org/content/early/2015/08/17/bioinformatics.btv425.full.pdf)) # Loading the dataset **R** enables to download data directly from the Web. The expression matrix and phenotypic information will be loaded into **R** using the **read.table** function. Both table will be converted into a data.frame object when loaded into R. The 'count.table' object will contains counts for each gene (row) and each sample (column). ```{r} # Download data files from the Web site (only if not done yet) url.counts <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/" ## Local paths: create a directory to store the results dir.snf2 <- ("~/ASG/practicals/rnaseq_snf2_Schurch_2015") dir.counts <- file.path(dir.snf2, "data") file.counts <- file.path(dir.counts, "counts.txt") file.expDesign <- file.path(dir.counts, "expDesign.txt") ## Create a directory to download the dataset if it does not exist dir.create(dir.counts, showWarnings = FALSE, recursive = TRUE) ## Download the data files if required if (!file.exists(file.counts)) { message("Downloading count table from ", url.counts) download.file(url=file.path(url.counts, "counts.txt"), destfile = file.counts) } if (!file.exists(file.expDesign)) { message("Downloading design table from ", url.counts) download.file(url=file.path(url.counts, "expDesign.txt"), destfile = file.expDesign) } # Load the count table count.table <- read.table(file=file.counts, sep="\t", header=TRUE, row.names=1) # View(count.table) ``` # Phenotypic data The dataset contains RNA-Seq count data for a wild type strain (**WT**) and a **Snf2** mutant, with 48 biological replicates for each genotype. All phenotypic informations are enclosed in a dedicated file. Note that the produced data.frame encodes the 'strains' columns as a factor^[A factor is a vector with levels (categories), which permits an efficient storage and indexing, but can in some cases lead to misleading effects. To circumvent this, we will sometimes have to convert the factor to a vector, with the R command `as.vector()`. ]. ```{r experimental_design} # Load experimental design file expDesign <- read.table(file=file.expDesign, sep="\t", header=TRUE) #View(expDesign) # Check the first and last line of the phenotypic data head(expDesign) tail(expDesign) ## Count the number of sample in each class table(expDesign$strain) ## Define a strain-specific color for each sample, ## and add it as a supplementary column to the phenotypic data col.strain <- c(WT="green", Snf="orange") # Choose one color per strain expDesign$color <- col.strain[as.vector(expDesign$strain)] ```
- Draw a barplot showing the number of reads in each sample. Use either the `colSums()` or the `apply()` function (run `help(colSums()` if you don't know this function). - What can you say from this diagram?
View solution| Hide solution # Descriptive statistics ## Basic statistics Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts. ```{r stats_per_sample} ## Dimensions nb.samples <- ncol(count.table) print(nb.samples) nb_genes <- nrow(count.table) print(nb_genes) dim(count.table) ## Min, Max, median (...). ## Here on the first 4 samples head(summary(count.table[,1:4])) ## A magic trick to convert column-wise summaries into a data.frame. ## The do.call() function produces a data frame with one col per sample, ## we transpose it to obtain one row per sample and one column per statistics. stats.per.sample <- data.frame(t(do.call(cbind, lapply(count.table, summary)))) head(stats.per.sample) ## We can now add some columns to the stats per sample stats.per.sample$libsum <- apply(count.table, 2, sum) ## libsum # Add some percentiles stats.per.sample$perc05 <- apply(count.table, 2, quantile, 0.05) stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10) stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90) stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95) stats.per.sample$zeros <- apply(count.table==0, 2, sum) stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table) # View(stats.per.sample) kable(stats.per.sample[sample(1:ncol(count.table), size = 10),], caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ") ``` ## Distributions ### Histograms of counts per gene The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values. ```{r data_distrib_histogram, fig.path="figures/schurch2016_", fig.width=7, fig.height=9, fig.cap="Histogram of counts per genes. **Top: raw counts. ** the scale is determined by the gene with the highest count, which is apparently an outlier. **Middle: ** raw counts, with X axis truncated to 2000 in order to display a representative range despite outliers. **Bottom: ** log2-transformed counts (bottom) per gene, with a pseudocount of 1 to avoid minus infinitevalues resulting from zero counts. "} par(mfrow=c(3,1)) hist(as.matrix(count.table), col="blue", border="white", breaks=100) hist(as.matrix(count.table), col="blue", border="white", breaks=20000, xlim=c(0,2000), main="Counts per gene", xlab="Counts (truncated axis)", ylab="Number of genes", las=1, cex.axis=0.7) epsilon <- 1 # pseudo-count to avoid problems with log(0) hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white", main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes", las=1, cex.axis=0.7) par(mfrow=c(1,1)) ``` ### Interpretation - The top histogram is not very informative so far, apparently due to the presence of a few very high count values, that impose a very large scale on the $X$ axis. - The middle histogram shows the representative range. Note the height of the first bin, which includes the zero counts. - The logarithmic transformation (bottom histogram) improves the readability. Note that we added a pseudo-count of `r epsilon` to avoid problems with the log transformation of zero counts (which gives $-\infty$). ### Boxplots of gene count distributions per sample To get better insights into the distribution per sample, *boxplots* offer a good perspective. ```{r log2_counts_boxplots, fig.path="figures/schurch2016_", fig.width=7, fig.height=10, fig.cap="Box plots of non-normalized log2(counts) per sample. "} ## Boxplots boxplot(log2(count.table + epsilon), col=expDesign$color, pch=".", horizontal=TRUE, cex.axis=0.5, las=1, ylab="Samples", xlab="log2(Counts +1)") ``` ### Density plots Another way to get an intuition of the value distributions is to use the *geom_density()* function (ggplot2 library), which draws one density curve for each sample. ```{r log2_counts_densities_per_sample, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Densities of log2(counts). Each curve corresponds to one sample. "} ## Density ## We will require some functions from the reshape2 and ggplot2 packages if(!require("reshape2")){ install.packages("reshape2") } if(!require("ggplot2")){ install.packages("ggplot2") } library(ggplot2) library(reshape2) count_melt <- reshape2::melt(log2(count.table + epsilon)) head(count_melt) ggplot(data=count_melt, mapping=aes(x=value, color=variable)) + geom_density() ``` **Beware**: the R function *geom_density()* does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading. ## Scatter plots ```{r scatter_plot_some_samples, fig.path="figures/schurch2016_", fig.width=8, fig.height=8, fig.cap="**Scatter plot of log2-counts for a random selection of samples. **"} nb.pairs <- 6 ## Define a function to draw a scatter plot for a pair of variables (samples) with density colors plotFun <- function(x,y){ dns <- densCols(x,y); points(x,y, col=dns, pch=".", panel.first=grid()); # abline(a=0, b=1, col="brown") } ## Plot the scatter plot for a few pairs of variables selected at random set.seed(123) # forces the random number generator to produce fixed results. Should generally not be used, except for the sake of demonstration with a particular selection. pairs(log2(count.table[,sample(ncol(count.table), nb.pairs)] + epsilon), panel=plotFun, lower.panel = NULL) ``` Let's have a look at the scatter plots using the *pairs()* function. We will only represent `r nb.pairs` randomly selected samples. The command *pairs()* draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along te diagonal, with a relatively wider dispersion at the bottom, corresponding to small number fluctuations. In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) discarding from the diagonal. # Eliminating undetected genes All genes from genome the *S. cerevisiae* where quantified. However only a fraction of them where expressed and some of them where to weakly expressed to be detected in any of the sample. As a result the count table may contain rows with only zero values (null counts).
- What is the percentage of gene having null counts per sample. Draw a barplot. - Some genes were not detected in any of the sample. Count their number, and delete them from the **count.table** data.frame.
View solution| Hide solution