The RNA-Seq dataset we will use in this practical was produced by Gierliński et al. (2015) and Schurch et al. (2016).
The dataset is composed of 48 samples of the yeast wild-type (WT) strain and 48 samples of a Snf2 knock-out mutant cell line. The prepared (unstranded) RNA-Seq libraries were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000, resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads were cleaned, mapped and counted to generate a count data matrix containing \(7126\) rows/genes.
In this tutorial, we will start from the count table generated by the authors of the study, and use different R tools to explore it and to detect differentially expressed genes.
R can download data directly from the Web with the download.file function. We then begin with some verification steps.
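For instance, the count table used below could first be saved locally (a sketch; in the rest of the tutorial we will instead read the file directly from its URL):

# Optional: download the count table to a local file before loading it
download.file("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt",
              destfile="counts.txt")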
The expression matrix and the phenotypic information will be loaded into R using the read.table function. Both tables will be converted into data.frame objects when loaded into R.
The countTable object should contain counts for each gene (row) and each sample (column).
The phenoTable object should contain the genotype designation of each sample, i.e. 48 biological replicates each of the wild-type strain (WT) and of the Snf2 mutant.
Note that character columns of the loaded tables are stored as factor objects.
Reminder: a factor is a vector with levels (categories), which permits efficient storage and indexing, but can in some cases lead to misleading behaviour. To circumvent this, we will sometimes have to convert a factor back to a vector, with the R command as.vector().
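A minimal illustration of this behaviour:

# A factor stores values as integer codes plus a table of levels
f <- factor(c("WT", "Snf2", "WT"))
levels(f)    # the categories
as.vector(f) # back to a plain character vector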
# Load the count table
countTable <- read.table("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt", header=TRUE, row.names=1)

# View part of the data/table
countTable[101:108, 1:5]

            WT1   WT2   WT3   WT4   WT5
yal034w-a    78   116   103   113   113
yal035w    2815  4160  2969  4411  4836
yal036c     331   599   348   525   691
yal037c-a     0    37    23    15    51
yal037c-b    70   183   199   208   196
yal037w      22    69    32    58    51
yal038w   21080 18869 14649 28551 20208
yal039c    1541  1447  1318  2251  1501

# Load the sample description file
# (traditionally named "pheno" for "phenotype", although in our case the conditions are genotypes)
phenoTable <- read.table("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/expDesign.txt", header=TRUE, row.names=1)

# Define a strain-specific color for each sample, and add it as a supplementary column
strainColor <- c("WT"="green", "Snf2"="orange") # Choose one color per strain
print(strainColor)

## Add a column to the pheno table to specify the color of each sample according to its genotype
phenoTable$color[phenoTable$strain == "Snf"] <- strainColor["Snf2"]
phenoTable$color[phenoTable$strain == "WT"] <- strainColor["WT"]

# Check the column names of the pheno table
colnames(phenoTable)

[1] "strain" "color"
Check the dimensions of the data (tables).
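A possible check (the expected number of samples, 96, follows from the design):

# Dimensions of the two tables
dim(countTable) # genes (rows) x samples (columns)
dim(phenoTable) # samples (rows) x descriptors (columns)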
## Quick test with a small table generated on the fly
testTable <- data.frame(
  ABC = sample(c("A", "B", "C", "D", "E"), size=20, replace=TRUE),
  gender = sample(c("boy", "girl", "robot"), size=20, replace=TRUE),
  yesno = sample(c("yes", "no"), size=20, replace=TRUE))
Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.
# Min, Max, median (...).
# Here on the first 4 samples
summary(countTable[,1:4])
Among the reference functions in R, apply (and its derivatives) allows the repeated application of a function to a series of data, at a markedly higher speed than a for loop. Note that, when working on a matrix or data frame, the apply function requires you to indicate whether the function should be applied to the rows (MARGIN=1) or to the columns (MARGIN=2). Some dedicated built-in functions already exist, such as colMeans, colSums, rowMeans and rowSums. Let's use apply to get the summary of every column of the count table. It produces a matrix as output, which we transpose to obtain samples as rows and statistics as columns, and convert back to a data.frame object.
statsPerSample <- data.frame(t(apply(countTable, 2, summary)))
head(statsPerSample)
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
WT1 0 49.0 224 837 561 189000
WT2 0 88.0 371 1110 853 197000
WT3 0 74.2 297 903 670 172000
WT4 0 104.0 430 1460 1020 329000
WT5 0 84.0 353 1120 819 225000
WT6 0 153.0 667 2050 1600 357000
# Redefine column names for readability
names(statsPerSample) <- c("min", "Q1", "median", "mean", "Q3", "max")
kable(head(statsPerSample))
| | min | Q1 | median | mean | Q3 | max |
|---|---|---|---|---|---|---|
| WT1 | 0 | 49.0 | 224 | 837 | 561 | 189000 |
| WT2 | 0 | 88.0 | 371 | 1110 | 853 | 197000 |
| WT3 | 0 | 74.2 | 297 | 903 | 670 | 172000 |
| WT4 | 0 | 104.0 | 430 | 1460 | 1020 | 329000 |
| WT5 | 0 | 84.0 | 353 | 1120 | 819 | 225000 |
| WT6 | 0 | 153.0 | 667 | 2050 | 1600 | 357000 |
We can now add some columns to the stats per sample. For example, here are the library sums and the 5% quantile.
statsPerSample$libsum <- colSums(countTable) # library sums
statsPerSample$perc05 <- apply(countTable, 2, quantile, probs=0.05)
Add some more columns to statsPerSample (one possible solution is sketched below):
- the 10%, 90% and 95% quantiles;
- the number of zero counts per sample;
- the percentage of zero counts per sample.
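One possible solution, following the same pattern as above (a sketch; the column names match those used in the table below):

statsPerSample$perc10 <- apply(countTable, 2, quantile, probs=0.10)
statsPerSample$perc90 <- apply(countTable, 2, quantile, probs=0.90)
statsPerSample$perc95 <- apply(countTable, 2, quantile, probs=0.95)
statsPerSample$zeros <- apply(countTable == 0, 2, sum) # number of genes with zero counts
statsPerSample$percZeros <- 100 * statsPerSample$zeros / nrow(countTable)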
Finally, randomly choose (with the sample() function) 10 samples (rows) of the stats per sample table and look at their statistics.
statsPerSample[sample(1:ncol(countTable), size = 10),]
We show hereafter the statistics per sample (subset only).
| | min | Q1 | median | mean | Q3 | max | libsum | perc05 | perc10 | perc90 | perc95 | zeros | percZeros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Snf26 | 0 | 124.0 | 506 | 1380 | 1140 | 210000 | 9812933 | 0 | 2392 | 2530 | 2526 | 467 | 6.55 |
| Snf2 | 0 | 98.0 | 422 | 1220 | 960 | 200000 | 8694091 | 0 | 1977 | 2683 | 3066 | 552 | 7.75 |
| Snf3 | 0 | 98.0 | 408 | 1230 | 954 | 210000 | 8773138 | 0 | 3316 | 4570 | 5280 | 533 | 7.48 |
| WT8 | 0 | 80.0 | 360 | 1380 | 900 | 343000 | 9839118 | 0 | 1624 | 1738 | 1868 | 576 | 8.08 |
| Snf43 | 0 | 119.0 | 476 | 1340 | 1060 | 219000 | 9568585 | 0 | 2 | 1 | 3 | 482 | 6.76 |
| WT37 | 0 | 75.0 | 344 | 1550 | 901 | 468000 | 11009738 | 0 | 2 | 2 | 2 | 586 | 8.22 |
| WT19 | 0 | 84.0 | 364 | 1310 | 879 | 311000 | 9326815 | 0 | 2 | 1 | 2 | 564 | 7.92 |
| WT39 | 0 | 61.0 | 280 | 1040 | 689 | 247000 | 7408720 | 0 | 4056 | 5098 | 3496 | 627 | 8.80 |
| WT10 | 0 | 80.2 | 360 | 1110 | 866 | 188000 | 7920887 | 0 | 2 | 2 | 2 | 547 | 7.68 |
| Snf11 | 0 | 109.0 | 458 | 1390 | 1050 | 247000 | 9907483 | 0 | 1524 | 2802 | 2181 | 514 | 7.21 |
The summary only displays a few key values (mean, median, quartiles). In order to get a better intuition of the data, we can draw a histogram of all values.
par(mfrow=c(3,1))
hist(as.matrix(countTable), col="blue", border="white", breaks=100)
hist(as.matrix(countTable), col="blue", border="white",
breaks=20000, xlim=c(0,2000), main="Counts per gene",
xlab="Counts (truncated axis)", ylab="Number of genes",
las=1, cex.axis=0.7)
epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(countTable + epsilon)), breaks=100, col="blue", border="white",
main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
las=1, cex.axis=0.7)
par(mfrow=c(1,1))
To get better insights into the distribution per sample, boxplots offer a good perspective.
boxplot(log2(countTable + epsilon), col=phenoTable$color, pch=".",
horizontal=TRUE, cex.axis=0.5,
las=1, ylab="Samples", xlab="log2(counts +1)")
Another way to get an intuition of the value distributions is to use the plotDensity function, which draws one density curve for each sample.
# We will require one function from the affy package
if(!require("affy")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("affy") # on recent Bioconductor versions, use BiocManager::install("affy") instead
}
library(affy)
plotDensity(log2(countTable + epsilon), lty=1, col=phenoTable$color, lwd=2)
grid()
legend("topright", legend=names(strainColor), col=strainColor, lwd=2)
Beware: the R function plotDensity does not display the actual distribution of your values, but a smoothed (kernel density) estimate. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the smoothing can produce extrapolated values which can be misleading.
Let’s have a look at the scatter plots using the pairs function. We will only represent 5 randomly selected samples.
# Define a function to draw a scatter plot for a pair of variables (samples) with density colors
plotFun <- function(x,y){
dns <- densCols(x,y);
points(x,y, col=dns, pch=".", panel.first=grid());
# abline(a=0, b=1, col="brown")
}
# Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results
pairs(log2(countTable[,sample(ncol(countTable),5)] + epsilon),
panel=plotFun, lower.panel = NULL)
The command pairs draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along the diagonal, with a relatively wider dispersion at the bottom, corresponding to fluctuations in small counts.
In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) departing from the diagonal.
All genes from the S. cerevisiae genome were quantified. However, only a fraction of them were expressed, and some were too weakly expressed to be detected in any of the samples. As a result, the count table may contain rows with only zero values (null counts).
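We can count these never-detected genes, and optionally discard them (a sketch):

# Number of genes with zero counts in all samples
sum(rowSums(countTable) == 0)
# Optionally, discard them before further analysis
# countTable <- countTable[rowSums(countTable) > 0, ]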
One of the questions that will drive this analysis is the impact of the number of samples on the results.
The original study contained 48 replicates per genotype; what happens if we select a smaller number?
First, we will analyse the full dataset with 48 samples per genotype.
In the second part of the tutorial, each attendee of this course will select a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45, …) and adapt the code below to run the analysis with that number of replicates per genotype.
At the end, we will compare the results (number of genes, significance, …) between the full dataset analysis and the analyses of sub-samples of varying sizes.
We will already select here the subset of yeast samples for our sub-sampling analysis.
## Select a subset of samples from each genotype for the sub-sampling analysis
nb.replicates <- 10 # Each attendee chooses a number (e.g. 3, 4, 5, 10, 15 or 20)
# Random sampling of the WT replicates (columns 1 to 48)
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)
# Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)
selected.samples <- c(samples.WT, samples.Snf2)
# Don't forget to update colors
col.pheno.selected <- phenoTable$color[selected.samples]
In this section we will search for genes whose expression is affected by the Snf2 knock-out. You will first need to install the DESeq2 Bioconductor library, then load it.
# Install the library if needed then load it
if(!require("DESeq2")){
install.packages("lazyeval")
install.packages("ggplot2")
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
}
library("DESeq2")
We will first analyse together the full dataset. The analysis of subsets of samples will be done as an exercise.
We will then create a DESeqDataSet using the DESeqDataSetFromMatrix function. Get some help about the DESeqDataSet class, and have a look at some important methods: counts, sizeFactors, estimateSizeFactors, estimateDispersions and nbinomWaldTest.
# Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
dds0 <- DESeqDataSetFromMatrix(countData = countTable, colData = phenoTable, design = ~ strain)
## Check the general properties of the DESeq dataset
print(dds0)
class: DESeqDataSet
dim: 6887 96
metadata(1): version
assays(1): counts
rownames(6887): 15s_rrna 21s_rrna ... ty(gua)m2 ty(gua)o
rowData names(0):
colnames(96): WT1 WT2 ... Snf47 Snf48
colData names(2): strain color
# What kind of object is it ?
is(dds0)
[1] "DESeqDataSet" "RangedSummarizedExperiment"
[3] "SummarizedExperiment" "Vector"
[5] "Annotated"
isS4(dds0)
[1] TRUE
# What does it contain ?
# The list of slot names
slotNames(dds0)
[1] "design" "dispersionFunction" "rowRanges"
[4] "colData" "assays" "NAMES"
[7] "elementMetadata" "metadata"
# Get some help about the "DESeqDataSet" class.
# NOT RUN
#?"DESeqDataSet-class"
The normalization procedure (RLE) is implemented through the estimateSizeFactors function.
Note: this section is somewhat technical. If the underlying mathematics seem a bit too complicated do not worry, the next sections will be practice-oriented.
Given a matrix with \(p\) columns (samples) and \(n\) rows (genes), this function estimates the size factors as follows: for each gene \(g\), compute the geometric mean of its counts \(K_{g,1}, \ldots, K_{g,p}\) across all samples, then divide each count by this geometric mean. The scaling factor for sample \(j\) is obtained as the median of these ratios over all genes:
\[sf_{j} = \underset{g}{\text{median}} \left( \frac{K_{g,j}}{\left( \prod_{j'=1}^{p} K_{g,j'} \right)^{1/p}} \right)\]
# Let's implement such a function
# cds is a countDataset
estimSf <- function (cds){
# Get the count matrix
cts <- counts(cds)
# Compute the geometric mean
geomMean <- function(x) prod(x)^(1/length(x))
# Compute the geometric mean of each gene (row) across all samples
gm.mean <- apply(cts, 1, geomMean)
# Set zero values to NA (to avoid a subsequent division by 0)
gm.mean[gm.mean == 0] <- NA
# Divide each row by its corresponding geometric mean
# sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
# MARGIN: 1 or 2 (rows or columns)
# STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
# FUN: the function to be applied
cts <- sweep(cts, 1, gm.mean, FUN="/")
# Compute the median over the columns
med <- apply(cts, 2, median, na.rm=TRUE)
# Return the scaling factor
return(med)
}
Now, check that the results obtained with our function are the same as those produced by DESeq2. The method performing the normalization for the DESeqDataSet class is estimateSizeFactors.
# Normalizing using the method for an object of class "DESeqDataSet"
dds.norm <- estimateSizeFactors(dds0)
sizeFactors(dds.norm)
WT1 WT2 WT3 WT4 WT5 WT6 WT7 WT8 WT9 WT10 WT11 WT12
0.586 0.925 0.751 1.109 0.883 1.687 0.896 0.940 0.889 0.911 0.643 0.836
WT13 WT14 WT15 WT16 WT17 WT18 WT19 WT20 WT21 WT22 WT23 WT24
1.022 1.100 0.782 0.550 0.885 0.878 0.939 0.631 1.053 1.444 0.870 0.910
WT25 WT26 WT27 WT28 WT29 WT30 WT31 WT32 WT33 WT34 WT35 WT36
1.194 0.974 1.049 1.178 0.877 1.258 0.939 0.644 0.977 0.828 0.769 0.781
WT37 WT38 WT39 WT40 WT41 WT42 WT43 WT44 WT45 WT46 WT47 WT48
0.922 0.647 0.714 1.277 1.063 0.803 0.739 0.691 1.166 0.604 1.272 0.854
Snf1 Snf2 Snf3 Snf4 Snf5 Snf6 Snf7 Snf8 Snf9 Snf10 Snf11 Snf12
1.395 1.086 1.078 1.447 0.875 1.374 1.717 0.967 1.270 0.977 1.186 0.970
Snf13 Snf14 Snf15 Snf16 Snf17 Snf18 Snf19 Snf20 Snf21 Snf22 Snf23 Snf24
1.253 1.278 0.906 1.234 1.275 1.118 0.917 1.124 1.199 0.870 0.997 1.674
Snf25 Snf26 Snf27 Snf28 Snf29 Snf30 Snf31 Snf32 Snf33 Snf34 Snf35 Snf36
0.911 1.312 0.822 0.966 1.076 0.937 0.859 1.265 1.559 1.133 0.953 0.974
Snf37 Snf38 Snf39 Snf40 Snf41 Snf42 Snf43 Snf44 Snf45 Snf46 Snf47 Snf48
1.078 0.988 1.958 1.073 1.225 0.850 1.221 1.257 0.989 1.146 1.229 0.976
# Now get the scaling factors with our homemade function
head(estimSf(dds0))
WT1 WT2 WT3 WT4 WT5 WT6
0.543 0.884 0.725 1.062 0.841 1.603
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))
[1] FALSE
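The small discrepancy most likely comes from a numerical limitation of our naive implementation: for genes with large counts, prod() can overflow to Inf when multiplied across 96 samples, which distorts the corresponding ratios. A safer version computes the geometric mean on the log scale (a sketch; geomMeanLog is our own name):

# Numerically stable geometric mean, to plug into estimSf in place of geomMean
geomMeanLog <- function(x) exp(mean(log(x))) # returns 0 (then set to NA) when any count is 0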
# Checking the normalization
par(mfrow=c(2,2), cex.lab=0.7)
# Note: dds.norm contains all 96 samples, so we use the full color vector
boxplot(log2(counts(dds.norm)+epsilon), col=phenoTable$color, cex.axis=0.7,
las=1, xlab="log2(counts+1)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=phenoTable$color, cex.axis=0.7,
las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts")
plotDensity(log2(counts(dds.norm)+epsilon), col=phenoTable$color,
xlab="log2(counts+1)", cex.lab=0.7, panel.first=grid())
plotDensity(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=phenoTable$color,
xlab="log2(normalized counts)", cex.lab=0.7, panel.first=grid())
# Restore default parameters
par(mfrow=c(1,1), cex.lab=1)
Let us imagine that we produced many RNA-Seq experiments from the same samples (technical replicates). For each gene \(g\), the measured read counts would be expected to vary rather slightly around the expected mean, and would probably be well modeled by a Poisson distribution. However, when working with biological replicates, more variation is intrinsically expected. Indeed, the measured expression values for each gene are expected to fluctuate more strongly, due to the combination of biological and technical factors: inter-individual variation in gene regulation, sample purity, cell-synchronization issues or responses to the environment (e.g. heat shock).
The Poisson distribution has a single parameter, its expected mean \(\lambda\), and its variance equals this mean \(\lambda\). Thus, in most cases, the Poisson distribution is not expected to fit the count distribution in biological replicates very well, since we expect some over-dispersion (greater variability) due to biological noise.
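A quick simulation illustrates this difference (a sketch; the parameter values are arbitrary, and the mu/size parametrization of rnbinom is discussed below):

# Poisson: the variance equals the mean
x.pois <- rpois(100000, lambda=50)
mean(x.pois); var(x.pois) # both close to 50
# Negative binomial: the variance exceeds the mean (over-dispersion)
x.nb <- rnbinom(100000, mu=50, size=10)
mean(x.nb); var(x.nb) # mean ~50, variance ~50 + 50^2/10 = 300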
As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression calls rely on an alternative distribution: the negative binomial (note that this holds true also for other -Seq approaches, e.g. ChIP-Seq with replicates).
The negative binomial distribution is a discrete distribution that gives us the probability of observing \(x\) failures before a given number of successes \(n\) is obtained.
The negative binomial has a variance larger than its mean, and is therefore a popular way to model the typical over-dispersion of RNA-seq data (here, over-dispersion relative to the Poisson model).
First, given a Bernoulli trial with a probability \(p\) of success, the negative binomial distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. The parameters of the distribution are thus \(p\) and \(n\) (in the dnbinom function of R, \(n\) and \(p\) correspond to the arguments size and prob, respectively).
\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]
In this formula, \(p^n\) denotes the probability to observe \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to dispose \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes for the last trial to be a success).
The negative binomial distribution has expected value \(n\frac{q}{p}\), where \(q = 1 - p\) is the probability of failure at each trial.
Its variance is \(n\frac{q}{p^2}\), and is thus by construction larger than the mean (except in the trivial case where \(p=1\)). We show below some examples of this distribution.
Particular case: when \(n=1\), the negative binomial corresponds to the geometric distribution, which models the probability distribution of observing the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).
par(mfrow=c(3,2))
# Some intuition about the negative binomial parametrized using n and p.
# The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1 # target for number of successful trials
x <- 1:30
# the probability of zero failure before one success.
# i.e the probability of success
dnbinom(0, n , p)
[1] 0.167
# Probability of at most 5 failures before the first success.
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)
[1] 0.665
# The probability of at most 10 failures before one success
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)
[1] 0.865
# The probability of more than 10 failures before one success
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)
[1] 0.135
################################################################
# Explore the density for increasing target numbers of successes n.
p <- 1/6 # the probability of success
for (n in c(1,2,5, 10, 20, 100)) {
# Expected value
q <- 1-p
(ev <- n*q/p)
# Variance
(v <- n*q/p^2)
# Plot the negative binomial density
X <- 0:max(30, ceiling(2.5*ev))
dX <- dnbinom(X, n, p)
dX.pois <- dpois(X, ev)
maxY <- max(dX, dX.pois)
plot(X, dX, type="h", col="blue", lwd=2,
main=paste("Negative binomial density, n =", n),
ylab="P(x; n,p)", ylim=c(0, maxY*1.3),
xlab=paste("x = number of failures before", n, "successes"))
# Plot the Poisson density
lines(X, dX.pois, col="brown", lwd=1)
# Plot the geometric density if n=1
if (n==1) {
dX.geom <- dgeom(X, prob=p)
lines(X, dX.geom, col="darkviolet", lwd=1)
}
# Plot the expected value
abline(v=ev, col="darkgreen", lwd=2)
# Draw an arrow at +- 1stdev for negative binomial
arrows(x0=ev-sqrt(v), y0=maxY*1.1,
x1=ev+sqrt(v), y1=maxY*1.1,
col="darkblue",lwd=2, code=3, length=0.05, angle=20)
# Draw an arrow at +- 1stdev for the Poisson
arrows(x0=ev-sqrt(ev), y0=maxY*1.2,
x1=ev+sqrt(ev), y1=maxY*1.2,
col="brown",lwd=2, code=3, length=0.05, angle=20)
# Add a legend
legend("topright",
lwd=2,
col=c("blue", "darkgreen", "darkblue", "brown"),
c(paste("negbin, n =", n),
paste("mean =", ev),
paste("negbin sd =", round(digits=2, sqrt(v))),
paste("Poisson sd =", round(digits=2, sqrt(ev)))
), cex=0.7
)
}
par(mfrow=c(1,1))
The negative binomial density (blue bars on the Figure) indicates the probability to observe exactly \(x\) failures before the \(n^{th}\) success, in a Bernoulli schema with probability of success \(p\).
The negative binomial is over-dispersed relative to a Poisson distribution (brown line) with the same expected value (green vertical bar).
The second way of parametrizing the distribution is using the mean value \(m\) and the dispersion parameter \(r\) (in dnbinom function of R, \(m\) and \(r\) are denoted by arguments mu and size respectively). The variance of the distribution can then be computed as \(m + m^2/r\).
Note that \(m\) can be deduced from \(n\) and \(p\): \(m = n\frac{1-p}{p}\).
n <- 10
p <- 1/6
q <- 1-p
mu <- n*q/p
all(dnbinom(0:100, mu=mu, size=n) == dnbinom(0:100, size=n, prob=p))
[1] TRUE
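We can also check the variance formula \(m + m^2/r\) numerically (a quick sketch):

# Empirical vs theoretical variance under the mean/dispersion parametrization
m <- 50; r <- 10
x <- rnbinom(100000, mu=m, size=r)
var(x)      # empirical variance
m + m^2/r   # theoretical variance: 300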
To perform differential expression calls, DESeq2 assumes that, for each gene, the read counts are generated by a negative binomial distribution. The problem is then to estimate, for each gene, the two parameters of this distribution: mean and dispersion.
The mean will be estimated from the observed normalized counts in both conditions.
The first step is to compute a gene-wise dispersion estimate. When the number of available samples is insufficient to obtain a reliable estimate of the variance for each gene, DESeq2 applies a shrinkage strategy, which assumes that genes with similar expression levels (counts) have similar variance (note that this is a strong assumption). DESeq2 regresses the gene-wise dispersion estimates onto the means of the normalized counts to obtain a dispersion estimate that is subsequently used to build the negative binomial model for each gene.
# Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)
# A diagnostic plot which
# shows the mean of normalized counts (x axis)
# and the dispersion estimate for each gene (y axis)
plotDispEsts(dds.disp)
Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest function can be used to test for differential expression. The output is a data.frame-like object (DESeqResults) which contains nominal p-values, as well as adjusted p-values (FDR, computed with the Benjamini-Hochberg procedure).
alpha <- 0.0001
waldTestResult <- nbinomWaldTest(dds.disp)
resultDESeq2 <- results(waldTestResult, alpha=alpha, pAdjustMethod="BH")
# What is the object returned by results()?
class(resultDESeq2)
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
is(resultDESeq2) # a data.frame
[1] "DESeqResults" "DataFrame" "DataTable" "SimpleList"
[5] "DataTableORNULL" "List" "Vector" "Annotated"
slotNames(resultDESeq2)
[1] "rownames" "nrows" "listData" "elementType"
[5] "elementMetadata" "metadata"
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 18.34 -0.1453 0.291 -0.499 0.61806723393
21s_rrna 107.33 0.0843 0.252 0.335 0.73780434012
hra1 2.53 0.7432 0.208 3.569 0.00035825052
icr1 141.57 -0.2149 0.037 -5.816 0.00000000601
lsr1 207.53 0.1322 0.153 0.864 0.38740549424
nme1 26.48 -0.0823 0.131 -0.630 0.52852462875
padj
<numeric>
15s_rrna 0.6738989809
21s_rrna 0.7826231668
hra1 0.0006499904
icr1 0.0000000163
lsr1 0.4497721567
nme1 0.5894985211
# The column names of the data.frame
# Note the column padj
# contains FDR values (computed Benjamini–Hochberg procedure)
colnames(resultDESeq2)
[1] "baseMean" "log2FoldChange" "lfcSE" "stat"
[5] "pvalue" "padj"
# Order the table by increasing adjusted p-value
resultDESeq2 <- resultDESeq2[order(resultDESeq2$padj),]
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
yar009c 1283 2.43 0.0464 52.3 0 0
yar071w 1674 4.24 0.0472 89.7 0 0
ybl005w 542 1.24 0.0261 47.7 0 0
ybl005w-b 1449 2.17 0.0389 55.7 0 0
ybl100w-b 454 1.23 0.0289 42.5 0 0
ybr012w-b 1199 2.01 0.0414 48.5 0 0
# Draw an histogram of the p-values
h1 <- hist(resultDESeq2$pvalue, breaks=20, col="grey", main="DESeq2 p-value distribution", xlab="Nominal P-value", ylab="Number of genes")
# Plot a horizontal line showing the estimate of m0 (the number of true null
# hypotheses) according to the Storey and Tibshirani (2003) method
m <- sum(h1$counts) # Number of genes with a p-value
m0 <- 2*(sum(as.vector(na.omit(resultDESeq2$pvalue)) > 0.5)) # Estimated number of genes under H0
m1 <- m - m0 # Estimated number of genes under H1
nbins <- length(h1$counts)
abline(h=m0/nbins, col="blue") # expected number of genes per bin under H0
For each gene \(i\), we tested the hypothesis of equality between the mean expression levels of the Snf2 (\(\mu_{i,Snf2}\)) and wild-type (\(\mu_{i,WT}\)) strains.
The null hypothesis is that the Snf2 and WT samples were drawn from populations having the same mean for the considered gene.
\[H_0: \mu_{i,Snf2} = \mu_{i,WT}\]
We computed the mean and dispersion of the read counts in \(n=48\) replicates per condition. We measure the difference between the means per condition for these samples, and we need to estimate its significance relative to the dispersion. Indeed, the same change in expression (e.g. a fold-change of 1.5) can be significant if we have a large number of replicates and if the counts per gene are highly reproducible between samples of the same condition (i.e. if their dispersion is small). However, the same fold-change might result from sampling effects if it is observed for a gene with high fluctuations between replicates, or if our number of replicates is too small to achieve a reliable estimation of the mean and standard deviation.
The goal of the test of hypothesis is precisely to estimate the significance of the observed effect (the difference between log2 of the normalized counts, which corresponds to the log2 fold change), whilst taking into account the dispersion of the measurements across replicates.
The p-value indicates the probability to obtain, under the null hypothesis, an effect at least as extreme as the measured effect. This p-value can be interpreted as a False Positive Risk (FPR), i.e. the risk to declare a gene significant whereas it is expressed at the same level in the Snf2 and WT strains.
Here we analyzed 6 887 genes, and tested the null hypothesis for each of them.
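Since each individual test carries a false positive risk \(\alpha\), we can compute how many genes would be declared positive by chance alone if all genes were under the null hypothesis; this is what motivates the multiple testing correction.

# Expected number of false positives with alpha = 0.01 if all 6887 genes were under H0
0.01 * 6887 # ~ 69 genes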
alpha <- 0.01 # Threshold on the p-value
# par(mfrow=c(1,2))
# Compute the significance (-log10 of the adjusted p-value); infinite values
# (p-values of 0, below machine precision) are capped at 350 below
resultDESeq2$sig <- -log10(resultDESeq2$padj)
sum(is.infinite(resultDESeq2$sig))
[1] 91
resultDESeq2[is.infinite(resultDESeq2$sig),"sig"] <- 350
# View(resultDESeq2[is.na(resultDESeq2$pvalue),])
# Select genes with a defined p-value (DESeq2 assigns NA to some genes)
genes.to.plot <- !is.na(resultDESeq2$pvalue)
# sum(genes.to.plot)
range(resultDESeq2[genes.to.plot, "log2FoldChange"])
[1] -3.62 7.54
# View(resultDESeq2[genes.to.plot,])
## Volcano plot of adjusted p-values
cols <- densCols(resultDESeq2$log2FoldChange, resultDESeq2$sig)
cols[resultDESeq2$pvalue ==0] <- "purple"
resultDESeq2$pch <- 19
resultDESeq2$pch[resultDESeq2$pvalue ==0] <- 6
plot(resultDESeq2$log2FoldChange,
resultDESeq2$sig,
col=cols, panel.first=grid(),
main="Volcano plot",
xlab="Effect size: log2(fold-change)",
ylab="-log10(adjusted p-value)",
pch=resultDESeq2$pch, cex=0.4)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")
## Plot the names of a reasonable number of genes, by selecting those that are not only significant but also have a strong effect size
gn.selected <- abs(resultDESeq2$log2FoldChange) > 2 & resultDESeq2$padj < alpha
text(resultDESeq2$log2FoldChange[gn.selected],
-log10(resultDESeq2$padj)[gn.selected],
lab=rownames(resultDESeq2)[gn.selected ], cex=0.6)
# par(mfrow=c(1,1))
The volcano plot makes it possible to simultaneously capture the effect size (abscissa) and significance (ordinate) of each tested gene.
For RNA-seq transcriptome data, the effect size is estimated by the log2 fold-change (\(log2FC\)), i.e. the logarithm in base 2 of the ratio between normalized counts in the test and control conditions.
Positive \(X\) values denote genes over-expressed in the test relative to the control.
Negative values denote genes under-expressed in the test relative to the control.
| Effect size (\(log2FC\)) | Fold-change | Interpretation |
|---|---|---|
| 4 | \(FC=2^4 = 16\) | this gene has a 16 times higher expression level in test than in control conditions |
| -4 | \(FC=2^{-4} = 0.0625\) | expression level 16 times lower in test than in control |
| 1 | \(FC=2^1 = 2\) | expression level 2 times higher in test than in control |
| 0 | \(FC=2^0 = 1\) | same expression level in test and control |
The significance is the minus log10 of the adjusted p-value.
It may be important to check the validity of our analysis by simply assessing the expression level of the most highly differential gene.
selectedGenes <- c(
"Most significant" = rownames(resultDESeq2)[which.max(resultDESeq2$sig)])
# top.logFC = rownames(resultDESeq2)[which.max(resultDESeq2$log2FoldChange)])
#gn.most.sign <- rownames(resultDESeq2)[1]
#gn.most.diff.val <- counts(dds.norm, normalized=T)[gn.most.sign,]
## Select a gene with small fold change but high significance
sel1 <- subset(
na.omit(resultDESeq2),
sig >= 50 & log2FoldChange > 0 & log2FoldChange < 0.5)
# dim(sel1)
selectedGenes <- append(selectedGenes,
c("Small FC yet significant"=rownames(sel1)[1]))
## Select the non-significant gene with the highest fold change
sel2 <- subset(x = na.omit(resultDESeq2), padj > alpha & log2FoldChange > 0 & baseMean > 1000 & baseMean < 10000)
# dim(sel2)
sel2 <- sel2[order(sel2$log2FoldChange, decreasing = TRUE),][1,]
selectedGenes <- append(
selectedGenes,
c("Non-significant"=rownames(sel2)[1]))
par(mfrow=c(length(selectedGenes),1))
for (g in selectedGenes) {
barplot(counts(dds.norm, normalized=TRUE)[g,],
col=phenoTable$color,
main=g, las=2, cex.names=0.5)
}
par(mfrow=c(1,1))
One popular diagram in DNA chip (and RNA-seq) analysis is the M versus A plot (MA plot) between two conditions \(a\) and \(b\). In this representation, the ordinate M is the log ratio, \(M = log_2(a) - log_2(b)\) (i.e. the log2 fold change), and the abscissa A is the average log signal, \(A = \frac{1}{2}\left(log_2(a) + log_2(b)\right)\).
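As an illustration, M and A can be computed by hand from the normalized counts of two samples (a sketch; the two sample names are arbitrary examples):

# Manual MA computation between two samples
a <- counts(dds.norm, normalized=TRUE)[, "WT1"]
b <- counts(dds.norm, normalized=TRUE)[, "Snf1"]
M <- log2(a + epsilon) - log2(b + epsilon)       # log2 fold change
A <- (log2(a + epsilon) + log2(b + epsilon)) / 2 # average log2 signal
plot(A, M, pch=".", panel.first=grid(), main="Manual MA plot (WT1 vs Snf1)")
abline(h=0, col="brown")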
# Draw a MA plot.
# Genes with adjusted p-values below 1% are shown
plotMA(resultDESeq2, colNonSig = "blue", main="MA plot")
abline(h=c(-1:1), col="red")
To check that the selected genes distinguish well between the "treated" and "untreated" conditions, we will perform a hierarchical clustering using the heatmap.2 function from the gplots library.
# We select gene names based on FDR (1%)
gene.kept <- rownames(resultDESeq2)[resultDESeq2$padj <= alpha & !is.na(resultDESeq2$padj)]
# We retrieve the log2-transformed counts for the genes of interest
countTable.kept <- log2(countTable + epsilon)[gene.kept, ]
dim(countTable.kept)
[1] 4374 96
# Install the gplots library if needed then load it
if(!require("gplots")){
install.packages("gplots")
}
library("gplots")
# Perform the hierarchical clustering with
# A distance based on Pearson-correlation coefficient
# and average linkage clustering as agglomeration criteria
heatmap.2(as.matrix(countTable.kept),
          scale="row",
          hclustfun=function(x) hclust(x, method="average"),
          distfun=function(x) as.dist((1-cor(t(x)))/2),
          trace="none",
          density.info="none",
          labRow="",
          cexCol=0.7)
We will now perform functional enrichment using the list of induced genes. This step will be performed using the gProfileR R library.
library(gProfileR)
resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
induced.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange >= 2 & resultDESeq2.df$padj < alpha]
# head(induced.sign)
# names(term.induced)
term.induced <- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
# term.induced$p.value
kable(term.induced[1:10,c("term.name",
"term.size",
"query.size",
"overlap.size",
"recall",
"precision",
"p.value",
"intersection")],
format.args=list(digits=3), caption="**Table: functional analysis with gProfileR.**")
And now using the list of repressed genes.
resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
repressed.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange <= -2 & resultDESeq2.df$padj < alpha]
head(repressed.sign)
term.repressed <- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]
kable(head(term.repressed[,c("p.value", "term.name","intersection")], 10))
Using a loop, randomly select, 10 times for each size, subsets of 2, 5, 10, 15, ..., 45 samples from the WT and from the Snf2 KO groups. Perform the differential expression calls for each subset and draw a diagram showing the number of differentially expressed genes as a function of the number of replicates.
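A possible skeleton for this exercise (a sketch only: the subset sizes, object names and final diagram are our own choices):

## Impact of the number of replicates on the number of DE genes
nbDiff <- data.frame()
for (n in c(2, 5, 10, 15, 20)) {    # extend up to 45 if computing time allows
  for (i in 1:10) {                 # 10 random draws per subset size
    sel <- c(sample(1:48, size=n), sample(49:96, size=n))
    dds.sub <- DESeqDataSetFromMatrix(countData=countTable[, sel],
                                      colData=phenoTable[sel, ],
                                      design=~strain)
    dds.sub <- nbinomWaldTest(estimateDispersions(estimateSizeFactors(dds.sub)))
    res.sub <- results(dds.sub, alpha=alpha, pAdjustMethod="BH")
    nbDiff <- rbind(nbDiff, data.frame(
      n=n, i=i, signif=sum(res.sub$padj < alpha, na.rm=TRUE)))
  }
}
# Diagram: distribution of the number of significant genes per subset size
boxplot(signif ~ n, data=nbDiff,
        xlab="Number of replicates per genotype",
        ylab="Number of genes declared significant")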
## Local paths: create a directory to store the results
resultDir <- "~/jgb71e_practicals/rnaseq_snf2_Schurch_2015/results"
dir.create(resultDir, showWarnings = FALSE, recursive = TRUE)
setwd(resultDir)
# Export the table with statistics per sample.
write.table(statsPerSample, file=file.path(resultDir, "stats_per_sample.tsv"),
quote=FALSE, sep="\t", col.names =NA, row.names = TRUE)
# Export the DESeq2 result table
DESeq2Table <- file.path(resultDir, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
write.table(resultDESeq2, file=DESeq2Table, col.names = NA, row.names = TRUE, sep="\t", quote = FALSE)