The RNA-Seq dataset we will use in this practical has been produced by Gierliński et al ([@pmid26206307, @pmid27022035]). The dataset is composed of 48 WT yeast samples vs 48 Snf2 knock-out mutant cell line. The prepared RNA-Seq libraries (unstranded) were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000 resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing 7126 rows/genes. The primary objective of this study was to check whether the observed gene read counts distribution where consistent with theorical models (e.g. negative binomial). More information can be obtained in the original paper (pdf)
R enables to download data directly from the Web. The expression matrix and phenotypic information will be loaded into R using the read.table function. Both table will be converted into a data.frame object when loaded into R. The ‘count.table’ object will contains counts for each gene (row) and each sample (column).
# Download data files from the Web site (only if not done yet)
<- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/"
url.counts
## Local paths: create a directory to store the results
<- ("~/ASG/practicals/rnaseq_snf2_Schurch_2015")
dir.snf2 <- file.path(dir.snf2, "data")
dir.counts <- file.path(dir.counts, "counts.txt")
file.counts <- file.path(dir.counts, "expDesign.txt")
file.expDesign
## Create a directory to download the dataset if it does not exist
dir.create(dir.counts, showWarnings = FALSE, recursive = TRUE)
## Download the data files if required
if (!file.exists(file.counts)) {
message("Downloading count table from ", url.counts)
download.file(url=file.path(url.counts, "counts.txt"), destfile = file.counts)
}if (!file.exists(file.expDesign)) {
message("Downloading design table from ", url.counts)
download.file(url=file.path(url.counts, "expDesign.txt"), destfile = file.expDesign)
}
# Load the count table
<- read.table(file=file.counts, sep="\t", header=TRUE, row.names=1)
count.table # View(count.table)
The dataset contains RNA-Seq count data for a wild type strain (WT) and a Snf2 mutant, with 48 biological replicates for each genotype.
All phenotypic informations are enclosed in a dedicated file. Note that the produced data.frame encodes the ‘strains’ columns as a factor1.
# Load experimental design file
<- read.table(file=file.expDesign, sep="\t", header=TRUE)
expDesign #View(expDesign)
# Check the first and last line of the phenotypic data
head(expDesign)
label strain
1 WT1 WT
2 WT2 WT
3 WT3 WT
4 WT4 WT
5 WT5 WT
6 WT6 WT
tail(expDesign)
label strain
91 Snf43 Snf
92 Snf44 Snf
93 Snf45 Snf
94 Snf46 Snf
95 Snf47 Snf
96 Snf48 Snf
## Count the number of sample in each class
table(expDesign$strain)
Snf WT
48 48
## Define a strain-specific color for each sample,
## and add it as a supplementary column to the phenotypic data
<- c(WT="green", Snf="orange") # Choose one color per strain
col.strain $color <- col.strain[as.vector(expDesign$strain)] expDesign
colSums()
or the apply()
function
(run help(colSums()
if you don’t know this function).Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.
## Dimensions
<- ncol(count.table)
nb.samples print(nb.samples)
[1] 96
<- nrow(count.table)
nb_genes print(nb_genes)
[1] 7126
dim(count.table)
[1] 7126 96
## Min, Max, median (...).
## Here on the first 4 samples
head(summary(count.table[,1:4]))
WT1 WT2 WT3 WT4
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 49 1st Qu.: 88 1st Qu.: 74 1st Qu.: 104
Median : 224 Median : 371 Median : 297 Median : 430
Mean : 837 Mean : 1107 Mean : 903 Mean : 1464
3rd Qu.: 561 3rd Qu.: 853 3rd Qu.: 670 3rd Qu.: 1019
Max. :188825 Max. :196804 Max. :172119 Max. :328674
## A magic trick to convert column-wise summaries into a data.frame.
## The do.call() function produces a data frame with one col per sample,
## we transpose it to obtain one row per sample and one column per statistics.
<- data.frame(t(do.call(cbind, lapply(count.table, summary))))
stats.per.sample head(stats.per.sample)
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
WT1 0 49.0 224 837 561 188825
WT2 0 88.0 371 1107 853 196804
WT3 0 74.2 297 903 670 172119
WT4 0 104.2 430 1464 1019 328674
WT5 0 84.0 353 1124 819 225435
WT6 0 153.0 667 2052 1599 357247
## We can now add some columns to the stats per sample
$libsum <- apply(count.table, 2, sum) ## libsum
stats.per.sample# Add some percentiles
$perc05 <- apply(count.table, 2, quantile, 0.05)
stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10)
stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90)
stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95)
stats.per.sample$zeros <- apply(count.table==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table)
stats.per.sample
# View(stats.per.sample)
kable(stats.per.sample[sample(1:ncol(count.table), size = 10),],
caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ")
Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | libsum | perc05 | perc10 | perc90 | perc95 | zeros | percent.zeros | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Snf4 | 0 | 137.0 | 562 | 1496 | 1268 | 204494 | 10662188 | 0 | 3.0 | 2802 | 4739 | 456 | 6.40 |
WT16 | 0 | 50.0 | 216 | 782 | 526 | 173478 | 5569418 | 0 | 1.0 | 1284 | 2388 | 621 | 8.71 |
WT12 | 0 | 76.0 | 332 | 1152 | 802 | 257336 | 8212630 | 0 | 1.0 | 1920 | 3536 | 560 | 7.86 |
WT23 | 0 | 79.0 | 345 | 1072 | 822 | 185052 | 7640884 | 0 | 2.0 | 1934 | 3416 | 535 | 7.51 |
Snf46 | 0 | 108.0 | 450 | 1236 | 1005 | 170179 | 8808358 | 0 | 2.0 | 2272 | 3984 | 499 | 7.00 |
Snf17 | 0 | 126.0 | 493 | 1429 | 1104 | 229969 | 10184815 | 0 | 2.5 | 2598 | 4460 | 474 | 6.65 |
Snf33 | 0 | 146.0 | 616 | 1653 | 1368 | 233323 | 11776986 | 0 | 3.0 | 3066 | 5280 | 453 | 6.36 |
Snf43 | 0 | 119.0 | 476 | 1343 | 1057 | 219417 | 9568585 | 0 | 2.0 | 2456 | 4182 | 482 | 6.76 |
Snf28 | 0 | 92.2 | 374 | 1078 | 850 | 180213 | 7681488 | 0 | 2.0 | 1957 | 3413 | 530 | 7.44 |
WT38 | 0 | 54.0 | 245 | 1010 | 620 | 270499 | 7197961 | 0 | 1.0 | 1682 | 3078 | 661 | 9.28 |
The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.
par(mfrow=c(3,1))
hist(as.matrix(count.table), col="blue", border="white", breaks=100)
hist(as.matrix(count.table), col="blue", border="white",
breaks=20000, xlim=c(0,2000), main="Counts per gene",
xlab="Counts (truncated axis)", ylab="Number of genes",
las=1, cex.axis=0.7)
<- 1 # pseudo-count to avoid problems with log(0)
epsilon hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white",
main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
las=1, cex.axis=0.7)
par(mfrow=c(1,1))
To get better insights into the distribution per sample, boxplots offer a good perspective.
## Boxplots
boxplot(log2(count.table + epsilon), col=expDesign$color, pch=".",
horizontal=TRUE, cex.axis=0.5,
las=1, ylab="Samples", xlab="log2(Counts +1)")
Another way to get an intuition of the value distributions is to use the geom_density() function (ggplot2 library), which draws one density curve for each sample.
## Density
## We will require some functions from the reshape2 and ggplot2 packages
if(!require("reshape2")){
install.packages("reshape2")
}
if(!require("ggplot2")){
install.packages("ggplot2")
}
library(ggplot2)
library(reshape2)
<- reshape2::melt(log2(count.table + epsilon))
count_melt head(count_melt)
variable value
1 WT1 1.58
2 WT1 4.39
3 WT1 2.00
4 WT1 6.25
5 WT1 5.93
6 WT1 3.81
ggplot(data=count_melt, mapping=aes(x=value, color=variable)) + geom_density()
Beware: the R function geom_density() does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading.
<- 6
nb.pairs
## Define a function to draw a scatter plot for a pair of variables (samples) with density colors
<- function(x,y){
plotFun <- densCols(x,y);
dns points(x,y, col=dns, pch=".", panel.first=grid());
# abline(a=0, b=1, col="brown")
}
## Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results. Should generally not be used, except for the sake of demonstration with a particular selection.
pairs(log2(count.table[,sample(ncol(count.table), nb.pairs)] + epsilon),
panel=plotFun, lower.panel = NULL)
Let’s have a look at the scatter plots using the pairs() function. We will only represent 6 randomly selected samples.
The command pairs() draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along te diagonal, with a relatively wider dispersion at the bottom, corresponding to small number fluctuations.
In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) discarding from the diagonal.
All genes from genome the S. cerevisiae where quantified. However only a fraction of them where expressed and some of them where to weakly expressed to be detected in any of the sample. As a result the count table may contain rows with only zero values (null counts).
One of the questions that will drive the analysis will be to define the impact of the number of biological samples on the results.
The original study contained 48 replicates per genotype, what happens if we select a smaller number?
Each attendee of this course select a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45…) and adapt the code below run the analysis with that number of replicates per genotype. We will at the end then compare the results (number of genes, significance, …).
<- 10 ## Each attendee chooses a number (3,4,5,10,15 or 20)
nb.replicates
<- sample(1:48, size=nb.replicates, replace=FALSE)
samples.WT
## Random sampling of the Snf2 replicates (columns 49 to 96)
<- sample(49:96, size=nb.replicates, replace=FALSE)
samples.Snf2
<- c(samples.WT, samples.Snf2)
selected.samples
# Don't forget to update colors
<- expDesign$color[selected.samples] col.pheno.selected
In this section we will search for genes whose expression is affected by the genetic invalidation. You will first need to install the DESeq2 bioconductor library then load it.
## Install the library if needed then load it
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
::install()
BiocManager
}
if(!require("lazyeval")){
install.packages("lazyeval")
}
if(!require("DESeq2")){
::install("DESeq2")
BiocManager
}
library("DESeq2")
We will then create a DESeqDataSet using the DESeqDataSetFromMatrix() function. Get some help about the DESeqDataSet and have a look at some important accessor methods: counts, conditions, estimateSizeFactors, sizeFactors, estimateDispersions and nbinomTest.
## Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
<- DESeqDataSetFromMatrix(countData = count.table[,selected.samples ], colData = expDesign[selected.samples,], design = ~ strain)
dds0 print(dds0)
class: DESeqDataSet
dim: 6887 20
metadata(1): version
assays(1): counts
rownames(6887): 15s_rrna 21s_rrna ... ty(gua)m2 ty(gua)o
rowData names(0):
colnames(20): WT43 WT37 ... Snf36 Snf14
colData names(3): label strain color
## What kind of object is it ?
is(dds0)
[1] "DESeqDataSet" "RangedSummarizedExperiment"
[3] "SummarizedExperiment" "RectangularData"
[5] "Vector" "Annotated"
[7] "vector_OR_Vector"
isS4(dds0)
[1] TRUE
## What does it contain ?
# The list of slot names
slotNames(dds0)
[1] "design" "dispersionFunction" "rowRanges"
[4] "colData" "assays" "NAMES"
[7] "elementMetadata" "metadata"
## Get some help about the "CountDataSet" class.
## NOT RUN
#?"DESeqDataSet-class"
The normalization procedure (RLE) is implemented through the estimateSizeFactors function.
Given a matrix with \(p\) columns (samples) and \(n\) rows (genes) this function estimates the size factors as follows: Each column element is divided by the geometric means of the rows. For each sample, the median (or, if requested, another location estimator) of these ratios (skipping the genes with a geometric mean of zero) is used as the size factor for this column.
The scaling factor for sample \(j\) is thus obtained as:
\[sf_{j} = median(\frac{K_{g,j}}{(\prod_{j=1}^p K_{g,j})^{1/p}}) \]
### Let's implement such a function
### cds is a countDataset
<- function (cds){
estimSf # Get the count matrix
<- counts(cds)
cts
# Compute the geometric mean
<- function(x) prod(x)^(1/length(x))
geomMean
# Compute the geometric mean over the line
<- apply(cts, 1, geomMean)
gm.mean
# Zero values are set to NA (avoid subsequentcdsdivision by 0)
== 0] <- NA
gm.mean[gm.mean
# Divide each line by its corresponding geometric mean
# sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
# MARGIN: 1 or 2 (line or columns)
# STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
# FUN: the function to be applied
<- sweep(cts, 1, gm.mean, FUN="/")
cts
# Compute the median over the columns
<- apply(cts, 2, median, na.rm=TRUE)
med
# Return the scaling factor
return(med)
}
Now, check that the results obtained with our function are the same as those produced by DESeq. The method associated with normalization for the “CountDataSet” class is estimateSizeFactors().
## Normalizing using the method for an object of class"CountDataSet"
<- estimateSizeFactors(dds0)
dds.norm sizeFactors(dds.norm)
WT43 WT37 WT14 WT25 WT26 WT27 WT5 WT48 WT28 WT9 Snf29 Snf35 Snf8
0.720 0.889 1.070 1.161 0.945 1.018 0.856 0.828 1.139 0.862 1.044 0.923 0.941
Snf26 Snf7 Snf42 Snf9 Snf19 Snf36 Snf14
1.274 1.664 0.825 1.234 0.890 0.946 1.242
## Now get the scaling factor with our homemade function.cds.norm
head(estimSf(dds0))
WT43 WT37 WT14 WT25 WT26 WT27
0.720 0.889 1.070 1.161 0.945 1.018
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))
[1] TRUE
## Checking the normalization
par(mfrow=c(1,2),cex.lab=0.7)
boxplot(log2(counts(dds.norm)+epsilon), col=col.pheno.selected, cex.axis=0.7,
las=1, xlab="log2(counts)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=col.pheno.selected, cex.axis=0.7,
las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts")
if(!require("patchwork")){
install.packages("patchwork")
}
<- ggplot(data=count_melt, mapping=aes(x=value, color=variable)) + geom_density() + theme(legend.position = "none")
p1 <- melt(log2(counts(dds.norm, normalized=TRUE)+epsilon))
count_norm_melt head(count_norm_melt)
Var1 Var2 value
1 15s_rrna WT43 4.03
2 21s_rrna WT43 6.66
3 hra1 WT43 1.92
4 icr1 WT43 7.30
5 lsr1 WT43 7.32
6 nme1 WT43 3.90
<- ggplot(data=count_norm_melt, mapping=aes(x=value, color=Var2)) + geom_density() + theme(legend.position = "none")
p2 + p2 p1
Let us imagine that we would produce a lot of RNA-Seq experiments from the same samples (technical replicates). For each gene \(g\) the measured read counts would be expected to vary rather slighlty around the expected mean and would be probably well modeled using a Poisson distribution. However, when working with biological replicates more variations are intrinsically expected. Indeed, the measured expression values for each genes are expected to fluctuate more importantly, due to the combination of biological and technical factors: inter-individual variations in gene regulation, sample purity, cell-synchronization issues or reponses to environment (e.g. heat-shock).
The Poisson distribution has only one parameter indicating its expected mean : \(\lambda\). The variance of the distribution equals its mean \(\lambda\). Thus in most cases, the Poisson distribution is not expected to fit very well with the count distribution in biological replicates, since we expect some over-dispersion (greater variability) due to biological noise.
As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression call rely on an alternative distribution: the negative binomial (note that this holds true also for other -Seq approaches, e.g. ChIP-Seq with replicates).
The negative binomial distribution is a discrete distribution that give us the probability of observing \(x\) failures before a target number of succes \(n\) is obtained. As we will see later the negative binomial can also be used to model over-dispersed data (in this case this overdispersion is relative to the poisson model).
First, given a Bernouilli trial with a probability \(p\) of success, the negative binomial distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. In this case the parameters of the distribution will thus be \(p\), \(n\) (in dnbinom() function of R, \(n\) and \(p\) are denoted by arguments size and prob respectively).
\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]
In this formula, \(p^n\) denotes the probability to observe \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to dispose \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes for the last trial to be a success).
The negative binomial distribution has expected value \(n\frac{q}{p}\) and variance \(n\frac{q}{p^2}\). Some examples of using this distribution in R are provided below.
Particular case: when \(n=1\) the negative binomial corresponds to the the geometric distribution, which models the probability distribution to observe the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).
par(mfrow=c(1,1))
## Some intuition about the negative binomiale parametrized using n and p.
## The simple case, one success (see geometric distribution)
# Let's have a look at the density
<- 1/6 # the probability of success
p <- 1 # target for number of successful trials
n
# The density function
plot(0:10, dnbinom(0:10, n, p), type="h", col="blue", lwd=4)
# the probability of zero failure before one success.
# i.e the probability of success
dnbinom(0, n , p)
[1] 0.167
## i.e the probability of at most 5 failure before one success.
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)
[1] 0.665
## The probability of at most 10 failures before one sucess
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)
[1] 0.865
## The probability to have more than 10 failures before one sucess
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)
[1] 0.135
## With two successes
## The probability of x failure before two success (e.g. two six)
<- 2
n plot(0:30, dnbinom(0:30, n, p), type="h", col="blue", lwd=2,
main="Negative binomial density",
ylab="P(x; n,p)",
xlab=paste("x = number of failures before", n, "successes"))
# Expected value
<- 1-p
q <- n*q/p) (ev
[1] 10
abline(v=ev, col="darkgreen", lwd=2)
# Variance
<- n*q/p^2) (v
[1] 60
arrows(x0=ev-sqrt(v), y0 = 0.04, x1=ev+sqrt(v), y1=0.04, col="brown",lwd=2, code=3, , length=0.2, angle=20)
The second way of parametrizing the distribution is using the mean value \(m\) and the dispersion parameter \(r\) (in dnbinom() function of R, \(m\) and \(r\) are denoted by arguments mu and size respectively). The variance of the distribution can then be computed as \(m + m^2/r\).
Note that \(m\) can be deduced from \(n\) and \(p\).
<- 10
n <- 1/6
p <- 1-p
q <- n*q/p
mu
all(dnbinom(0:100, mu=mu, size=n) == dnbinom(0:100, size=n, prob=p))
[1] FALSE
To perform diffential expression call DESeq will assume that, for each gene, the read counts are generated by a negative binomial distribution. One problem here will be to estimate, for each gene, the two parameters of the negative binomial distribution: mean and dispersion.
The mean will be estimated from the observed normalized counts in both conditions.
The first step will be to compute a gene-wise dispersion. When the number of available samples is insufficient to obtain a reliable estimator of the variance for each gene, DESeq will apply a shrinkage strategy, which assumes that counts produced by genes with similar expression level (counts) have similar variance (note that this is a strong assumption). DESeq will regress the gene-wise dispersion onto the means of the normalized counts to obtain an estimate of the dispersion that will be subsequently used to build the binomial model for each gene.
## Performing estimation of dispersion parameter
<- estimateDispersions(dds.norm)
dds.disp
## A diagnostic plot which
## shows the mean of normalized counts (x axis)
## and dispersion estimate for each genes
plotDispEsts(dds.disp)
Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest can be used to test for differential expression. The output is a data.frame which contains nominal p-values, as well as FDR values (correction for multiple tests computed with the Benjamini–Hochberg procedure).
<- 0.0001
alpha <- nbinomWaldTest(dds.disp)
wald.test <- results(wald.test, alpha=alpha, pAdjustMethod="BH")
res.DESeq2
## What is the object returned by nbinomTest()
class(res.DESeq2)
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
is(res.DESeq2) # a data.frame
[1] "DESeqResults" "DFrame" "DataFrame"
[4] "SimpleList" "RectangularData" "List"
[7] "DataFrame_OR_NULL" "Vector" "list_OR_List"
[10] "Annotated" "vector_OR_Vector"
slotNames(res.DESeq2)
[1] "priorInfo" "rownames" "nrows" "elementType"
[5] "elementMetadata" "metadata" "listData"
head(res.DESeq2)
log2 fold change (MLE): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 28.08058 -1.240636 0.8926980 -1.389760 NA NA
21s_rrna 128.41370 -0.572078 0.7000361 -0.817213 NA NA
hra1 2.29627 0.443593 0.5452527 0.813555 0.415900041 0.529763145
icr1 147.68494 -0.346711 0.0913234 -3.796520 0.000146742 0.000559623
lsr1 224.52147 -0.235851 0.4436689 -0.531591 0.595008983 0.696343313
nme1 29.00707 -0.567462 0.3515400 -1.614219 0.106480031 0.179555630
## The column names of the data.frame
## Note the column padj
## contains FDR values (computed Benjamini–Hochberg procedure)
colnames(res.DESeq2)
[1] "baseMean" "log2FoldChange" "lfcSE" "stat"
[5] "pvalue" "padj"
## Order the table by decreasing p-valuer
<- res.DESeq2[order(res.DESeq2$padj),]
res.DESeq2 head(res.DESeq2)
log2 fold change (MLE): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ygr234w 2965.003 4.12298 0.0939529 43.8835 0.00000e+00 0.00000e+00
yhr215w 1298.565 4.37142 0.1163141 37.5829 0.00000e+00 0.00000e+00
yil121w 926.134 2.74561 0.0662922 41.4169 0.00000e+00 0.00000e+00
yor290c 853.322 7.41602 0.1787839 41.4803 0.00000e+00 0.00000e+00
yar071w 1666.340 4.13109 0.1111865 37.1546 3.69040e-302 4.86395e-299
ydr033w 4051.107 3.60699 0.1045447 34.5019 7.51713e-261 8.25631e-258
## Draw an histogram of the p-values
hist(res.DESeq2$padj, breaks=20, col="grey", main="DESeq2 p-value distribution", xlab="DESeq2 P-value", ylab="Number of genes")
<- 0.01 # Threshold on the adjusted p-value
alpha <- densCols(res.DESeq2$log2FoldChange, -log10(res.DESeq2$pvalue))
cols plot(res.DESeq2$log2FoldChange, -log10(res.DESeq2$padj), col=cols, panel.first=grid(),
main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
pch=20, cex=0.6)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")
<- abs(res.DESeq2$log2FoldChange) > 2 & res.DESeq2$padj < alpha
gn.selected text(res.DESeq2$log2FoldChange[gn.selected],
-log10(res.DESeq2$padj)[gn.selected],
lab=rownames(res.DESeq2)[gn.selected ], cex=0.4)
It may be important to check the validity of our analysis by simply assessing the expression level of the most highly differential gene.
<- rownames(res.DESeq2)[1]
gn.most.sign <- counts(dds.norm, normalized=T)[gn.most.sign,]
gn.most.diff.val barplot(gn.most.diff.val, col=col.pheno.selected, main=gn.most.sign, las=2, cex.names=0.5)
One popular diagram in dna chip analysis is the M versus A plot (MA plot) between two conditions \(a\) and \(b\). In this representation :
## Draw a MA plot.
## Genes with adjusted p-values below 1% are shown
plotMA(res.DESeq2, colNonSig = "blue")
abline(h=c(-1:1), col="red")
To ensure that the selected genes distinguish well between “treated””
and “untreated” condition we will perform a hierachical clustering using
the heatmap.2()
function from the gplots
library.
## We select gene names based on FDR (1%)
<- rownames(res.DESeq2)[res.DESeq2$padj <= alpha & !is.na(res.DESeq2$padj)]
gene.kept
## We retrieve the normalized counts for gene of interest
<- log2(count.table + epsilon)[gene.kept, ]
count.table.kept dim(count.table.kept)
[1] 2398 96
## Install the gplots library if needed then load it
if(!require("gplots")){
install.packages("gplots")
}library("gplots")
## Perform the hierarchical clustering with
## A distance based on Pearson-correlation coefficient
## and average linkage clustering as agglomeration criteria
heatmap.2(as.matrix(count.table.kept),
scale="row",
hclust=function(x) hclust(x,method="average"),
distfun=function(x) as.dist((1-cor(t(x)))/2),
trace="none",
density="none",
labRow="",
cexCol=0.7)
We will now perform functional enrichment using the list of induced genes. This step will be performed using the gProfileR R library.
library(gProfileR)
<- na.omit(data.frame(res.DESeq2))
res.DESeq2.df <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange >= 2 & res.DESeq2.df$padj < alpha]
induced.sign # head(induced.sign)
# names(term.induced)
<- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
term.induced # term.induced$p.value
kable(term.induced[1:10,c("term.name",
"term.size",
"query.size",
"overlap.size",
"recall",
"precision",
"p.value",
"intersection")],
format.args=c(engeneer=TRUE, digits=3), caption="**Table: functional analysis wit gProfileR. ** ")
term.name | term.size | query.size | overlap.size | recall | precision | p.value | intersection | |
---|---|---|---|---|---|---|---|---|
39 | RNA-DNA hybrid ribonuclease activity | 48 | 85 | 30 | 0.625 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
10 | DNA integration | 50 | 85 | 30 | 0.600 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
49 | RNA-directed DNA polymerase activity | 52 | 85 | 30 | 0.577 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
42 | aspartic-type peptidase activity | 54 | 85 | 29 | 0.537 | 0.341 | 0 | YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
44 | aspartic-type endopeptidase activity | 54 | 85 | 29 | 0.537 | 0.341 | 0 | YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
38 | endoribonuclease activity, producing 5’-phosphomonoesters | 63 | 85 | 30 | 0.476 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
50 | DNA-directed DNA polymerase activity | 63 | 85 | 30 | 0.476 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
48 | DNA polymerase activity | 67 | 85 | 30 | 0.448 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
35 | endonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters | 74 | 85 | 30 | 0.405 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
37 | endoribonuclease activity | 77 | 85 | 30 | 0.390 | 0.353 | 0 | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
And now using the list of repressed genes.
<- na.omit(data.frame(res.DESeq2))
res.DESeq2.df <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange <= -2 & res.DESeq2.df$padj < alpha]
repressed.sign head(repressed.sign)
[1] "yer081w" "ypl025c" "ygr051c" "ycr025c" "ypr064w" "ygr087c"
<- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]
term.repressed kable(head(term.induced[,c("p.value", "term.name","intersection")], 10))
p.value | term.name | intersection | |
---|---|---|---|
39 | 0 | RNA-DNA hybrid ribonuclease activity | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
10 | 0 | DNA integration | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
49 | 0 | RNA-directed DNA polymerase activity | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
42 | 0 | aspartic-type peptidase activity | YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
44 | 0 | aspartic-type endopeptidase activity | YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
38 | 0 | endoribonuclease activity, producing 5’-phosphomonoesters | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
50 | 0 | DNA-directed DNA polymerase activity | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
48 | 0 | DNA polymerase activity | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
35 | 0 | endonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
37 | 0 | endoribonuclease activity | YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B |
Using a loop, randomly select 10 times 2,5,10,15..45 samples from WT and Snf2 KO. Perform differential expression calls and draw a diagram showing the number of differential expressed genes.
## Create a directory to store the results that will be obtained below
<- file.path(dir.snf2, "results")
dir.results dir.create(dir.results, showWarnings = FALSE, recursive = TRUE)
## Export the table with statistics per sample.
write.table(stats.per.sample, file=file.path(dir.results, "stats_per_sample.tsv"),
quote=FALSE, sep="\t", col.names =NA, row.names = TRUE)
# Export the DESeq2 result table
<- file.path(dir.results, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
DESeq2.table write.table(res.DESeq2, file=DESeq2.table, col.names = NA, row.names = TRUE, sep="\t", quote = FALSE)
A factor is a vector with levels (categories), which
permits an efficient storage and indexing, but can in some cases lead to
misleading effects. To circumvent this, we will sometimes have to
convert the factor to a vector, with the R command
as.vector()
. ↩︎