The RNA-Seq dataset we will use in this practical was produced by Gierliński et al. (2015) and Schurch et al. (2016).
The dataset is composed of 48 samples of the yeast wild-type (WT) strain and 48 samples of a Snf2 knock-out mutant cell line. The prepared (unstranded) RNA-Seq libraries were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000, resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads were cleaned, mapped and counted to generate a count data matrix containing \(7126\) rows/genes.
In this tutorial, we will start from the count table generated by the authors of the study, and use different R tools to explore it and to detect differentially expressed genes.
R can download data directly from the Web with the download.file function. We then begin with some verification steps.
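For instance, the count table used below could first be saved locally (a sketch; in the rest of the tutorial we will instead read the file directly from its URL):

# Optional: download the count table to a local file before loading it
download.file("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt",
              destfile="counts.txt")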
The expression matrix and the phenotypic information will be loaded into R using the read.table function. Both tables will be converted into data.frame objects when loaded into R.
The countTable object should contain counts for each gene (row) and each sample (column).
The phenoTable object should contain the genotype designation of each sample, i.e. 48 biological replicates each of the wild-type strain (WT) and of the Snf2 mutant.
Note that character columns of the loaded tables are stored as factor objects.
Reminder: a factor is a vector with levels (categories), which permits efficient storage and indexing, but can in some cases lead to misleading behaviour. To circumvent this, we will sometimes have to convert a factor back to a vector, with the R command as.vector().
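A minimal illustration of this behaviour:

# A factor stores values as integer codes plus a table of levels
f <- factor(c("WT", "Snf2", "WT"))
levels(f)    # the categories
as.vector(f) # back to a plain character vector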
# Load the count table
countTable <- read.table("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt", header=TRUE, row.names=1)

# View part of the data/table
countTable[101:108, 1:5]

            WT1   WT2   WT3   WT4   WT5
yal034w-a    78   116   103   113   113
yal035w    2815  4160  2969  4411  4836
yal036c     331   599   348   525   691
yal037c-a     0    37    23    15    51
yal037c-b    70   183   199   208   196
yal037w      22    69    32    58    51
yal038w   21080 18869 14649 28551 20208
yal039c    1541  1447  1318  2251  1501

# Load the sample description file
# (traditionally named "pheno" for "phenotype", although in our case the conditions are genotypes)
phenoTable <- read.table("http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/expDesign.txt", header=TRUE, row.names=1)

# Define a strain-specific color for each sample, and add it as a supplementary column
strainColor <- c("WT"="green", "Snf2"="orange") # Choose one color per strain
print(strainColor)

## Add a column to the pheno table to specify the color of each sample according to its genotype
phenoTable$color[phenoTable$strain == "Snf"] <- strainColor["Snf2"]
phenoTable$color[phenoTable$strain == "WT"] <- strainColor["WT"]

# Check the column names of the pheno table
colnames(phenoTable)

[1] "strain" "color"
Check the dimensions of the data (tables).
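A possible check (the expected number of samples, 96, follows from the design):

# Dimensions of the two tables
dim(countTable) # genes (rows) x samples (columns)
dim(phenoTable) # samples (rows) x descriptors (columns)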
## Quick test with a small table generated on the fly
testTable <- data.frame(
  ABC = sample(c("A", "B", "C", "D", "E"), size=20, replace=TRUE),
  gender = sample(c("boy", "girl", "robot"), size=20, replace=TRUE),
  yesno = sample(c("yes", "no"), size=20, replace=TRUE))
Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.
# Min, Max, median (...).
# Here on the first 4 samples
summary(countTable[,1:4])
Among the reference functions in R, apply (and its derivatives) allows the repeated application of a function to a series of data, at a markedly higher speed than a for loop. Note that, when working on a matrix or data frame, the apply function requires you to indicate whether the function should be applied to the rows (MARGIN=1) or to the columns (MARGIN=2). Some dedicated built-in functions already exist, such as colMeans, colSums, rowMeans and rowSums. Let's use apply to get the summary of every column of the count table. It produces a matrix as output, which we transpose to obtain samples as rows and statistics as columns, and convert back to a data.frame object.
statsPerSample <- data.frame(t(apply(countTable, 2, summary)))
head(statsPerSample)
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
WT1 0 49.0 224 837 561 189000
WT2 0 88.0 371 1110 853 197000
WT3 0 74.2 297 903 670 172000
WT4 0 104.0 430 1460 1020 329000
WT5 0 84.0 353 1120 819 225000
WT6 0 153.0 667 2050 1600 357000
# Redefine column names for readability
names(statsPerSample) <- c("min", "Q1", "median", "mean", "Q3", "max")
kable(head(statsPerSample))
| | min | Q1 | median | mean | Q3 | max |
|---|---|---|---|---|---|---|
| WT1 | 0 | 49.0 | 224 | 837 | 561 | 189000 |
| WT2 | 0 | 88.0 | 371 | 1110 | 853 | 197000 |
| WT3 | 0 | 74.2 | 297 | 903 | 670 | 172000 |
| WT4 | 0 | 104.0 | 430 | 1460 | 1020 | 329000 |
| WT5 | 0 | 84.0 | 353 | 1120 | 819 | 225000 |
| WT6 | 0 | 153.0 | 667 | 2050 | 1600 | 357000 |
We can now add some columns to the stats per sample. For example, here are the library sums and the 5% quantile.
statsPerSample$libsum <- colSums(countTable) # library sums
statsPerSample$perc05 <- apply(countTable, 2, quantile, probs=0.05)
Add some more columns to statsPerSample (one possible solution is sketched below):
- the 10%, 90% and 95% quantiles;
- the number of zero counts per sample;
- the percentage of zero counts per sample.
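One possible solution, following the same pattern as above (a sketch; the column names match those used in the table below):

statsPerSample$perc10 <- apply(countTable, 2, quantile, probs=0.10)
statsPerSample$perc90 <- apply(countTable, 2, quantile, probs=0.90)
statsPerSample$perc95 <- apply(countTable, 2, quantile, probs=0.95)
statsPerSample$zeros <- apply(countTable == 0, 2, sum) # number of genes with zero counts
statsPerSample$percZeros <- 100 * statsPerSample$zeros / nrow(countTable)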
Finally, randomly choose (with the sample() function) 10 samples (rows) of the stats per sample table and look at their statistics.
statsPerSample[sample(1:ncol(countTable), size = 10),]
We show hereafter the statistics per sample (subset only).
| | min | Q1 | median | mean | Q3 | max | libsum | perc05 | perc10 | perc90 | perc95 | zeros | percZeros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Snf26 | 0 | 124.0 | 506 | 1380 | 1140 | 210000 | 9812933 | 0 | 2392 | 2530 | 2526 | 467 | 6.55 |
| Snf2 | 0 | 98.0 | 422 | 1220 | 960 | 200000 | 8694091 | 0 | 1977 | 2683 | 3066 | 552 | 7.75 |
| Snf3 | 0 | 98.0 | 408 | 1230 | 954 | 210000 | 8773138 | 0 | 3316 | 4570 | 5280 | 533 | 7.48 |
| WT8 | 0 | 80.0 | 360 | 1380 | 900 | 343000 | 9839118 | 0 | 1624 | 1738 | 1868 | 576 | 8.08 |
| Snf43 | 0 | 119.0 | 476 | 1340 | 1060 | 219000 | 9568585 | 0 | 2 | 1 | 3 | 482 | 6.76 |
| WT37 | 0 | 75.0 | 344 | 1550 | 901 | 468000 | 11009738 | 0 | 2 | 2 | 2 | 586 | 8.22 |
| WT19 | 0 | 84.0 | 364 | 1310 | 879 | 311000 | 9326815 | 0 | 2 | 1 | 2 | 564 | 7.92 |
| WT39 | 0 | 61.0 | 280 | 1040 | 689 | 247000 | 7408720 | 0 | 4056 | 5098 | 3496 | 627 | 8.80 |
| WT10 | 0 | 80.2 | 360 | 1110 | 866 | 188000 | 7920887 | 0 | 2 | 2 | 2 | 547 | 7.68 |
| Snf11 | 0 | 109.0 | 458 | 1390 | 1050 | 247000 | 9907483 | 0 | 1524 | 2802 | 2181 | 514 | 7.21 |
The summary only displays a few key values (mean, median, quartiles). In order to get a better intuition of the data, we can draw a histogram of all values.
par(mfrow=c(3,1))
hist(as.matrix(countTable), col="blue", border="white", breaks=100)
hist(as.matrix(countTable), col="blue", border="white",
breaks=20000, xlim=c(0,2000), main="Counts per gene",
xlab="Counts (truncated axis)", ylab="Number of genes",
las=1, cex.axis=0.7)
epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(countTable + epsilon)), breaks=100, col="blue", border="white",
main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
las=1, cex.axis=0.7)
par(mfrow=c(1,1))
To get better insights into the distribution per sample, boxplots offer a good perspective.
boxplot(log2(countTable + epsilon), col=phenoTable$color, pch=".",
horizontal=TRUE, cex.axis=0.5,
las=1, ylab="Samples", xlab="log2(counts +1)")
Another way to get an intuition of the value distributions is to use the plotDensity function, which draws one density curve for each sample.
# We will require one function from the affy package
if(!require("affy")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("affy") # on recent Bioconductor versions, use BiocManager::install("affy") instead
}
library(affy)
plotDensity(log2(countTable + epsilon), lty=1, col=phenoTable$color, lwd=2)
grid()
legend("topright", legend=names(strainColor), col=strainColor, lwd=2)
Beware: the R function plotDensity does not display the actual distribution of your values, but a smoothed (kernel density) estimate. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the smoothing can produce extrapolated values which can be misleading.
Let’s have a look at the scatter plots using the pairs function. We will only represent 5 randomly selected samples.
# Define a function to draw a scatter plot for a pair of variables (samples) with density colors
plotFun <- function(x,y){
dns <- densCols(x,y);
points(x,y, col=dns, pch=".", panel.first=grid());
# abline(a=0, b=1, col="brown")
}
# Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results
pairs(log2(countTable[,sample(ncol(countTable),5)] + epsilon),
panel=plotFun, lower.panel = NULL)
The command pairs draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along the diagonal, with a relatively wider dispersion at the bottom, corresponding to fluctuations in small counts.
In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) departing from the diagonal.
All genes from the S. cerevisiae genome were quantified. However, only a fraction of them were expressed, and some were too weakly expressed to be detected in any of the samples. As a result, the count table may contain rows with only zero values (null counts).
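We can count these never-detected genes, and optionally discard them (a sketch):

# Number of genes with zero counts in all samples
sum(rowSums(countTable) == 0)
# Optionally, discard them before further analysis
# countTable <- countTable[rowSums(countTable) > 0, ]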
One of the questions that will drive this analysis is the impact of the number of samples on the results.
The original study contained 48 replicates per genotype; what happens if we select a smaller number?
First, we will analyse the full dataset with 48 samples per genotype.
In the second part of the tutorial, each attendee of this course will select a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45, …) and adapt the code below to run the analysis with that number of replicates per genotype.
At the end, we will compare the results (number of genes, significance, …) between the full dataset analysis and the analyses of sub-samples of varying sizes.
We will already select here the subset of yeast samples for our sub-sampling analysis.
## Select a subset of samples from each genotype for the sub-sampling analysis
nb.replicates <- 10 # Each attendee chooses a number (e.g. 3, 4, 5, 10, 15 or 20)
# Random sampling of the WT replicates (columns 1 to 48)
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)
# Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)
selected.samples <- c(samples.WT, samples.Snf2)
# Don't forget to update colors
col.pheno.selected <- phenoTable$color[selected.samples]
In this section we will search for genes whose expression is affected by the Snf2 knock-out. You will first need to install the DESeq2 Bioconductor library, then load it.
# Install the library if needed then load it
if(!require("DESeq2")){
install.packages("lazyeval")
install.packages("ggplot2")
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
}
library("DESeq2")
We will first analyse together the full dataset. The analysis of subsets of samples will be done as an exercise.
We will then create a DESeqDataSet using the DESeqDataSetFromMatrix function. Get some help about the DESeqDataSet class, and have a look at some important methods: counts, sizeFactors, estimateSizeFactors, estimateDispersions and nbinomWaldTest.
# Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
dds0 <- DESeqDataSetFromMatrix(countData = countTable, colData = phenoTable, design = ~ strain)
## Check the general properties of the DESeq dataset
print(dds0)
class: DESeqDataSet
dim: 6887 96
metadata(1): version
assays(1): counts
rownames(6887): 15s_rrna 21s_rrna ... ty(gua)m2 ty(gua)o
rowData names(0):
colnames(96): WT1 WT2 ... Snf47 Snf48
colData names(2): strain color
# What kind of object is it ?
is(dds0)
[1] "DESeqDataSet" "RangedSummarizedExperiment"
[3] "SummarizedExperiment" "Vector"
[5] "Annotated"
isS4(dds0)
[1] TRUE
# What does it contain ?
# The list of slot names
slotNames(dds0)
[1] "design" "dispersionFunction" "rowRanges"
[4] "colData" "assays" "NAMES"
[7] "elementMetadata" "metadata"
# Get some help about the "DESeqDataSet" class.
# NOT RUN
#?"DESeqDataSet-class"
The normalization procedure (RLE) is implemented through the estimateSizeFactors function.
Note: this section is somewhat technical. If the underlying mathematics seem a bit too complicated do not worry, the next sections will be practice-oriented.
Given a matrix with \(p\) columns (samples) and \(n\) rows (genes), this function estimates the size factors as follows: for each gene \(g\), compute the geometric mean of its counts \(K_{g,1}, \ldots, K_{g,p}\) across all samples, then divide each count by this geometric mean. The scaling factor for sample \(j\) is obtained as the median of these ratios over all genes:
\[sf_{j} = \underset{g}{\text{median}} \left( \frac{K_{g,j}}{\left( \prod_{j'=1}^{p} K_{g,j'} \right)^{1/p}} \right)\]
# Let's implement such a function
# cds is a countDataset
estimSf <- function (cds){
# Get the count matrix
cts <- counts(cds)
# Compute the geometric mean
geomMean <- function(x) prod(x)^(1/length(x))
# Compute the geometric mean of each gene (row) across all samples
gm.mean <- apply(cts, 1, geomMean)
# Set zero values to NA (to avoid a subsequent division by 0)
gm.mean[gm.mean == 0] <- NA
# Divide each row by its corresponding geometric mean
# sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
# MARGIN: 1 or 2 (rows or columns)
# STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
# FUN: the function to be applied
cts <- sweep(cts, 1, gm.mean, FUN="/")
# Compute the median over the columns
med <- apply(cts, 2, median, na.rm=TRUE)
# Return the scaling factor
return(med)
}
Now, check that the results obtained with our function are the same as those produced by DESeq2. The method performing the normalization for the DESeqDataSet class is estimateSizeFactors.
# Normalizing using the method for an object of class "DESeqDataSet"
dds.norm <- estimateSizeFactors(dds0)
sizeFactors(dds.norm)
WT1 WT2 WT3 WT4 WT5 WT6 WT7 WT8 WT9 WT10 WT11 WT12
0.586 0.925 0.751 1.109 0.883 1.687 0.896 0.940 0.889 0.911 0.643 0.836
WT13 WT14 WT15 WT16 WT17 WT18 WT19 WT20 WT21 WT22 WT23 WT24
1.022 1.100 0.782 0.550 0.885 0.878 0.939 0.631 1.053 1.444 0.870 0.910
WT25 WT26 WT27 WT28 WT29 WT30 WT31 WT32 WT33 WT34 WT35 WT36
1.194 0.974 1.049 1.178 0.877 1.258 0.939 0.644 0.977 0.828 0.769 0.781
WT37 WT38 WT39 WT40 WT41 WT42 WT43 WT44 WT45 WT46 WT47 WT48
0.922 0.647 0.714 1.277 1.063 0.803 0.739 0.691 1.166 0.604 1.272 0.854
Snf1 Snf2 Snf3 Snf4 Snf5 Snf6 Snf7 Snf8 Snf9 Snf10 Snf11 Snf12
1.395 1.086 1.078 1.447 0.875 1.374 1.717 0.967 1.270 0.977 1.186 0.970
Snf13 Snf14 Snf15 Snf16 Snf17 Snf18 Snf19 Snf20 Snf21 Snf22 Snf23 Snf24
1.253 1.278 0.906 1.234 1.275 1.118 0.917 1.124 1.199 0.870 0.997 1.674
Snf25 Snf26 Snf27 Snf28 Snf29 Snf30 Snf31 Snf32 Snf33 Snf34 Snf35 Snf36
0.911 1.312 0.822 0.966 1.076 0.937 0.859 1.265 1.559 1.133 0.953 0.974
Snf37 Snf38 Snf39 Snf40 Snf41 Snf42 Snf43 Snf44 Snf45 Snf46 Snf47 Snf48
1.078 0.988 1.958 1.073 1.225 0.850 1.221 1.257 0.989 1.146 1.229 0.976
# Now get the scaling factors with our homemade function
head(estimSf(dds0))
WT1 WT2 WT3 WT4 WT5 WT6
0.543 0.884 0.725 1.062 0.841 1.603
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))
[1] FALSE
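The small discrepancy most likely comes from a numerical limitation of our naive implementation: for genes with large counts, prod() can overflow to Inf when multiplied across 96 samples, which distorts the corresponding ratios. A safer version computes the geometric mean on the log scale (a sketch; geomMeanLog is our own name):

# Numerically stable geometric mean, to plug into estimSf in place of geomMean
geomMeanLog <- function(x) exp(mean(log(x))) # returns 0 (then set to NA) when any count is 0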
# Checking the normalization
par(mfrow=c(2,2), cex.lab=0.7)
# Note: dds.norm contains all 96 samples, so we use the full color vector
boxplot(log2(counts(dds.norm)+epsilon), col=phenoTable$color, cex.axis=0.7,
las=1, xlab="log2(counts+1)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=phenoTable$color, cex.axis=0.7,
las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts")
plotDensity(log2(counts(dds.norm)+epsilon), col=phenoTable$color,
xlab="log2(counts+1)", cex.lab=0.7, panel.first=grid())
plotDensity(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=phenoTable$color,
xlab="log2(normalized counts)", cex.lab=0.7, panel.first=grid())
# Restore default parameters
par(mfrow=c(1,1), cex.lab=1)
Let us imagine that we produced many RNA-Seq experiments from the same samples (technical replicates). For each gene \(g\), the measured read counts would be expected to vary rather slightly around the expected mean, and would probably be well modeled by a Poisson distribution. However, when working with biological replicates, more variation is intrinsically expected. Indeed, the measured expression values for each gene are expected to fluctuate more strongly, due to the combination of biological and technical factors: inter-individual variation in gene regulation, sample purity, cell-synchronization issues or responses to the environment (e.g. heat shock).
The Poisson distribution has a single parameter, its expected mean \(\lambda\), and its variance equals this mean \(\lambda\). Thus, in most cases, the Poisson distribution is not expected to fit the count distribution in biological replicates very well, since we expect some over-dispersion (greater variability) due to biological noise.
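A quick simulation illustrates this difference (a sketch; the parameter values are arbitrary, and the mu/size parametrization of rnbinom is discussed below):

# Poisson: the variance equals the mean
x.pois <- rpois(100000, lambda=50)
mean(x.pois); var(x.pois) # both close to 50
# Negative binomial: the variance exceeds the mean (over-dispersion)
x.nb <- rnbinom(100000, mu=50, size=10)
mean(x.nb); var(x.nb) # mean ~50, variance ~50 + 50^2/10 = 300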
As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression calls rely on an alternative distribution: the negative binomial (note that this holds true also for other -Seq approaches, e.g. ChIP-Seq with replicates).
The negative binomial distribution is a discrete distribution that gives us the probability of observing \(x\) failures before a given number of successes \(n\) is obtained.
The negative binomial has a variance larger than its mean, and is therefore a popular way to model the typical over-dispersion of RNA-seq data (here, over-dispersion relative to the Poisson model).
First, given a Bernoulli trial with a probability \(p\) of success, the negative binomial distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. The parameters of the distribution are thus \(p\) and \(n\) (in the dnbinom function of R, \(n\) and \(p\) correspond to the arguments size and prob, respectively).
\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]
In this formula, \(p^n\) denotes the probability to observe \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to dispose \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes for the last trial to be a success).
The negative binomial distribution has expected value \(n\frac{q}{p}\), where \(q = 1 - p\) is the probability of failure at each trial.
Its variance is \(n\frac{q}{p^2}\), and is thus by construction larger than the mean (except in the trivial case where \(p=1\)). We show below some examples of this distribution.
Particular case: when \(n=1\), the negative binomial corresponds to the geometric distribution, which models the probability distribution of observing the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).
par(mfrow=c(3,2))
# Some intuition about the negative binomial parametrized using n and p.
# The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1 # target for number of successful trials
x <- 1:30
# the probability of zero failure before one success.
# i.e the probability of success
dnbinom(0, n , p)
[1] 0.167
# Probability of at most 5 failures before the first success.
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)
[1] 0.665
# The probability of at most 10 failures before one success
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)
[1] 0.865
# The probability of more than 10 failures before one success
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)
[1] 0.135
################################################################
# Explore the density for increasing target numbers of successes n.
p <- 1/6 # the probability of success
for (n in c(1,2,5, 10, 20, 100)) {
# Expected value
q <- 1-p
(ev <- n*q/p)
# Variance
(v <- n*q/p^2)
# Plot the negative binomial density
X <- 0:max(30, ceiling(2.5*ev))
dX <- dnbinom(X, n, p)
dX.pois <- dpois(X, ev)
maxY <- max(dX, dX.pois)
plot(X, dX, type="h", col="blue", lwd=2,
main=paste("Negative binomial density, n =", n),
ylab="P(x; n,p)", ylim=c(0, maxY*1.3),
xlab=paste("x = number of failures before", n, "successes"))
# Plot the Poisson density
lines(X, dX.pois, col="brown", lwd=1)
# Plot the geometric density if n=1
if (n==1) {
dX.geom <- dgeom(X, prob=p)
lines(X, dX.geom, col="darkviolet", lwd=1)
}
# Plot the expected value
abline(v=ev, col="darkgreen", lwd=2)
# Draw an arrow at +- 1stdev for negative binomial
arrows(x0=ev-sqrt(v), y0=maxY*1.1,
x1=ev+sqrt(v), y1=maxY*1.1,
col="darkblue",lwd=2, code=3, length=0.05, angle=20)
# Draw an arrow at +- 1stdev for the Poisson
arrows(x0=ev-sqrt(ev), y0=maxY*1.2,
x1=ev+sqrt(ev), y1=maxY*1.2,
col="brown",lwd=2, code=3, length=0.05, angle=20)
# Add a legend
legend("topright",
lwd=2,
col=c("blue", "darkgreen", "darkblue", "brown"),
c(paste("negbin, n =", n),
paste("mean =", ev),
paste("negbin sd =", round(digits=2, sqrt(v))),
paste("Poisson sd =", round(digits=2, sqrt(ev)))
), cex=0.7
)
}
par(mfrow=c(1,1))
The negative binomial density (blue bars on the Figure) indicates the probability to observe exactly \(x\) failures before the \(n^{th}\) success, in a Bernoulli schema with probability of success \(p\).
The negative binomial is over-dispersed relative to a Poisson distribution (brown line) with the same expected value (green vertical bar).
The second way of parametrizing the distribution is using the mean value \(m\) and the dispersion parameter \(r\) (in dnbinom function of R, \(m\) and \(r\) are denoted by arguments mu and size respectively). The variance of the distribution can then be computed as \(m + m^2/r\).
Note that \(m\) can be deduced from \(n\) and \(p\): \(m = n\frac{1-p}{p}\).
n <- 10
p <- 1/6
q <- 1-p
mu <- n*q/p
all(dnbinom(0:100, mu=mu, size=n) == dnbinom(0:100, size=n, prob=p))
[1] TRUE
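We can also check the variance formula \(m + m^2/r\) numerically (a quick sketch):

# Empirical vs theoretical variance under the mean/dispersion parametrization
m <- 50; r <- 10
x <- rnbinom(100000, mu=m, size=r)
var(x)      # empirical variance
m + m^2/r   # theoretical variance: 300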
To perform differential expression calls, DESeq2 assumes that, for each gene, the read counts are generated by a negative binomial distribution. The problem is then to estimate, for each gene, the two parameters of this distribution: mean and dispersion.
The mean will be estimated from the observed normalized counts in both conditions.
The first step is to compute a gene-wise dispersion estimate. When the number of available samples is insufficient to obtain a reliable estimate of the variance for each gene, DESeq2 applies a shrinkage strategy, which assumes that genes with similar expression levels (counts) have similar variance (note that this is a strong assumption). DESeq2 regresses the gene-wise dispersion estimates onto the means of the normalized counts to obtain a dispersion estimate that is subsequently used to build the negative binomial model for each gene.
# Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)
# A diagnostic plot which
# shows the mean of normalized counts (x axis)
# and the dispersion estimate for each gene (y axis)
plotDispEsts(dds.disp)
Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest function can be used to test for differential expression. The output is a data.frame-like object (DESeqResults) which contains nominal p-values, as well as adjusted p-values (FDR, computed with the Benjamini-Hochberg procedure).
alpha <- 0.0001
waldTestResult <- nbinomWaldTest(dds.disp)
resultDESeq2 <- results(waldTestResult, alpha=alpha, pAdjustMethod="BH")
# What is the object returned by results()?
class(resultDESeq2)
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
is(resultDESeq2) # a data.frame
[1] "DESeqResults" "DataFrame" "DataTable" "SimpleList"
[5] "DataTableORNULL" "List" "Vector" "Annotated"
slotNames(resultDESeq2)
[1] "rownames" "nrows" "listData" "elementType"
[5] "elementMetadata" "metadata"
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 18.34 -0.1453 0.291 -0.499 0.61806723393
21s_rrna 107.33 0.0843 0.252 0.335 0.73780434012
hra1 2.53 0.7432 0.208 3.569 0.00035825052
icr1 141.57 -0.2149 0.037 -5.816 0.00000000601
lsr1 207.53 0.1322 0.153 0.864 0.38740549424
nme1 26.48 -0.0823 0.131 -0.630 0.52852462875
padj
<numeric>
15s_rrna 0.6738989809
21s_rrna 0.7826231668
hra1 0.0006499904
icr1 0.0000000163
lsr1 0.4497721567
nme1 0.5894985211
# The column names of the data.frame
# Note the column padj
# contains FDR values (computed Benjamini–Hochberg procedure)
colnames(resultDESeq2)
[1] "baseMean" "log2FoldChange" "lfcSE" "stat"
[5] "pvalue" "padj"
# Order the table by increasing adjusted p-value
resultDESeq2 <- resultDESeq2[order(resultDESeq2$padj),]
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf
Wald test p-value: strain WT vs Snf
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
yar009c 1283 2.43 0.0464 52.3 0 0
yar071w 1674 4.24 0.0472 89.7 0 0
ybl005w 542 1.24 0.0261 47.7 0 0
ybl005w-b 1449 2.17 0.0389 55.7 0 0
ybl100w-b 454 1.23 0.0289 42.5 0 0
ybr012w-b 1199 2.01 0.0414 48.5 0 0
# Draw an histogram of the p-values
h1 <- hist(resultDESeq2$pvalue, breaks=20, col="grey", main="DESeq2 p-value distribution", xlab="Nominal P-value", ylab="Number of genes")
# Plot a horizontal line showing the estimate of m0 (the number of true null
# hypotheses) according to the Storey and Tibshirani (2003) method
m <- sum(h1$counts) # Number of genes with a p-value
m0 <- 2*(sum(as.vector(na.omit(resultDESeq2$pvalue)) > 0.5)) # Estimated number of genes under H0
m1 <- m - m0 # Estimated number of genes under H1
nbins <- length(h1$counts)
abline(h=m0/nbins, col="blue") # expected number of genes per bin under H0
For each gene \(i\), we tested the hypothesis of equality between the mean expression levels of the Snf2 (\(\mu_{i,Snf2}\)) and wild-type (\(\mu_{i,WT}\)) strains.
The null hypothesis is that the Snf2 and WT samples were drawn from populations having the same mean for the considered gene.
\[H_0: \mu_{i,Snf2} = \mu_{i,WT}\]
We computed the mean and dispersion of the read counts in \(n=48\) replicates per condition. We measure the difference between the means per condition for these samples, and we need to estimate its significance relative to the dispersion. Indeed, the same change in expression (e.g. a fold-change of 1.5) can be significant if we have a large number of replicates and if the counts per gene are highly reproducible between samples of the same condition (i.e. if their dispersion is small). However, the same fold-change might result from sampling effects if it is observed for a gene with high fluctuations between replicates, or if our number of replicates is too small to achieve a reliable estimation of the mean and standard deviation.
The goal of the test of hypothesis is precisely to estimate the significance of the observed effect (the difference between log2 of the normalized counts, which corresponds to the log2 fold change), whilst taking into account the dispersion of the measurements across replicates.
The p-value indicates the probability to obtain, under the null hypothesis, an effect at least as extreme as the measured effect. This p-value can be interpreted as a False Positive Risk (FPR), i.e. the risk to declare a gene significant whereas it is expressed at the same level in the Snf2 and WT strains.
Here we analyzed 6 887 genes, and tested the null hypothesis for each of them.
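Since each individual test carries a false positive risk \(\alpha\), we can compute how many genes would be declared positive by chance alone if all genes were under the null hypothesis; this is what motivates the multiple testing correction.

# Expected number of false positives with alpha = 0.01 if all 6887 genes were under H0
0.01 * 6887 # ~ 69 genes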
alpha <- 0.01 # Threshold on the p-value
# par(mfrow=c(1,2))
# Compute the significance (-log10 of the adjusted p-value); infinite values
# (p-values of 0, below machine precision) are capped at 350 below
resultDESeq2$sig <- -log10(resultDESeq2$padj)
sum(is.infinite(resultDESeq2$sig))
[1] 91
resultDESeq2[is.infinite(resultDESeq2$sig),"sig"] <- 350
# View(resultDESeq2[is.na(resultDESeq2$pvalue),])
# Select genes with a defined p-value (DESeq2 assigns NA to some genes)
genes.to.plot <- !is.na(resultDESeq2$pvalue)
# sum(genes.to.plot)
range(resultDESeq2[genes.to.plot, "log2FoldChange"])
[1] -3.62 7.54
# View(resultDESeq2[genes.to.plot,])
## Volcano plot of adjusted p-values
cols <- densCols(resultDESeq2$log2FoldChange, resultDESeq2$sig)
cols[resultDESeq2$pvalue ==0] <- "purple"
resultDESeq2$pch <- 19
resultDESeq2$pch[resultDESeq2$pvalue ==0] <- 6
plot(resultDESeq2$log2FoldChange,
resultDESeq2$sig,
col=cols, panel.first=grid(),
main="Volcano plot",
xlab="Effect size: log2(fold-change)",
ylab="-log10(adjusted p-value)",
pch=resultDESeq2$pch, cex=0.4)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")
## Plot the names of a reasonable number of genes, by selecting those that are not only significant but also have a strong effect size
gn.selected <- abs(resultDESeq2$log2FoldChange) > 2 & resultDESeq2$padj < alpha
text(resultDESeq2$log2FoldChange[gn.selected],
-log10(resultDESeq2$padj)[gn.selected],
lab=rownames(resultDESeq2)[gn.selected ], cex=0.6)
# par(mfrow=c(1,1))
The volcano plot makes it possible to simultaneously capture the effect size (abscissa) and significance (ordinate) of each tested gene.
For RNA-seq transcriptome data, the effect size is estimated by the log2 fold-change (\(log2FC\)), i.e. the logarithm in base 2 of the ratio between normalized counts in the test and control conditions.
Positive \(X\) values denote genes over-expressed in the test relative to the control.
Negative values denote genes under-expressed in the test relative to the control.
| Effect size (\(log2FC\)) | Fold-change | Interpretation |
|---|---|---|
| 4 | \(FC=2^4 = 16\) | this gene has a 16 times higher expression level in test than in control conditions |
| -4 | \(FC=2^{-4} = 0.0625\) | expression level 16 times lower in test than in control |
| 1 | \(FC=2^1 = 2\) | expression level 2 times higher in test than in control |
| 0 | \(FC=2^0 = 1\) | same expression level in test and control |
The significance is the minus log10 of the adjusted p-value.
It may be important to check the validity of our analysis by simply assessing the expression level of the most highly differential gene.
selectedGenes <- c(
"Most significant" = rownames(resultDESeq2)[which.max(resultDESeq2$sig)])
# top.logFC = rownames(resultDESeq2)[which.max(resultDESeq2$log2FoldChange)])
#gn.most.sign <- rownames(resultDESeq2)[1]
#gn.most.diff.val <- counts(dds.norm, normalized=T)[gn.most.sign,]
## Select a gene with small fold change but high significance
sel1 <- subset(
na.omit(resultDESeq2),
sig >= 50 & log2FoldChange > 0 & log2FoldChange < 0.5)
# dim(sel1)
selectedGenes <- append(selectedGenes,
c("Small FC yet significant"=rownames(sel1)[1]))
## Select the non-significant gene with the highest fold change
sel2 <- subset(x = na.omit(resultDESeq2), padj > alpha & log2FoldChange > 0 & baseMean > 1000 & baseMean < 10000)
# dim(sel2)
sel2 <- sel2[order(sel2$log2FoldChange, decreasing = TRUE),][1,]
selectedGenes <- append(
selectedGenes,
c("Non-significant"=rownames(sel2)[1]))
par(mfrow=c(length(selectedGenes),1))
for (g in selectedGenes) {
barplot(counts(dds.norm, normalized=TRUE)[g,],
col=phenoTable$color,
main=g, las=2, cex.names=0.5)
}
par(mfrow=c(1,1))
One popular diagram in DNA chip (and RNA-seq) analysis is the M versus A plot (MA plot) between two conditions \(a\) and \(b\). In this representation, the ordinate M is the log ratio, \(M = log_2(a) - log_2(b)\) (i.e. the log2 fold change), and the abscissa A is the average log signal, \(A = \frac{1}{2}\left(log_2(a) + log_2(b)\right)\).
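As an illustration, M and A can be computed by hand from the normalized counts of two samples (a sketch; the two sample names are arbitrary examples):

# Manual MA computation between two samples
a <- counts(dds.norm, normalized=TRUE)[, "WT1"]
b <- counts(dds.norm, normalized=TRUE)[, "Snf1"]
M <- log2(a + epsilon) - log2(b + epsilon)       # log2 fold change
A <- (log2(a + epsilon) + log2(b + epsilon)) / 2 # average log2 signal
plot(A, M, pch=".", panel.first=grid(), main="Manual MA plot (WT1 vs Snf1)")
abline(h=0, col="brown")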
# Draw a MA plot.
# Genes with adjusted p-values below 1% are shown
plotMA(resultDESeq2, colNonSig = "blue", main="MA plot")
abline(h=c(-1:1), col="red")
To check that the selected genes distinguish well between the "treated" and "untreated" conditions, we will perform a hierarchical clustering using the heatmap.2 function from the gplots library.
# We select gene names based on FDR (1%)
gene.kept <- rownames(resultDESeq2)[resultDESeq2$padj <= alpha & !is.na(resultDESeq2$padj)]
# We retrieve the log2-transformed counts for the genes of interest
countTable.kept <- log2(countTable + epsilon)[gene.kept, ]
dim(countTable.kept)
[1] 4374 96
# Install the gplots library if needed then load it
if(!require("gplots")){
install.packages("gplots")
}
library("gplots")
# Perform the hierarchical clustering with
# A distance based on Pearson-correlation coefficient
# and average linkage clustering as agglomeration criteria
heatmap.2(as.matrix(countTable.kept),
          scale="row",
          hclustfun=function(x) hclust(x, method="average"),
          distfun=function(x) as.dist((1-cor(t(x)))/2),
          trace="none",
          density.info="none",
          labRow="",
          cexCol=0.7)
We will now perform functional enrichment using the list of induced genes. This step will be performed using the gProfileR R library.
library(gProfileR)
resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
induced.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange >= 2 & resultDESeq2.df$padj < alpha]
# head(induced.sign)
# names(term.induced)
term.induced <- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
# term.induced$p.value
kable(term.induced[1:10,c("term.name",
"term.size",
"query.size",
"overlap.size",
"recall",
"precision",
"p.value",
"intersection")],
format.args=list(digits=3), caption="**Table: functional analysis with gProfileR.**")
And now using the list of repressed genes.
resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
repressed.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange <= -2 & resultDESeq2.df$padj < alpha]
head(repressed.sign)
term.repressed <- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]
kable(head(term.repressed[,c("p.value", "term.name","intersection")], 10))
Using a loop, randomly select, 10 times for each size, subsets of 2, 5, 10, 15, ..., 45 samples from the WT and from the Snf2 KO groups. Perform the differential expression calls for each subset and draw a diagram showing the number of differentially expressed genes as a function of the number of replicates.
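A possible skeleton for this exercise (a sketch only: the subset sizes, object names and final diagram are our own choices):

## Impact of the number of replicates on the number of DE genes
nbDiff <- data.frame()
for (n in c(2, 5, 10, 15, 20)) {    # extend up to 45 if computing time allows
  for (i in 1:10) {                 # 10 random draws per subset size
    sel <- c(sample(1:48, size=n), sample(49:96, size=n))
    dds.sub <- DESeqDataSetFromMatrix(countData=countTable[, sel],
                                      colData=phenoTable[sel, ],
                                      design=~strain)
    dds.sub <- nbinomWaldTest(estimateDispersions(estimateSizeFactors(dds.sub)))
    res.sub <- results(dds.sub, alpha=alpha, pAdjustMethod="BH")
    nbDiff <- rbind(nbDiff, data.frame(
      n=n, i=i, signif=sum(res.sub$padj < alpha, na.rm=TRUE)))
  }
}
# Diagram: distribution of the number of significant genes per subset size
boxplot(signif ~ n, data=nbDiff,
        xlab="Number of replicates per genotype",
        ylab="Number of genes declared significant")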
## Local paths: create a directory to store the results
resultDir <- "~/jgb71e_practicals/rnaseq_snf2_Schurch_2015/results"
dir.create(resultDir, showWarnings = FALSE, recursive = TRUE)
setwd(resultDir)
# Export the table with statistics per sample.
write.table(statsPerSample, file=file.path(resultDir, "stats_per_sample.tsv"),
quote=FALSE, sep="\t", col.names =NA, row.names = TRUE)
# Export the DESeq2 result table
DESeq2Table <- file.path(resultDir, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
write.table(resultDESeq2, file=DESeq2Table, col.names = NA, row.names = TRUE, sep="\t", quote = FALSE)