The RNA-Seq dataset we will use in this practical has been produced by Gierliński et al ([@pmid26206307, @pmid27022035]). The dataset is composed of 48 WT yeast samples vs 48 Snf2 knock-out mutant cell line. The prepared RNA-Seq libraries (unstranded) were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000 resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing 7126 rows/genes. The primary objective of this study was to check whether the observed gene read counts distribution where consistent with theorical models (e.g. negative binomial). More information can be obtained in the original paper (pdf)
R enables to download data directly from the Web. The expression matrix and phenotypic information will be loaded into R using the read.table function. Both table will be converted into a data.frame object when loaded into R. The ‘count.table’ object will contains counts for each gene (row) and each sample (column).
# Download data files from the Web site (only if not done yet)
<- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/"
url.counts
## Local paths: create a directory to store the results
<- ("~/ASG/practicals/rnaseq_snf2_Schurch_2015")
dir.snf2 <- file.path(dir.snf2, "data")
dir.counts <- file.path(dir.counts, "counts.txt")
file.counts <- file.path(dir.counts, "expDesign.txt")
file.expDesign
## Create a directory to download the dataset if it does not exist
dir.create(dir.counts, showWarnings = FALSE, recursive = TRUE)
## Download the data files if required
if (!file.exists(file.counts)) {
message("Downloading count table from ", url.counts)
download.file(url=file.path(url.counts, "counts.txt"), destfile = file.counts)
}if (!file.exists(file.expDesign)) {
message("Downloading design table from ", url.counts)
download.file(url=file.path(url.counts, "expDesign.txt"), destfile = file.expDesign)
}
# Load the count table
<- read.table(file=file.counts, sep="\t", header=TRUE, row.names=1)
count.table # View(count.table)
The dataset contains RNA-Seq count data for a wild type strain (WT) and a Snf2 mutant, with 48 biological replicates for each genotype.
All phenotypic informations are enclosed in a dedicated file. Note that the produced data.frame encodes the ‘strains’ columns as a factor1.
# Load experimental design file
<- read.table(file=file.expDesign, sep="\t", header=TRUE)
expDesign #View(expDesign)
# Check the first and last line of the phenotypic data
head(expDesign)
label strain
1 WT1 WT
2 WT2 WT
3 WT3 WT
4 WT4 WT
5 WT5 WT
6 WT6 WT
tail(expDesign)
label strain
91 Snf43 Snf
92 Snf44 Snf
93 Snf45 Snf
94 Snf46 Snf
95 Snf47 Snf
96 Snf48 Snf
## Count the number of sample in each class
table(expDesign$strain)
Snf WT
48 48
## Define a strain-specific color for each sample,
## and add it as a supplementary column to the phenotypic data
<- c(WT="green", Snf="orange") # Choose one color per strain
col.strain $color <- col.strain[as.vector(expDesign$strain)] expDesign
colSums()
or the apply()
function
(run help(colSums()
if you don’t know this function).Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.
## Dimensions
<- ncol(count.table)
nb.samples print(nb.samples)
[1] 96
<- nrow(count.table)
nb_genes print(nb_genes)
[1] 7126
dim(count.table)
[1] 7126 96
## Min, Max, median (...).
## Here on the first 4 samples
head(summary(count.table[,1:4]))
WT1 WT2 WT3 WT4
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 49 1st Qu.: 88 1st Qu.: 74 1st Qu.: 104
Median : 224 Median : 371 Median : 297 Median : 430
Mean : 837 Mean : 1107 Mean : 903 Mean : 1464
3rd Qu.: 561 3rd Qu.: 853 3rd Qu.: 670 3rd Qu.: 1019
Max. :188825 Max. :196804 Max. :172119 Max. :328674
## A magic trick to convert column-wise summaries into a data.frame.
## The do.call() function produces a data frame with one col per sample,
## we transpose it to obtain one row per sample and one column per statistics.
<- data.frame(t(do.call(cbind, lapply(count.table, summary))))
stats.per.sample head(stats.per.sample)
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
WT1 0 49.0 224 837 561 188825
WT2 0 88.0 371 1107 853 196804
WT3 0 74.2 297 903 670 172119
WT4 0 104.2 430 1464 1019 328674
WT5 0 84.0 353 1124 819 225435
WT6 0 153.0 667 2052 1599 357247
## We can now add some columns to the stats per sample
$libsum <- apply(count.table, 2, sum) ## libsum
stats.per.sample# Add some percentiles
$perc05 <- apply(count.table, 2, quantile, 0.05)
stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10)
stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90)
stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95)
stats.per.sample$zeros <- apply(count.table==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table)
stats.per.sample
# View(stats.per.sample)
kable(stats.per.sample[sample(1:ncol(count.table), size = 10),],
caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ")
Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | libsum | perc05 | perc10 | perc90 | perc95 | zeros | percent.zeros | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Snf4 | 0 | 137.0 | 562 | 1496 | 1268 | 204494 | 10662188 | 0 | 3.0 | 2802 | 4739 | 456 | 6.40 |
WT16 | 0 | 50.0 | 216 | 782 | 526 | 173478 | 5569418 | 0 | 1.0 | 1284 | 2388 | 621 | 8.71 |
WT12 | 0 | 76.0 | 332 | 1152 | 802 | 257336 | 8212630 | 0 | 1.0 | 1920 | 3536 | 560 | 7.86 |
WT23 | 0 | 79.0 | 345 | 1072 | 822 | 185052 | 7640884 | 0 | 2.0 | 1934 | 3416 | 535 | 7.51 |
Snf46 | 0 | 108.0 | 450 | 1236 | 1005 | 170179 | 8808358 | 0 | 2.0 | 2272 | 3984 | 499 | 7.00 |
Snf17 | 0 | 126.0 | 493 | 1429 | 1104 | 229969 | 10184815 | 0 | 2.5 | 2598 | 4460 | 474 | 6.65 |
Snf33 | 0 | 146.0 | 616 | 1653 | 1368 | 233323 | 11776986 | 0 | 3.0 | 3066 | 5280 | 453 | 6.36 |
Snf43 | 0 | 119.0 | 476 | 1343 | 1057 | 219417 | 9568585 | 0 | 2.0 | 2456 | 4182 | 482 | 6.76 |
Snf28 | 0 | 92.2 | 374 | 1078 | 850 | 180213 | 7681488 | 0 | 2.0 | 1957 | 3413 | 530 | 7.44 |
WT38 | 0 | 54.0 | 245 | 1010 | 620 | 270499 | 7197961 | 0 | 1.0 | 1682 | 3078 | 661 | 9.28 |
The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.
par(mfrow=c(3,1))
hist(as.matrix(count.table), col="blue", border="white", breaks=100)
hist(as.matrix(count.table), col="blue", border="white",
breaks=20000, xlim=c(0,2000), main="Counts per gene",
xlab="Counts (truncated axis)", ylab="Number of genes",
las=1, cex.axis=0.7)
<- 1 # pseudo-count to avoid problems with log(0)
epsilon hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white",
main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
las=1, cex.axis=0.7)