The Snf2 dataset

The RNA-Seq dataset we will use in this practical has been produced by Gierliński et al ([@pmid26206307, @pmid27022035]). The dataset is composed of 48 WT yeast samples vs 48 Snf2 knock-out mutant cell line. The prepared RNA-Seq libraries (unstranded) were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000 resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing 7126 rows/genes. The primary objective of this study was to check whether the observed gene read counts distribution where consistent with theorical models (e.g. negative binomial). More information can be obtained in the original paper (pdf)

Loading the dataset

R enables to download data directly from the Web. The expression matrix and phenotypic information will be loaded into R using the read.table function. Both table will be converted into a data.frame object when loaded into R. The ‘count.table’ object will contains counts for each gene (row) and each sample (column).

# Download data files from the Web site (only if not done yet)
url.counts <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/"

## Local paths: create a directory to store the results
dir.snf2 <- ("~/ASG/practicals/rnaseq_snf2_Schurch_2015")
dir.counts <- file.path(dir.snf2, "data")
file.counts <- file.path(dir.counts, "counts.txt")
file.expDesign <- file.path(dir.counts, "expDesign.txt")

## Create a directory to download the dataset if it does not exist
dir.create(dir.counts, showWarnings = FALSE, recursive = TRUE)

## Download the data files if required
if (!file.exists(file.counts)) {
  message("Downloading count table from ", url.counts)
  download.file(url=file.path(url.counts, "counts.txt"), destfile = file.counts)
}
if (!file.exists(file.expDesign)) {
  message("Downloading design table from ", url.counts)
  download.file(url=file.path(url.counts, "expDesign.txt"), destfile = file.expDesign)
}

# Load the count table
count.table <- read.table(file=file.counts, sep="\t", header=TRUE, row.names=1)
# View(count.table)

Phenotypic data

The dataset contains RNA-Seq count data for a wild type strain (WT) and a Snf2 mutant, with 48 biological replicates for each genotype.

All phenotypic informations are enclosed in a dedicated file. Note that the produced data.frame encodes the ‘strains’ columns as a factor1.

# Load experimental design file
expDesign <- read.table(file=file.expDesign, sep="\t", header=TRUE)
#View(expDesign)

# Check the first and last line of the phenotypic data
head(expDesign)
  label strain
1   WT1     WT
2   WT2     WT
3   WT3     WT
4   WT4     WT
5   WT5     WT
6   WT6     WT
tail(expDesign)
   label strain
91 Snf43    Snf
92 Snf44    Snf
93 Snf45    Snf
94 Snf46    Snf
95 Snf47    Snf
96 Snf48    Snf
## Count the number of sample in each class
table(expDesign$strain)

Snf  WT 
 48  48 
## Define a strain-specific color for each sample,
## and add it as a supplementary column to the phenotypic data
col.strain <- c(WT="green", Snf="orange") # Choose one color per strain
expDesign$color <- col.strain[as.vector(expDesign$strain)]
  • Draw a barplot showing the number of reads in each sample. Use either the colSums() or the apply() function (run help(colSums() if you don’t know this function).
  • What can you say from this diagram?

View solution| Hide solution

Descriptive statistics

Basic statistics

Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.

## Dimensions
nb.samples <- ncol(count.table)
print(nb.samples)
[1] 96
nb_genes <- nrow(count.table)
print(nb_genes)
[1] 7126
dim(count.table)
[1] 7126   96
## Min, Max, median (...). 
## Here on the first 4 samples
head(summary(count.table[,1:4]))
      WT1              WT2              WT3              WT4        
 Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
 1st Qu.:    49   1st Qu.:    88   1st Qu.:    74   1st Qu.:   104  
 Median :   224   Median :   371   Median :   297   Median :   430  
 Mean   :   837   Mean   :  1107   Mean   :   903   Mean   :  1464  
 3rd Qu.:   561   3rd Qu.:   853   3rd Qu.:   670   3rd Qu.:  1019  
 Max.   :188825   Max.   :196804   Max.   :172119   Max.   :328674  
## A magic trick to convert column-wise summaries into a data.frame.
## The do.call() function produces a data frame with one col per sample, 
## we transpose it to obtain one row per sample and one column per statistics.
stats.per.sample <- data.frame(t(do.call(cbind, lapply(count.table, summary))))
head(stats.per.sample)
    Min. X1st.Qu. Median Mean X3rd.Qu.   Max.
WT1    0     49.0    224  837      561 188825
WT2    0     88.0    371 1107      853 196804
WT3    0     74.2    297  903      670 172119
WT4    0    104.2    430 1464     1019 328674
WT5    0     84.0    353 1124      819 225435
WT6    0    153.0    667 2052     1599 357247
## We can now add some columns to the stats per sample
stats.per.sample$libsum <- apply(count.table, 2, sum) ## libsum
# Add some percentiles
stats.per.sample$perc05 <- apply(count.table, 2, quantile, 0.05)
stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10)
stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90)
stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95)
stats.per.sample$zeros <- apply(count.table==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table)

# View(stats.per.sample)
kable(stats.per.sample[sample(1:ncol(count.table), size = 10),],
      caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ")
Table: statistics per sample. We only display a random selection of 10 samples.
Min. X1st.Qu. Median Mean X3rd.Qu. Max. libsum perc05 perc10 perc90 perc95 zeros percent.zeros
Snf4 0 137.0 562 1496 1268 204494 10662188 0 3.0 2802 4739 456 6.40
WT16 0 50.0 216 782 526 173478 5569418 0 1.0 1284 2388 621 8.71
WT12 0 76.0 332 1152 802 257336 8212630 0 1.0 1920 3536 560 7.86
WT23 0 79.0 345 1072 822 185052 7640884 0 2.0 1934 3416 535 7.51
Snf46 0 108.0 450 1236 1005 170179 8808358 0 2.0 2272 3984 499 7.00
Snf17 0 126.0 493 1429 1104 229969 10184815 0 2.5 2598 4460 474 6.65
Snf33 0 146.0 616 1653 1368 233323 11776986 0 3.0 3066 5280 453 6.36
Snf43 0 119.0 476 1343 1057 219417 9568585 0 2.0 2456 4182 482 6.76
Snf28 0 92.2 374 1078 850 180213 7681488 0 2.0 1957 3413 530 7.44
WT38 0 54.0 245 1010 620 270499 7197961 0 1.0 1682 3078 661 9.28

Distributions

Histograms of counts per gene

The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.

par(mfrow=c(3,1))

hist(as.matrix(count.table), col="blue", border="white", breaks=100)

hist(as.matrix(count.table), col="blue", border="white",
     breaks=20000, xlim=c(0,2000), main="Counts per gene",
     xlab="Counts (truncated axis)", ylab="Number of genes", 
     las=1, cex.axis=0.7)

epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white",
     main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes", 
     las=1, cex.axis=0.7)