- Draw a barplot showing the number of reads in each sample. Use either the `colSums()` or the `apply()` function (run `help(colSums()` if you don't know this function).
- What can you say from this diagram?

View solution|
Hide solution
# Descriptive statistics
## Basic statistics
Before going further in the analysis, we will compute some descriptive statistics on the dataset.
At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.
```{r stats_per_sample}
## Dimensions
nb.samples <- ncol(count.table)
print(nb.samples)
nb_genes <- nrow(count.table)
print(nb_genes)
dim(count.table)
## Min, Max, median (...).
## Here on the first 4 samples
head(summary(count.table[,1:4]))
## A magic trick to convert column-wise summaries into a data.frame.
## The do.call() function produces a data frame with one col per sample,
## we transpose it to obtain one row per sample and one column per statistics.
stats.per.sample <- data.frame(t(do.call(cbind, lapply(count.table, summary))))
head(stats.per.sample)
## We can now add some columns to the stats per sample
stats.per.sample$libsum <- apply(count.table, 2, sum) ## libsum
# Add some percentiles
stats.per.sample$perc05 <- apply(count.table, 2, quantile, 0.05)
stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10)
stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90)
stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95)
stats.per.sample$zeros <- apply(count.table==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table)
# View(stats.per.sample)
kable(stats.per.sample[sample(1:ncol(count.table), size = 10),],
caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ")
```
## Distributions
### Histograms of counts per gene
The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.
```{r data_distrib_histogram, fig.path="figures/schurch2016_", fig.width=7, fig.height=9, fig.cap="Histogram of counts per genes. **Top: raw counts. ** the scale is determined by the gene with the highest count, which is apparently an outlier. **Middle: ** raw counts, with X axis truncated to 2000 in order to display a representative range despite outliers. **Bottom: ** log2-transformed counts (bottom) per gene, with a pseudocount of 1 to avoid minus infinitevalues resulting from zero counts. "}
par(mfrow=c(3,1))
hist(as.matrix(count.table), col="blue", border="white", breaks=100)
hist(as.matrix(count.table), col="blue", border="white",
breaks=20000, xlim=c(0,2000), main="Counts per gene",
xlab="Counts (truncated axis)", ylab="Number of genes",
las=1, cex.axis=0.7)
epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white",
main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
las=1, cex.axis=0.7)
par(mfrow=c(1,1))
```
### Interpretation
- The top histogram is not very informative so far, apparently due to the presence of a few very high count values, that impose a very large scale on the $X$ axis.
- The middle histogram shows the representative range. Note the height of the first bin, which includes the zero counts.
- The logarithmic transformation (bottom histogram) improves the readability. Note that we added a pseudo-count of `r epsilon` to avoid problems with the log transformation of zero counts (which gives $-\infty$).
### Boxplots of gene count distributions per sample
To get better insights into the distribution per sample, *boxplots* offer a good perspective.
```{r log2_counts_boxplots, fig.path="figures/schurch2016_", fig.width=7, fig.height=10, fig.cap="Box plots of non-normalized log2(counts) per sample. "}
## Boxplots
boxplot(log2(count.table + epsilon), col=expDesign$color, pch=".",
horizontal=TRUE, cex.axis=0.5,
las=1, ylab="Samples", xlab="log2(Counts +1)")
```
### Density plots
Another way to get an intuition of the value distributions is to use the *geom_density()* function (ggplot2 library), which draws one density curve for each sample.
```{r log2_counts_densities_per_sample, fig.path="figures/schurch2016_", fig.width=7, fig.height=5, fig.cap="Densities of log2(counts). Each curve corresponds to one sample. "}
## Density
## We will require some functions from the reshape2 and ggplot2 packages
if(!require("reshape2")){
install.packages("reshape2")
}
if(!require("ggplot2")){
install.packages("ggplot2")
}
library(ggplot2)
library(reshape2)
count_melt <- reshape2::melt(log2(count.table + epsilon))
head(count_melt)
ggplot(data=count_melt, mapping=aes(x=value, color=variable)) + geom_density()
```
**Beware**: the R function *geom_density()* does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading.
## Scatter plots
```{r scatter_plot_some_samples, fig.path="figures/schurch2016_", fig.width=8, fig.height=8, fig.cap="**Scatter plot of log2-counts for a random selection of samples. **"}
nb.pairs <- 6
## Define a function to draw a scatter plot for a pair of variables (samples) with density colors
plotFun <- function(x,y){
dns <- densCols(x,y);
points(x,y, col=dns, pch=".", panel.first=grid());
# abline(a=0, b=1, col="brown")
}
## Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results. Should generally not be used, except for the sake of demonstration with a particular selection.
pairs(log2(count.table[,sample(ncol(count.table), nb.pairs)] + epsilon),
panel=plotFun, lower.panel = NULL)
```
Let's have a look at the scatter plots using the *pairs()* function. We will only represent `r nb.pairs` randomly selected samples.
The command *pairs()* draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along te diagonal, with a relatively wider dispersion at the bottom, corresponding to small number fluctuations.
In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) discarding from the diagonal.
# Eliminating undetected genes
All genes from genome the *S. cerevisiae* where quantified. However only a fraction of them where expressed and some of them where to weakly expressed to be detected in any of the sample. As a result the count table may contain rows with only zero values (null counts).
- What is the percentage of gene having null counts per sample. Draw a barplot.
- Some genes were not detected in any of the sample. Count their number, and delete them from the **count.table** data.frame.

View solution|
Hide solution