Jacques van Helden
First version: 2016-12-10; Last update: 2017-02-22
The advent of Next Generation Sequencing (NGS) technologies revived the importance of discrete distributions of probabilities for biologists.
This tutorial aims at providing a rapid overview of some discrete distributions commonly used to analyse NGS data, and highlight the relationship between them.
Distribution | Applications |
---|---|
Geometric | Local read mapping without mismatch (read extension until first mismatch) |
Binomial | Global read mapping with a given number of mismatches |
Negative binomial | Local read mapping with \(m\) mismatches (waiting time for \((m+1)^{th}\) mismatch); Detection of differentially expressed genes from RNA-seq data |
Poisson | ChIP-seq peak calling |
Hypergeometric | Enrichment of a set of differentially expressed genes for functional classes |
The Poisson is a very simple and widely used discrete distribution.
\[P(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}\]
expected variance: \(\sigma^2 = \lambda\)
More info: read the help for the Poisson distribution: help(Poisson)
lambda <- 3
rep <- 1000
x <- rpois(n=rep, lambda=lambda)
mean(x)
[1] 3.05
var(x)
[1] 3.13
runif()
and replicate()
make 1000 experiments consisting of the following steps:
plot the relationship between mean and variance for the Poisson distribution
# ?replicate
## Example of usage of the replicate function
sampling.means <- replicate(n = 10000, mean(rpois(n=10, lambda=3.5)))
hist(sampling.means, breaks=100)
# This function returns the mean and variance of a
# random sample drawn from a Poisson distribution
#
# Parameters:
# n sample size (number of elements to draw)
# lambda expectation of the Poisson
rpois.mean.and.var <- function(n, lambda) {
x <- rpois(n, lambda)
return(data.frame(mean=mean(x), var=var(x), lambda=lambda, n=n))
}
rpois.mean.and.var(n=10, lambda=3.5)
mean var lambda n
1 2.9 2.32 3.5 10
## Generate a data frame with random Poisson sampling with increasing values of lambda
poisson.stats <- data.frame()
for (i in 1:1000) {
poisson.stats <- rbind(poisson.stats, rpois.mean.and.var(n=1000, lambda=i))
}
## Compute the coefficient of variation
poisson.stats$V <- poisson.stats$mean/poisson.stats$var
head(poisson.stats)
mean var lambda n V
1 1.04 1.06 1 1000 0.986
2 2.02 2.00 2 1000 1.007
3 3.06 2.84 3 1000 1.079
4 4.04 4.16 4 1000 0.970
5 5.02 5.01 5 1000 1.003
6 6.12 5.80 6 1000 1.055
print(summary(poisson.stats))
mean var lambda n
Min. : 1 Min. : 1 Min. : 1 Min. :1000
1st Qu.: 250 1st Qu.: 253 1st Qu.: 251 1st Qu.:1000
Median : 501 Median : 502 Median : 500 Median :1000
Mean : 501 Mean : 501 Mean : 500 Mean :1000
3rd Qu.: 750 3rd Qu.: 756 3rd Qu.: 750 3rd Qu.:1000
Max. :1000 Max. :1090 Max. :1000 Max. :1000
V
Min. :0.883
1st Qu.:0.967
Median :1.000
Mean :1.000
3rd Qu.:1.031
Max. :1.155
## Summmarize the results with various plots
par(mfrow=c(2,2))
par(mar=c(5,4,1,1))
plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var), panel.first = grid())
abline(a=0, b=1, col="brown", lwd=2)
plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var),
panel.first = grid(), log="xy")
abline(a=0, b=1, lwd=2)
plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V),
panel.first = grid())
abline(h=1, col="brown", lwd=2)
# plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V),
# panel.first = grid(), log="xy")
# abline(h=1, col="brown", lwd=2)
hist(poisson.stats$V, breaks=100)