# Discrete distributions for the analysis of Next Generation Sequencing (NGS) data

First version: 2016-12-10; Last update: 2017-02-22

# Discrete probabilities and NGS

The advent of Next Generation Sequencing (NGS) technologies revived the importance of discrete distributions of probabilities for biologists.

This tutorial aims at providing a rapid overview of some discrete distributions commonly used to analyse NGS data, and highlight the relationship between them.

# Overview

Distribution Applications
Binomial Global read mapping with a given number of mismatches
Negative binomial Local read mapping with $$m$$ mismatches (waiting time for $$(m+1)^{th}$$ mismatch);
Detection of differentially expressed genes from RNA-seq data
Poisson ChIP-seq peak calling
Hypergeometric Enrichment of a set of differentially expressed genes for functional classes

# The Poisson distribution

The Poisson is a very simple and widely used discrete distribution.

$P(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}$

• represents the probability to observe $$x$$ successes when expecting $$\lambda$$ (say “lambda”).
• expected mean (for a sample of infinite size): $$\mu = \lambda$$
• expected variance: $$\sigma^2 = \lambda$$

# Exercise – Poisson distribution

• open collective result table
• login with the email on which you were invited
• each student has been assigned a $$\lambda$$ comprized between 0.01 and 1000
• draw $$rep=1000$$ random numbers following a Poisson with this $$\lambda$$ value
• compute the mean and variance
• fill up the corresponding columns in the collective report

# Solution – mean and variance of a Poisson random sampling

lambda <- 3
rep <- 1000
x <- rpois(n=rep, lambda=lambda)
mean(x)
[1] 3.05
var(x)
[1] 3.13

# Replicating an experiment

• read the help for runif() and replicate()
• make 1000 experiments consisting of the following steps:

• select at random a $$\lambda$$ value between $$0.5$$ and $$1000$$
• draw $$n=10$$ random numbers following a Poisson with this $$\lambda$$
• compute the mean and variance
• plot the relationship between mean and variance for the Poisson distribution

# Solution – mean to variance relationship for the Poisson distribution

# ?replicate
## Example of usage of the replicate function
sampling.means <- replicate(n = 10000, mean(rpois(n=10, lambda=3.5)))

hist(sampling.means, breaks=100)

# This function returns the mean and variance of a
# random sample drawn from a Poisson distribution
#
# Parameters:
#   n sample size (number of elements to draw)
#   lambda expectation of the Poisson
rpois.mean.and.var <- function(n, lambda) {
x <- rpois(n, lambda)
return(data.frame(mean=mean(x), var=var(x), lambda=lambda, n=n))
}

rpois.mean.and.var(n=10, lambda=3.5)
mean  var lambda  n
1  2.9 2.32    3.5 10
## Generate a data frame with random Poisson sampling with increasing values of lambda
poisson.stats <- data.frame()
for (i in 1:1000) {
poisson.stats <- rbind(poisson.stats, rpois.mean.and.var(n=1000, lambda=i))
}

## Compute the coefficient of variation
poisson.stats$V <- poisson.stats$mean/poisson.stats$var head(poisson.stats) mean var lambda n V 1 1.04 1.06 1 1000 0.986 2 2.02 2.00 2 1000 1.007 3 3.06 2.84 3 1000 1.079 4 4.04 4.16 4 1000 0.970 5 5.02 5.01 5 1000 1.003 6 6.12 5.80 6 1000 1.055 print(summary(poisson.stats)) mean var lambda n Min. : 1 Min. : 1 Min. : 1 Min. :1000 1st Qu.: 250 1st Qu.: 253 1st Qu.: 251 1st Qu.:1000 Median : 501 Median : 502 Median : 500 Median :1000 Mean : 501 Mean : 501 Mean : 500 Mean :1000 3rd Qu.: 750 3rd Qu.: 756 3rd Qu.: 750 3rd Qu.:1000 Max. :1000 Max. :1090 Max. :1000 Max. :1000 V Min. :0.883 1st Qu.:0.967 Median :1.000 Mean :1.000 3rd Qu.:1.031 Max. :1.155 ## Summmarize the results with various plots par(mfrow=c(2,2)) par(mar=c(5,4,1,1)) plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var), panel.first = grid()) abline(a=0, b=1, col="brown", lwd=2) plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var), panel.first = grid(), log="xy") abline(a=0, b=1, lwd=2) plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V), panel.first = grid()) abline(h=1, col="brown", lwd=2) # plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V), # panel.first = grid(), log="xy") # abline(h=1, col="brown", lwd=2) hist(poisson.stats$V, breaks=100)