Discrete distributions for the analysis of Next Generation Sequencing (NGS) data

Jacques van Helden

First version: 2016-12-10; Last update: 2017-02-22

Introduction

Discrete probabilities and NGS

The advent of Next Generation Sequencing (NGS) technologies revived the importance of discrete distributions of probabilities for biologists.

This tutorial aims at providing a rapid overview of some discrete distributions commonly used to analyse NGS data, and highlight the relationship between them.

Overview

Distribution Applications
Geometric Local read mapping without mismatch (read extension until first mismatch)
Binomial Global read mapping with a given number of mismatches
Negative binomial Local read mapping with \(m\) mismatches (waiting time for \((m+1)^{th}\) mismatch);
Detection of differentially expressed genes from RNA-seq data
Poisson ChIP-seq peak calling
Hypergeometric Enrichment of a set of differentially expressed genes for functional classes

Let us experiment first

The Poisson distribution

The Poisson is a very simple and widely used discrete distribution.

\[P(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}\]

Exercise – Poisson distribution

Solution – mean and variance of a Poisson random sampling

lambda <- 3
rep <- 1000
x <- rpois(n=rep, lambda=lambda)
mean(x)
[1] 3.05
var(x)
[1] 3.13

Replicating an experiment

Solution – mean to variance relationship for the Poisson distribution

# ?replicate
## Example of usage of the replicate function
sampling.means <- replicate(n = 10000, mean(rpois(n=10, lambda=3.5)))

hist(sampling.means, breaks=100)

# This function returns the mean and variance of a 
# random sample drawn from a Poisson distribution
# 
# Parameters:
#   n sample size (number of elements to draw)
#   lambda expectation of the Poisson
rpois.mean.and.var <- function(n, lambda) {
  x <- rpois(n, lambda)
  return(data.frame(mean=mean(x), var=var(x), lambda=lambda, n=n))
}

rpois.mean.and.var(n=10, lambda=3.5)
  mean  var lambda  n
1  2.9 2.32    3.5 10
## Generate a data frame with random Poisson sampling with increasing values of lambda
poisson.stats <- data.frame()
for (i in 1:1000) {
  poisson.stats <- rbind(poisson.stats, rpois.mean.and.var(n=1000, lambda=i))  
}

## Compute the coefficient of variation
poisson.stats$V <- poisson.stats$mean/poisson.stats$var

head(poisson.stats)
  mean  var lambda    n     V
1 1.04 1.06      1 1000 0.986
2 2.02 2.00      2 1000 1.007
3 3.06 2.84      3 1000 1.079
4 4.04 4.16      4 1000 0.970
5 5.02 5.01      5 1000 1.003
6 6.12 5.80      6 1000 1.055
print(summary(poisson.stats))
      mean           var           lambda           n       
 Min.   :   1   Min.   :   1   Min.   :   1   Min.   :1000  
 1st Qu.: 250   1st Qu.: 253   1st Qu.: 251   1st Qu.:1000  
 Median : 501   Median : 502   Median : 500   Median :1000  
 Mean   : 501   Mean   : 501   Mean   : 500   Mean   :1000  
 3rd Qu.: 750   3rd Qu.: 756   3rd Qu.: 750   3rd Qu.:1000  
 Max.   :1000   Max.   :1090   Max.   :1000   Max.   :1000  
       V        
 Min.   :0.883  
 1st Qu.:0.967  
 Median :1.000  
 Mean   :1.000  
 3rd Qu.:1.031  
 Max.   :1.155  
## Summmarize the results with various plots
par(mfrow=c(2,2))
par(mar=c(5,4,1,1))
plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var), panel.first = grid())
abline(a=0, b=1, col="brown", lwd=2)
plot(poisson.stats$mean, poisson.stats$var, col=densCols(poisson.stats$mean, poisson.stats$var),
    panel.first = grid(), log="xy")
abline(a=0, b=1, lwd=2)
plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V),
     panel.first = grid())
abline(h=1, col="brown", lwd=2)

# plot(poisson.stats$mean, poisson.stats$V, col=densCols(poisson.stats$mean, poisson.stats$V), 
#      panel.first = grid(), log="xy")
# abline(h=1, col="brown", lwd=2)

hist(poisson.stats$V, breaks=100)