This dataset is available from the Pasilla Bioconductor library and is derived from the work from Brooks et al. (Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research, 2010).

Alternative splicing is generally controlled by proteins that bind directly to regulatory sequence elements and either activate or repress splicing of adjacent splice sites in a target pre-mRNA. Here, the authors have combined RNAi and mRNA-seq to identify exons that are regulated by Pasilla (PS), the Drosophila melanogaster ortholog of the mammalian RNA-binding proteins NOVA1 and NOVA2.

To get the dataset, you need to install the pasilla library from Bioconductor then load this library.

```
## Install the library if needed then load it
if(!require("pasilla")){
source("http://bioconductor.org/biocLite.R")
biocLite("pasilla")
}
```

`## Loading required package: pasilla`

`library("pasilla")`

To get the path to the tabulated file containing count table use the command below. Here, the **system.file()** function is simply used to get the path to the directory containing the count table. The “pasilla_gene_counts.tsv” file contains counts for each gene (row) in each sample (column).

`datafile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )`

Load the file using the read.table function.

```
## Read the data table
count.table<- read.table( datafile, header=TRUE, row.names=1, quote="", comment.char="" )
head(count.table)
```

```
## untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0000003 0 0 0 0 0 0
## FBgn0000008 92 161 76 70 140 88
## FBgn0000014 5 1 0 0 4 0
## FBgn0000015 0 2 1 2 1 0
## FBgn0000017 4664 8714 3564 3150 6205 3072
## FBgn0000018 583 761 245 310 722 299
## treated3
## FBgn0000003 1
## FBgn0000008 70
## FBgn0000014 0
## FBgn0000015 0
## FBgn0000017 3334
## FBgn0000018 308
```

Some genes were not detected in any of the sample. Delete them from the **count.table** data.frame.

The dataset contains RNA-Seq count data for RNAi treated or S2-DRSC untreated cells (late embryonic stage). Some results were obtained through single-end sequencing whereas others were obtained using paired-end sequencing. We will store these informations in two vectors (cond.type and lib.type).

```
cond.type <- c( "untreated", "untreated", "untreated","untreated", "treated", "treated", "treated" )
lib.type <- c( "single-end", "single-end", "paired-end", "paired-end", "single-end", "paired-end", "paired-end" )
```

Next, we will extract a subset of the data containing only paired-end samples.

```
## Select only Paired-end datasets
isPaired <- lib.type == "paired-end"
show(isPaired)
```

`## [1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE`

```
count.table <- count.table[ , isPaired ] ## Select only the paired samples
head(count.table)
```

```
## untreated3 untreated4 treated2 treated3
## FBgn0000003 0 0 0 1
## FBgn0000008 76 70 88 70
## FBgn0000014 0 0 0 0
## FBgn0000015 1 2 0 0
## FBgn0000017 3564 3150 3072 3334
## FBgn0000018 245 310 299 308
```

```
cond.type <- cond.type[isPaired]
show(cond.type)
```

`## [1] "untreated" "untreated" "treated" "treated"`

Before going further in the analysis, we will compute some descriptive statistics on the dataset.

- What are the dimensions of the
**count.table object**. - Use the
**summary**fonction with**count.table**as argument. What kind of information are displayed ? - Draw the distribution of the
**count.table**data. Is that really informative ? Why ? - Draw the distribution of the
**count.table**data in logarithme base 2 (add a pseudo-count to avoid logarithmic transformation of zero values). - Draw the boxplot of each column of
**count.table**using the**boxplot()**function. - Use the
**plotDensity()**from the**affy**library (BioC) to plot density estimates of each column of the**count.table**data.frame. - Use the
**pairs()**to produce a matrix of scatterplot from the**count.table**object.

Now that we have all the required material, we will create a **CountDataSet** object (named **cds**) that will be used by DESeq to perform differential expression call. The CountDataSet has some important useful accessor methods (**counts**, **conditions**, **estimateSizeFactors**, **sizeFactors**, **estimateDispersions** and **nbinomTest**) that will be used later in this tutorial.

- Install and load the DESeq library (it should be installed from BioC).
- Create an object of class
**CountDataSet**using the**newCountDataSet()**function and get some help about this object.

The normalization procedure (RLE) is implemented through the **estimateSizeFactors** function.

**From DESeq help files:** Given a matrix or data frame of count data, this function estimates the size factors as follows: Each column is divided by the **geometric means** of the rows. The **median** (or, if requested, another location estimator) **of these ratios** (skipping the genes with a geometric mean of zero) is used as the size factor for this column.

- Create a new object
**cds.norm**that will contain normalized data. The method associated with normalization for the CountDataSet class is**estimateSizeFactors()**. - Implement a function that will compute the size factors for each sample.
- Now, check that the results obtained with your function are the same as those produced by DESeq (you can get the size factors from a CountDataSet object using the
**sizeFactors()**function. - Check the distribution before and after normalization using the
**boxplot()**and**plotDensity()**functions. Do you see some differences. Does the count look more balanced across samples ?

In this section we will search for genes whose expression is affected by the si-RNA treatment.

Let say that we would produce a lot of RNA-Seq experiments from the same samples (technical replicates). For each gene \(g\) the measured read counts would be expected to vary rather slighlty around the expected mean and would be probably well-modeled using a poisson distribution. However, when working with biological replicates more variations are intrinsically expected. Indeed, due to sample purity, cell-synchronization issues or reponses to environment (e.g. heat-shock) the measured expression values for each genes are expected to fluctuate more importantly. The poisson distribution has only one parameter \(\lambda\) and the mean and variance of the distribution are both equal to \(\lambda\). Thus in most cases, the poisson distribution is not expected to fit very well with the count distribution since some over-dispersion (greater variability) due to biological noise is expected. As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression call rely on the negative binomial distribution (note that this hold true also for other -Seq approaches, e.g. ChIP-Seq with replicates).

The negative binomial distribution is a discrete distribution that can be used to model over-dispersed data (in this case this overdispersion is relative to the poisson model). There are two ways to parametrize the negative binomial distribution. The negative binomial distribution is a discrete distribution that can be used to model over-dispersed data (in this case this overdispersion is relative to the poisson model). There are two ways to parametrize the negative binomial distribution.

First, given a Bernouilli trial with a probability \(p\) of success, the **negative binomial** distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. In this case the parameters of the distribution will thus be \(p\), \(n\) (in **dnbinom()** function of R, \(n\) and \(p\) are denoted by arguments **size** and **prob** respectively).

\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]

In this formula, \(p^n\) denotes the probability to observe \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to dispose \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes for the last trial to be a success).

The negative binomial distribution has expected value \(n\frac{q}{p}\) and variance \(n\frac{q}{p^2}\). Some examples of using this distribution in R are provided below.

**Particular case**: when \(n=1\) the negative binomial corresponds to the the **geometric distribution**, which models the probability distribution to observe the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).

```
par(mfrow=c(1,1))
## Some intuition about the negative binomial parametrized using n and p.
## The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1 # target for number of successful trials
# The density function
plot(0:10, dnbinom(0:10, n, p), type="h", col="blue", lwd=2)
```