This dataset is available from the Pasilla Bioconductor library and is derived from the work from Brooks et al. (Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research, 2010).
Alternative splicing is generally controlled by proteins that bind directly to regulatory sequence elements and either activate or repress splicing of adjacent splice sites in a target pre-mRNA. Here, the authors have combined RNAi and mRNA-seq to identify exons that are regulated by Pasilla (PS), the Drosophila melanogaster ortholog of the mammalian RNA-binding proteins NOVA1 and NOVA2.
To get the dataset, you need to install the pasilla library from Bioconductor then load this library.
## Install the library if needed then load it
if(!require("pasilla")){
source("http://bioconductor.org/biocLite.R")
biocLite("pasilla")
}
## Loading required package: pasilla
library("pasilla")
To get the path to the tabulated file containing count table use the command below. Here, the system.file() function is simply used to get the path to the directory containing the count table. The “pasilla_gene_counts.tsv” file contains counts for each gene (row) in each sample (column).
datafile <- system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
Load the file using the read.table function.
## Read the data table
count.table<- read.table( datafile, header=TRUE, row.names=1, quote="", comment.char="" )
head(count.table)
## untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0000003 0 0 0 0 0 0
## FBgn0000008 92 161 76 70 140 88
## FBgn0000014 5 1 0 0 4 0
## FBgn0000015 0 2 1 2 1 0
## FBgn0000017 4664 8714 3564 3150 6205 3072
## FBgn0000018 583 761 245 310 722 299
## treated3
## FBgn0000003 1
## FBgn0000008 70
## FBgn0000014 0
## FBgn0000015 0
## FBgn0000017 3334
## FBgn0000018 308
As some genes were not detected in any of the samples, we will discard them before further analysis.
## Some genes were not detected at all in these samples. We will discard them.
count.table <- count.table[rowSums(count.table) > 0,]
The dataset contains RNA-Seq count data for RNAi treated or S2-DRSC untreated cells (late embryonic stage). Some results were obtained through single-end sequencing whereas others were obtained using paired-end sequencing. We will store these informations in two vectors (cond.type and lib.type).
cond.type <- c( "untreated", "untreated", "untreated","untreated", "treated", "treated", "treated" )
lib.type <- c( "single-end", "single-end", "paired-end", "paired-end", "single-end", "paired-end", "paired-end" )
Next, we will extract a subset of the data containing only paired-end samples.
## Select only Paired-end datasets
isPaired <- lib.type == "paired-end"
show(isPaired) # show is an R alternative to print
## [1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE
count.table <- count.table[ , isPaired ] ## Select only the paired samples
head(count.table)
## untreated3 untreated4 treated2 treated3
## FBgn0000003 0 0 0 1
## FBgn0000008 76 70 88 70
## FBgn0000014 0 0 0 0
## FBgn0000015 1 2 0 0
## FBgn0000017 3564 3150 3072 3334
## FBgn0000018 245 310 299 308
cond.type <- cond.type[isPaired]
Phenotypic data need to be stored in a data.frame object for further analysis.
## We need to get phenotypic data into a data.frame for
## subsequent analysis
conditions <- data.frame(cond.type=factor(cond.type))
print(conditions)
## cond.type
## 1 untreated
## 2 untreated
## 3 treated
## 4 treated
Before going further in the analysis, we will compute some descriptive statistics on the dataset.
## Dimensions
ncol(count.table)
## [1] 4
nrow(count.table)
## [1] 12359
dim(count.table)
## [1] 12359 4
## Min, Max, median...
summary(count.table)
## untreated3 untreated4 treated2 treated3
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 2
## Median : 67.0 Median : 80.0 Median : 82.0 Median : 91
## Mean : 676.3 Mean : 796.3 Mean : 774.5 Mean : 837
## 3rd Qu.: 466.0 3rd Qu.: 546.0 3rd Qu.: 552.0 3rd Qu.: 599
## Max. :131242.0 Max. :167116.0 Max. :146390.0 Max. :164148
The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.
## Define colors for subsequent plots
col <- c("green","orange")[as.numeric(conditions$cond.type)]
show(col)
## [1] "orange" "orange" "green" "green"
## Data distribution
hist(as.matrix(count.table))
The histogram is not very informative so far, apparently due to the presence of a few very high count values, that impose a very large scale on the X axis. We can use a logarithmic transformation to improve the readability. Note that we will add pseudo count to avoid problems with “zero” counts observed for some genes in some samples.
## Data distribution in log scale.
## Logarithmic transformation
hist(as.matrix(log2(count.table + 1)), breaks=100)
## Boxplot
boxplot(log2(count.table + 1), col=col)
Anothre way to get an intuition of the value distributions is to use the plotDensity() function, which draws one density curve for each sample.
## Density
## We will require one function from the affy package
if(!require("affy")){
source("http://bioconductor.org/biocLite.R")
biocLite("affy")
}
## Loading required package: affy
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(affy)
plotDensity(log2(count.table + 1), lty=1, col=col, lwd=2)
grid()
legend("topright", legend=names(count.table), col=col, lwd=2)
Beware: the R function plotDensity() does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading.
Let’s have a look at the scatter plots using the pairs() function.
plotFun <- function(x,y){ dns <- densCols(x,y); points(x,y, col=dns, pch=".") }
pairs(log2(count.table + 1), panel=plotFun, lower.panel = NULL)