RNA-Seq analysis: practical session using Tuxedo suite

The "Tuxedo Suite" is mainly composed of Bowtie, Tophat, Cufflinks, CuffDiff. It has been developed in order to ease read mapping, discovery of splice junction and novel gene structure and differential expression analysis. In the practical session we will use this suite to analyse two samples obtained from study "SRP000698" available in the SRA database

Content

The SRP000698 dataset: Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing.
Quality control of high throughput sequencing data
Read trimming
Mapping read with TopHat
Samtools: sorting and indexing the BAM file.
Expression level estimate with cuffdiff
Comparing expression levels in R
Discovering novel genes
References

The SRP000698 dataset: Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing

In this article the authors have used RNA-Seq technology to compare the transcriptome of actived and resting T-Cells. Using this technology they were also able to monitor allele-specific expression (ASE), that is, specific expression arising from maternally and paternally derived alleles. In this tutorial we will mainly concentrate on mapping read to the genome and compute gene expression levels with the Tuxedo suite.

Getting more informations about the experiment

The SRA Sequence Read Archive (SRA) web site can be accessed here .
The SRA Run browser (Tools section) can be used to search for a SRA object Search >SRA Objects

Get informations about the SRA study "SRP000698"
What is the study about ?
What platform was used ?
How many reads were produced ?
How many samples were analyzed ?
Get informations about experiments "SRX011549" and "SRX011550" (SRA Objects)
Which of these two samples is untreated or treated ?
How many runs were performed per samples ?
Is this experiment single-end or paire-end sequencing ? What are the sizes of the reads ?
How many reads are available per run on average ? Calculate it roughly.
Select one run. What is the sequence of the first read ?
what is the quality of this read ?

Obtaining the data

Analysis of the whole dataset would be time consuming and would require access to a computing server. To make the analysis feasible on a desktop computer, data were previously retrieved from SRA, fastq-transformed using SRA toolkit (fastq-dump) and mapped to the human genome (version hg19). A subset of reads that aligned onto chromosome 10 was extracted and will be used for this tutorial.

Open a terminal.
Change the current working directory to /filer/shares/MC-CIS-ASG1/.
Create a new directory and give it your login as a name.
Go into this directory and create directories named fastq, progs, index, tophat_results annotationsand cufflinks_results.
Go in the fastq directory and retrieve the datasets below.
Uncompress the datasets.
Look at the first lines of the SRR027888.SRR027890_chr10_1.fastq file to check the fastq format.

File name	Experiment	Description
SRR027888.SRR027890_chr10_1.fastq.gz	SRX011549	Right end read
SRR027888.SRR027890_chr10_2.fastq.gz	SRX011549	Left end read
SRR027889.SRR027891_chr10_1.fastq.gz	SRX011550	Right end read
SRR027889.SRR027891_chr10_2.fastq.gz	SRX011550	Left end read

View solution| Hide solution

Solution

	
## Changing working directory
WORKINGDIR=/filer/shares/MC-CIS-ASG1/
cd ${WORKINGDIR}

## Creating sub-directories
mkdir -p ${USER}
ls -trl
cd ${USER}
mkdir -p fastq progs index tophat_results annotations cuffdiff_results
cd fastq

## Retrieving data
wget ftp://tagc.univ-mrs.fr/public/Tagc/Denis/SRP000698/SRR027888.SRR027890_chr10_1.fastq.gz
wget ftp://tagc.univ-mrs.fr/public/Tagc/Denis/SRP000698/SRR027888.SRR027890_chr10_2.fastq.gz
wget ftp://tagc.univ-mrs.fr/public/Tagc/Denis/SRP000698/SRR027889.SRR027891_chr10_1.fastq.gz
wget ftp://tagc.univ-mrs.fr/public/Tagc/Denis/SRP000698/SRR027889.SRR027891_chr10_2.fastq.gz

## Uncompressing data
gunzip *gz

## Checking fastq file format
less SRR027888.SRR027890_chr10_1.fastq # q to quit

Quality control of high throughput sequencing data

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

Download the FastQC program and install it inside the progs directory.

## Changing working directory to 'progs'
cd ${WORKINGDIR}${USER}"/progs"   

## Retrieving the FastQC program    
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip

## Uncompressing the file
unzip fastqc_v0.10.1.zip

## make the fastqc file executable
chmod u+x FastQC/fastqc

Launch the FastQC program and check quality of SRR027888.SRR027890_chr10_1.fastq fastq file.

FastQC/fastqc

Carefully inspect all the statistics. What do you think of the overall quality of the dataset ?

Read trimming

Read trimming is a pre-processing step in which input read ends are cutted (most generally the right end). Here, reads were previously trimmed. However one should keep in mind that this step is crucial when working with bowtie/tophat. Indeed as bowtie does not perform "hard-clipping" (that is clip sequence NOT present in the reference) it may be unable to align a large fraction of the dataset when poor quality ends are kept. Several software may be used to perform sequence trimming:

NB:Note that we won't perform read trimming in this practical as reads were already trimmed.

Mapping read with TopHat

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

Indexing the reference

Here we will align reads to the chromosome 10 of the human genome (version hg19). We thus need to index the corresponding sequence to speed up the read mapping process. Sequence for chromosome 10 can be obtained from the ftp site of the UCSC.

## indexing the reference
cd ${WORKINGDIR}${USER}"/index/"
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
gunzip chr10.fa.gz
bowtie2-build chr10.fa chr10.hs

Getting annotation (gtf file)

Tophat should be able (when using reads of sufficient length) to perform spliced-alignments. However, the limits of exonic regions of the genome could be hard to define without any additional information. To ease the mapping of spliced-reads we will provide tophat with a GTF file containing all informations related to transcript coordinates. GTF files can be retrieved, for instance, from the UCSC (using the table browser) or from the illumina iGenome web site. Here we will retrieve from the following link a subset of the full set of human transcript in gtf format using the wget command. Place it into the annotations directory:

http://pedagogix-tagc.univ-mrs.fr/courses/data/annotations/hg19_chr10.gtf.gz

	
## Changing working directory 
cd ${WORKINGDIR}${USER}"/annotations"

## retrieve the gtf file
wget wget http://pedagogix-tagc.univ-mrs.fr/courses/data/annotations/hg19_chr10.gtf.gz

## uncompress the gtf file
gunzip hg19_chr10.gtf.gz

## Check file content
less hg19_chr10.gtf # press q to quit

Mapping

Now we can align reads with tophat

Change directory to the tophat_results directory.
Create a new directory SRR027888.SRR027890 and go to this directory.
Use the tophat command to align left and right reads onto chromosome 10. Use default parameters except the following:

Set --library-type argument to fr-unstranded.
Set max-multihits parameter (-g) and transcriptome-max-hits (-x) to 1 and to eliminate multireads.
Set SRR027888.SRR027890_chr10_1.fastq as the first RNA-Seq FASTQ file and SRR027888.SRR027890_chr10_2.fastq as the second RNA-Seq FASTQ file.
Set chr10.hs as a reference genome.
Set Mean Inner Distance (-r) between mate pairs to 300 (obtained based on protocol section of the article).
Set Std. Dev for Distance between Mate Pairs to 30 .
Set the output directory (-o) to the current directory.
Use the -G argument to provide tophat with the downloaded gtf file.
Execute.

View solution| Hide solution

Solution

	
## Changing working directory to tophat_results
cd ${WORKINGDIR}${USER}"/tophat_results"

## Creating a directory for storing tophat results for SRR027888/SRR027890 runs.
mkdir -p SRR027888.SRR027890
cd SRR027888.SRR027890

## Launching tophat
## -p for the number of core (adapt to your need and computer)
tophat2 -p 2 -g 1 -x 1 --mate-inner-dist 300 -G ../../annotations/hg19_chr10.gtf --mate-std-dev 30 -o ./  ../../index/chr10.hs ../../fastq/SRR027888.SRR027890_chr10_1.fastq ../../fastq/SRR027888.SRR027890_chr10_2.fastq

## listing the corresponding results
ls

Samtools: sorting and indexing the BAM file.

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. The BAM files are compressed version of the SAM files. The samtools program provide various utilities for manipulating alignments in the SAM format, including sorting, merging and indexing (...).

In order to visualize the results in a genome browser, BAM file need to be sorted (according to chromosome and genomic coordinate). We must then index the subsequent file to speed up the queries when inspecting a particular region of the genome.

Rename the accepted_hits.bam file to CD4_activated.bam
Use samtools view to visualize the content of the compressed BAM file (CD4_activated.bam).
Index the sorted bam file using samtools.

View solution| Hide solution

Solution

	
## Changing name
mv accepted_hits.bam CD4_activated.bam
## Viewing the alignment results
samtools view  CD4_activated.bam | less # q to quit

## Indexing the bam file
samtools index  CD4_activated.bam

Viewing the results with Integrated Genome Browser (IGV).

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

Create an IGV account here
Download IGV and launch it with 750 MB or 1.2 Gb depending of your machine
Select hg19 genome and chromosome 10
Use File > Load from file and browse to the bam file.
Zoom in to visualize reads mapped onto the genome (chr10).
Load the junctions.bed, insertions.bed and deletions.bed into IGV (File > Load from file).

Now perform the same analysis (read mapping and FPKM computation) for the SRR027889.SRR027891 run.

View solution| Hide solution

Solution

## Changing working directory to tophat_results
cd ${WORKINGDIR}${USER}"/tophat_results"

## Creating a directory for storing tophat results for SRR027889.SRR027891 runs.
mkdir SRR027889.SRR027891
cd SRR027889.SRR027891

## Launching tophat
tophat2 -p 2 -g 1 -x 1 -r 200  -G ../../annotations/hg19_chr10.gtf --mate-std-dev 30  -o ./  ../../index/chr10.hs ../../fastq/SRR027889.SRR027891_chr10_1.fastq ../../fastq/SRR027889.SRR027891_chr10_2.fastq

## rename the bam file
mv accepted_hits.bam CD4_control.bam

## index the bam file
samtools index CD4_control.bam

Expression level estimate with cuffdiff

Computing counts

Cuffdiff can be used to compute transcript expression estimates and differential analysis. It can compute both FPKM (or RPKM) gene count estimate for RNA-Seq data. One benefit of cuffdiff is that count estimates are computed at different level: gene, transcript, ORF,... One drawback is that computing may be slow when working with large datasets. For information, an alternative is to use htseq-count.

Cuffdiff relies on a gtf file to compute gene expression. We will thus make use of the bam files together with the hg19_chr10.gtf file.

Change directory to the cuffdiff_results directory.
Use the cuffdiff command to compute gene expression. Use default parameters except the following:

Set --library-type argument to fr-unstranded.
Set frag-len-mean to 300 and frag-len-std-dev to 20.
Set the hg19_chr10.gtf file as annotation source (-G).
Set the accepted_hits.bam BAM file produced by TopHat as BAM input.
Use the hg19_chr10.gtf

file as input
Execute.
Check the results in the isoforms.fpkm_tracking file.

View solution| Hide solution

Solution

## Changing working directory to cuffdiff_results cd ${WORKINGDIR}${USER}"/cuffdiff_results" ## starting cuffdiff cuffdiff --no-diff --frag-len-mean 300 --frag-len-std-dev 20 --library-type fr-unstranded -p 1 ../annotations/hg19_chr10.gtf ../tophat_results/SRR027888.SRR027890/CD4_activated.bam ../tophat_results/SRR027889.SRR027891/CD4_control.bam

Comparing expression levels in R

Change directory to cuffdiff_results and start R.

cd ${WORKINGDIR}${USER}"/cuffdiff_results" R

Use the R code below to compare gene expression levels.

## First we read gene counts count <- read.table("genes.count_tracking" ,sep="\t", head=T,row=1) head(count) head(rownames(count)) colnames(count) ## creating an expression matrix count <- data.frame(count$q1_count, count$q2_count, row.names=rownames(count)) head(count) ## Values are log2 transformed ## (a pseudo-count is added in case one of the sample is equal or close to zero) count <- log2(count +1) ## Checking distribution of FPKM values hist(as.matrix(count), main="Distribution of count values") boxplot(count, col=c("red","gray"), pch=16, main="Boxplot for count valaues") ## Scatter plot comparing expression levels in sample 1 and 2 par(xaxs='i',yaxs='i') plot(count, pch=20, panel.first=grid(col="darkgray")) identify(count[,1], count[,2],lab=rownames(count))

Discovering novel genes

If you have some time left, use cufflinks using the -g argument to search for unknown gene structures. Using bedtools try to identify novel genes that are at least 10kb away from known any genes.

References

Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Heap GA, Yang JH, Downes K, Healy BC, Hunt KA, Bockett N, Franke L, Dubois PC, Mein CA, Dobson RJ, Albert TJ, Rodesch MJ, Clayton DG, Todd JA, van Heel DA, Plagnol V. Hum Mol Genet. 2010 Jan 1;19(1):122-34.