RNA-Seq analysis: practical session using Tuxedo suite

The "Tuxedo Suite" is mainly composed of Bowtie, Tophat, Cufflinks, CuffDiff. It has been developed in order to ease read mapping, discovery of splice junction and novel gene structure and differential expression analysis. In the practical session we will use this suite to analyse two samples obtained from study "SRP000698" available in the SRA database


The SRP000698 dataset: Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing

In this article the authors have used RNA-Seq technology to compare the transcriptome of actived and resting T-Cells. Using this technology they were also able to monitor allele-specific expression (ASE), that is, specific expression arising from maternally and paternally derived alleles. In this tutorial we will mainly concentrate on mapping read to the genome and compute gene expression levels with the Tuxedo suite.

Getting more informations about the experiment

  1. Get informations about the SRA study "SRP000698"
  2. What is the study about ?
  3. What platform was used ?
  4. How many reads were produced ?
  5. How many samples were analyzed ?
  6. Get informations about experiments "SRX011549" and "SRX011550" (SRA Objects)
  7. Which of these two samples is untreated or treated ?
  8. How many runs were performed per samples ?
  9. Is this experiment single-end or paire-end sequencing ? What are the sizes of the reads ?
  10. How many reads are available per run on average ? Calculate it roughly.
  11. Select one run. What is the sequence of the first read ?
  12. what is the quality of this read ?

Obtaining the data

Analysis of the whole dataset would be time consuming and would require access to a computing server. To make the analysis feasible on a desktop computer, data were previously retrieved from SRA, fastq-transformed using SRA toolkit (fastq-dump) and mapped to the human genome (version hg19). A subset of reads that aligned onto chromosome 10 was extracted and will be used for this tutorial.

  1. Open a terminal.
  2. Change the current working directory to /filer/shares/MC-CIS-ASG1/.
  3. Create a new directory and give it your login as a name.
  4. Go into this directory and create directories named fastq, progs, index, tophat_results annotationsand cufflinks_results.
  5. Go in the fastq directory and retrieve the datasets below.
  6. Uncompress the datasets.
  7. Look at the first lines of the SRR027888.SRR027890_chr10_1.fastq file to check the fastq format.
File name Experiment Description
SRR027888.SRR027890_chr10_1.fastq.gz SRX011549 Right end read
SRR027888.SRR027890_chr10_2.fastq.gz SRX011549 Left end read
SRR027889.SRR027891_chr10_1.fastq.gz SRX011550 Right end read
SRR027889.SRR027891_chr10_2.fastq.gz SRX011550 Left end read
View solution| Hide solution

Quality control of high throughput sequencing data

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

  1. Download the FastQC program and install it inside the progs directory.
  2. ## Changing working directory to 'progs'
    cd ${WORKINGDIR}${USER}"/progs"   
    ## Retrieving the FastQC program    
    wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip
    ## Uncompressing the file
    unzip fastqc_v0.10.1.zip
    ## make the fastqc file executable
    chmod u+x FastQC/fastqc
  3. Launch the FastQC program and check quality of SRR027888.SRR027890_chr10_1.fastq fastq file.

  1. Carefully inspect all the statistics. What do you think of the overall quality of the dataset ?

Read trimming

Read trimming is a pre-processing step in which input read ends are cutted (most generally the right end). Here, reads were previously trimmed. However one should keep in mind that this step is crucial when working with bowtie/tophat. Indeed as bowtie does not perform "hard-clipping" (that is clip sequence NOT present in the reference) it may be unable to align a large fraction of the dataset when poor quality ends are kept. Several software may be used to perform sequence trimming:


NB:Note that we won't perform read trimming in this practical as reads were already trimmed.

Mapping read with TopHat

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

Indexing the reference

Here we will align reads to the chromosome 10 of the human genome (version hg19). We thus need to index the corresponding sequence to speed up the read mapping process. Sequence for chromosome 10 can be obtained from the ftp site of the UCSC.

## indexing the reference
cd ${WORKINGDIR}${USER}"/index/"
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
gunzip chr10.fa.gz
bowtie2-build chr10.fa chr10.hs

Getting annotation (gtf file)

Tophat should be able (when using reads of sufficient length) to perform spliced-alignments. However, the limits of exonic regions of the genome could be hard to define without any additional information. To ease the mapping of spliced-reads we will provide tophat with a GTF file containing all informations related to transcript coordinates. GTF files can be retrieved, for instance, from the UCSC (using the table browser) or from the illumina iGenome web site. Here we will retrieve from the following link a subset of the full set of human transcript in gtf format using the wget command. Place it into the annotations directory:


## Changing working directory 
cd ${WORKINGDIR}${USER}"/annotations"

## retrieve the gtf file
wget wget http://pedagogix-tagc.univ-mrs.fr/courses/data/annotations/hg19_chr10.gtf.gz

## uncompress the gtf file
gunzip hg19_chr10.gtf.gz

## Check file content
less hg19_chr10.gtf # press q to quit


Now we can align reads with tophat

  1. Change directory to the tophat_results directory.
  2. Create a new directory SRR027888.SRR027890 and go to this directory.
  3. Use the tophat command to align left and right reads onto chromosome 10. Use default parameters except the following:
    1. Set --library-type argument to fr-unstranded.
    2. Set max-multihits parameter (-g) and transcriptome-max-hits (-x) to 1 and to eliminate multireads.
    3. Set SRR027888.SRR027890_chr10_1.fastq as the first RNA-Seq FASTQ file and SRR027888.SRR027890_chr10_2.fastq as the second RNA-Seq FASTQ file.
    4. Set chr10.hs as a reference genome.
    5. Set Mean Inner Distance (-r) between mate pairs to 300 (obtained based on protocol section of the article).
    6. Set Std. Dev for Distance between Mate Pairs to 30 .
    7. Set the output directory (-o) to the current directory.
    8. Use the -G argument to provide tophat with the downloaded gtf file.
    9. Execute.

View solution| Hide solution

Samtools: sorting and indexing the BAM file.

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. The BAM files are compressed version of the SAM files. The samtools program provide various utilities for manipulating alignments in the SAM format, including sorting, merging and indexing (...).

In order to visualize the results in a genome browser, BAM file need to be sorted (according to chromosome and genomic coordinate). We must then index the subsequent file to speed up the queries when inspecting a particular region of the genome.

  1. Rename the accepted_hits.bam file to CD4_activated.bam
  2. Use samtools view to visualize the content of the compressed BAM file (CD4_activated.bam).
  3. Index the sorted bam file using samtools.

View solution| Hide solution

Viewing the results with Integrated Genome Browser (IGV).

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

  1. Create an IGV account here
  2. Download IGV and launch it with 750 MB or 1.2 Gb depending of your machine
  3. Select hg19 genome and chromosome 10
  4. Use File > Load from file and browse to the bam file.
  5. Zoom in to visualize reads mapped onto the genome (chr10).
  6. Load the junctions.bed, insertions.bed and deletions.bed into IGV (File > Load from file).
Now perform the same analysis (read mapping and FPKM computation) for the SRR027889.SRR027891 run.
View solution| Hide solution

Expression level estimate with cuffdiff

Computing counts

Cuffdiff can be used to compute transcript expression estimates and differential analysis. It can compute both FPKM (or RPKM) gene count estimate for RNA-Seq data. One benefit of cuffdiff is that count estimates are computed at different level: gene, transcript, ORF,... One drawback is that computing may be slow when working with large datasets. For information, an alternative is to use htseq-count.

Cuffdiff relies on a gtf file to compute gene expression. We will thus make use of the bam files together with the hg19_chr10.gtf file.

  1. Change directory to the cuffdiff_results directory.
  2. Use the cuffdiff command to compute gene expression. Use default parameters except the following:
    1. Set --library-type argument to fr-unstranded.
    2. Set frag-len-mean to 300 and frag-len-std-dev to 20.
    3. Set the hg19_chr10.gtf file as annotation source (-G).
    4. Set the accepted_hits.bam BAM file produced by TopHat as BAM input.
    5. Use the hg19_chr10.gtf
    6. file as input
    7. Execute.
    8. Check the results in the isoforms.fpkm_tracking file.
View solution| Hide solution

Comparing expression levels in R

  • Change directory to cuffdiff_results and start R.
    cd ${WORKINGDIR}${USER}"/cuffdiff_results"
  • Use the R code below to compare gene expression levels.
  • ## First we read gene counts
    count <- read.table("genes.count_tracking" ,sep="\t", head=T,row=1)
    ## creating an expression matrix
    count <- data.frame(count$q1_count, count$q2_count, row.names=rownames(count))
    ## Values are log2 transformed 
    ## (a pseudo-count is added in case one of the sample is equal or close to zero)
    count <- log2(count +1)
    ## Checking distribution of FPKM values
    hist(as.matrix(count), main="Distribution of count values")
    boxplot(count, col=c("red","gray"), pch=16, main="Boxplot for count valaues")
    ## Scatter plot comparing expression levels in sample 1 and 2
    plot(count, pch=20, panel.first=grid(col="darkgray"))
    identify(count[,1], count[,2],lab=rownames(count))

    Discovering novel genes

    If you have some time left, use cufflinks using the -g argument to search for unknown gene structures. Using bedtools try to identify novel genes that are at least 10kb away from known any genes.


    1. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Heap GA, Yang JH, Downes K, Healy BC, Hunt KA, Bockett N, Franke L, Dubois PC, Mein CA, Dobson RJ, Albert TJ, Rodesch MJ, Clayton DG, Todd JA, van Heel DA, Plagnol V. Hum Mol Genet. 2010 Jan 1;19(1):122-34.