We selected tiny subset of the original data, in order to be able to run a representative variety of tools during the practicals.
The study case consists of paired-end strand-oriented RNA-seq samples from the mouse (Mus musculus) cell line P5424, which was derived from a population of T lymphocytes (i.e. thymocytes). Cells were culture in two conditions:
The goal of the experiment was to detect Differentially Expressed Genes (DEG) between treated and untreated cells.
Sequencing was done in strand-oriented paired ends, with 150bp reads on each side of the fragments.
In order to ensure a quick treatment for this tutorial, we selected the subset of reads restricted to a 30Mb region located on the mouse chromosome 18 (a small chromosome).
http://yasminekzl.github.io/Tlemcen_workshop_2016/index.html
http://pedagogix-tagc.univ-mrs.fr/courses/data/ngs/rna_seq_lan
For this tutorial we will only use a pair of read sets restricted to a particular 30Mb region of mouse chromosome 18.
Note that you don’t need to download these files on your computer. In the tutorial, we will simply need to copy-paste their URL to the Galaxy server. This avoids to transfer the file twice via a limited bandwidth (from the server to the practical room, and then from the practical room to the server).
Name | Description / URL |
---|---|
SRA | Sequence Read Archive https://www.ncbi.nlm.nih.gov/sra |
ENA | European Nucleotide Archive http://www.ebi.ac.uk/ena |
GEO | Gene Expression Omnibus |
ArrayExpress |
The Tuxedo Suite (Trapnell et al. 2012) has been developed for RNA-Seq data analysis. It is mainly composed of Bowtie, Tophat, Cufflinks, CuffDiff. It is intended to provide powerful solutions for read mapping, discovery of novel gene structures and differential expression analysis. In the practical session we will use this suite to analyse two samples obtained from study “SRP000698”. This study is available in the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra).
In the practical session we will use this suite to analyz e samples obtained from P5424 thymocyte cell line.
Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) are public databases that provide tools to submit, access and mine functional genomics data. Data may be related to array- or sequence-based technologies. For HTS data, GEO provides both processed data (such as .bed and .wig files) and links to raw data. Raw data are available from the Sequence Read Archive (SRA) database (including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics). Both web sites propose search engines to query their databases.
Here we will access the raw dataset through the SRA database
The Galaxy server motto is “Data intensive biology for everyone”. Galaxy is a web-based framework for data intensive biomedical research. Galaxy can be install easily on any computer but is also proposed through remote access by numerous research groups. It is intended to ease the development of complex workflows to analyze various types of biological data. Although it has been historically oriented toward NGS data analysis (ChIP-Seq, RNA-Seq, Ribosome profiling…), lots of public servers are also proposing set of tools dedicated to genomics, proteomics, image analysis, cancer stem cell analysis… The public server web page of the galaxy team lists all the publicly available servers throughout the world (n=70 at the time of writing).
Note: the Galaxy server is only maintained for didactic purposes, and its access is reserved to students from Aix-Marseille Université (M2 BBSG or Polytech). It is not intended to be a production server as it only maintained minimally.
Serviers
Server name | URL |
---|---|
rsat-tagc | http://rsat-tagc.univ-mrs.fr:8082 |
pedagogix | http://pedagogix-tagc.univ-mrs.fr/galaxy/ |
If this is your first connection, use the Register command. Otherwise, enter your login (use Login in the User menu at the top of the Galaxy window).
Analysis of the whole dataset would be time consuming. To make the analysis feasible within a reasonable time, data were previously mapped to the mouse genome (version mm9). A subset of reads that aligned onto chromosome 18 was extracted and will be used for this tutorial. The following dataset are available.
File name | Experiment | Description |
---|---|---|
DM1_chr18-20Mto50M_R1.fq.gz | Control DMSO, replicate 1 | Right end read. |
DM1_chr18-20Mto50M_R2.fq.gz | Control DMSO, replicate 1 | Left end read.. |
DM2_chr18-20Mto50M_R1.fq.gz | Control DMSO, replicate 2 | Right end read. |
DM2_chr18-20Mto50M_R2.fq.gz | Control DMSO, replicate 2 | Left end read. |
DM3_chr18-20Mto50M_R1.fq.gz | Control DMSO, replicate 3 | Right end read. |
DM3_chr18-20Mto50M_R2.fq.gz | Control DMSO, replicate 3 | Left end read. |
PI1_chr18-20Mto50M_R1.fq.gz | PMA/Ionomycine treated, replicate 1 | Right end read. |
PI1_chr18-20Mto50M_R2.fq.gz | PMA/Ionomycine treated, replicate 1 | Left end read.. |
PI2_chr18-20Mto50M_R1.fq.gz | PMA/Ionomycine treated, replicate 2 | Right end read. |
PI2_chr18-20Mto50M_R2.fq.gz | PMA/Ionomycine treated, replicate 2 | Left end read. |
PI3_chr18-20Mto50M_R1.fq.gz | PMA/Ionomycine treated, replicate 3 | Right end read. |
PI3_chr18-20Mto50M_R2.fq.gz | PMA/Ionomycine treated, replicate 3 | Left end read. |
These datasets are available directly in Galaxy to avoid network issues. We will start by analyzing the DM1 sample (control DMSO, replicate 1).
So far we imported the first read set of a paired-ends sequence set. We will now get the second read set in the same way.
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses that you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. FastQC can be run as a stand alone interactive application (for the immediate analysis of small numbers of FastQ files) or in a non-interactive mode (through shell commands) where it would be suitable for processing large numbers of files.
It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries that are biased in particular ways. You should treat the summary evaluations therefore as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.
This web site provides you with some nice example of sequencing failures and may help you in analyzing fastqc outputs.
Perform the same operation for DM1_R2 file.
Read trimming is a pre-processing step in which input read ends are cut (most generally the right end). One should keep in mind that this step may be crucial depending on the aligner used. Indeed most aligners will be unable to align a large fraction of the dataset when poor quality ends are kept. Several programs may be used to perform sequence trimming:
Here we will use sickle.
Most of the time the galaxy server will provide you with an already indexed genome that can be used by tophat to perform read alignment. In this practical, we would like to restrict the alignment to mouse chromosome 18 (this will be faster). We thus need to download the sequence of mouse chromosome 18. This sequence will be provided to tophat in the subsequent steps (tophat will perform sequence indexing internally by calling bowtie-build).
The sequence of the chr18 (mm9 build) can be downloaded through the UCSC web site.
NB: the chromosome sequence can also be obtained from ensembl ftp web site.
Several programs need to know about chromosome length to perform dedicated task. Chromosome information can be obtained using UCSC whose table-browser is interfaced in Galaxy.
In order to provide topHat with the location of known exons in the human genome, we will download a file in GTF format (Gene transfer format). You can get more information about this format on UCSC web site or GENCODE web site.
GTF file can be obtained both from UCSC table browser or ensembl ftp web site.
NB: it is very important at this step to ensure that the fasta file and the GTF file are obtained from the genome release (here mouse genome version mm9/GRCm37). The chromosome sequences and gene positions vary between genome releases.
Here we will use a GTF file containing information related to transcripts from mouse chromosome 18. This GTF file was obtained from GENCODE web site (Version M1 July 2011). Annotations are based on Ensembl server version 65.
TopHat is a fast splice-aware junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
We will start by mapping the reads corresponding to control sample.
NB: By default tophat will accept reads whose genomic mapping is ambiguous. This multi-mapped reads may be problematic in the downstream analysis. Indeed, keeping them may introduce spurious transcript models when trying to reconstruct underlying transcripts. However discarding them may also be problematic when computing expression levels of gene families. The Maximum number of alignments to be allowed argument instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Depending on the need, more stringent policy may be chosen (e.g setting “Maximum number of alignments to be allowed” to 1 indicates that muti-mapped reads should be discarded). Additional arguments are available in the command line version of tophat including -x/–transcriptome-max-hits and -M/–prefilter-multihits.
We will used samtools flagstat to assess the number of aligned read available in the bam file.
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Download the tophat result files from Galaxy to your computer:
Select mm9 as a genome and browse to chromosome 18.
In IGV, select the menu function File -> Load from file …, locate the download folder on your computer, select the bam file., and click Open.
Note: you do not need to load the bam index (bai) file because it will be loaded automatically with the bam file of the same name.
In the same way, load the splice junction file (bed format).
Mouse over a junction on of control_splice_junctions.bed track. What is the Depth about ?
As you may have notice the user needs to zoom inside IGV to visualize the alignments. This is due to the fact that BAM files may be very large (tens of Gb or more). Loading all information from the file would thus saturate the computer memory. We will thus create a more lightweight file that will just provide us with the mean coverage of each genomic region. This file in bigWig format will be compressed and indexed as the BAM files.
For the two wiggle files:
We can now use the cufflinks software to try to discover new transcripts inside the dataset. We will also provide cufflinks with the set of known transcript.
Galaxy allows user to apply the developed pipeline to another set of sample. To this aim, the user must create a workflow.
We will now apply this workflow to the sample corresponding to the activated thymocyte (replicate 1).
You may apply the workflow to all replicates. Do to that, create successively an history for storing DM2, DM3, PI2 and PI3. Import the corresponding files into the history and run the workflow. Don’t forget to import mm9_chr_size.txt, chr18_mm9_fa and chr18_20M-50M_gencode_vM1.gtf files.
Create a new history entitled DM versus PI. Copy the datasets below in this history (use history > Copy datasets).
We now have at least three different GTF files (depending on whether you have processed DM2,DM3,PI2,PI3):
We will ask cuffmerge to merge the novel annotations (obtained through cufflinks) with the reference (known annotation) and to classify the transcripts. It will annotate transcripts by producing a GTF file containing flags. Some of this flags may indicate that:
For a full description of all possible flags (“class code”), please refer to the cuffmerge web site (section ‘Transfrag class codes’).
Here we will concentrate on retrieving the position of novel transcripts.
The objective of quantification is to estimate the expression level of each gene by counting the number of reads overlapping each gene model. Several programs have been developed for this task (cuffdiff, featureCount, HTSeq-count,…). The FeatureCounts software is a lightweight read counting program written entirely in the C programming language. It has a variety of advanced parameters but its major strength is its outstanding performance (10GB SE BAM file takes about 7 minutes on a single average CPU).
## First we read gene counts
count <- read.table("raw_counts.txt" ,sep="\t", head=T,row=1)
#head(count)
#head(rownames(count))
#colnames(count)
# Change column names
colnames(count) <- c("control", "Treated")
## Values are log2 transformed
## (a pseudo-count is added in case one of the sample is equal or close to zero)
count <- log2(count +1)
## Checking distribution of FPKM values
hist(as.matrix(count), main="Distribution of count values")
boxplot(count, col=c("red","gray"), pch=16, main="Boxplot for count values")
## Scatter plot comparing expression levels in sample 1 and 2
par(xaxs='i',yaxs='i')
plot(count, pch=20, panel.first=grid(col="darkgray"))
identify(count[,1], count[,2],lab=rownames(count))
integer(0)
The gene ENSMUSG00000038418 seems to be strongly induced during T-cell activation. Go to Ensembl genome Browser and use the search area to get some information about it.
We have seen that gene Egr1 is strongly induced upon PMA/Ionomycin treatment. Our purpose is now to define a list of differentially expressed genes. To this aim we will use DESeq2 R library (interfaced in Galaxy server). DESeq2 will perform a statistical test that will point out some genes whose expression differs between both conditions. At these step we will work with the full dataset that can be obtained through Shared Data > Data Libraries > TlemCen 2016 > COUNTS > DM_vs_PI_gene_counts.txt > Import this dataset into selected history. In the new window select DM versus PI as Destination history. Click on Import library dataset.
We will now select differentially expressed genes by applying a threshold on the **adjusted p-value (padj) and log2-fold change.
Now we will extract the count for differentially expressed genes from the count matrix (DM_vs_PI_gene_counts.txt).
We will now apply a hierarchical clustering (class discovery) to check whether our selected genes are able to distinguish between PI and DM samples. First we will need to transform data into log2 scale.
Now we can load the result in Java Treeview.
Try to use edgeR or Voom to get a list of differentially expressed genes. Which one seems the most conservative ?
We will now use the DAVID database to perform functionnal enrichment analysis using our list of differentially expressed genes as input.
Trapnell, Cole, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn, and Lior Pachter. 2012. “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.” Nature Protocols 7 (3): 562–78.