Workshop: ChIP-Seq Data Analysis

Introduction

Goal
The aim is to:

Understand how to process reads to obtain peaks (peak-calling).
Become familiar with differential analysis of peaks.

In practice:

Obtain dataset from GEO
Analyze mapped reads
Obtain set(s) of peaks, handle replicates
Differential analysis of peaks

Create a Galaxy instance at IFB

Goal: The objective is to create a virtual machine at IFB that provides a Galaxy server over HTTP. This machine will be used for the rest of the tutorial. NB: you will have full administration rights on this machine, which allows you to install additional tools when required.

Connection to the IFB Cloud

Go to the IFB cloud connection page and click the link Se connecter.
Sign in using your login and password. You should then see the IFB cloud dashboard page.

Shut down your previous virtual machine if still open

Warning: in principle, you should have shut down your virtual machine from yesterday's practical in order to free cloud resources. If this is not the case, please do so before creating a new virtual machine.

Protocol to shut down your previous Virtual Machine (if required):

In the cloud dashboard, click Show Instances.
In the list of instances, check the box in front of your previous virtual machine.
At the top of the instance list, select the **Shutdown** option and click **Go**.
Confirm the shutdown. Wait a few seconds and click Show Instances to check that the virtual machine has shut down properly (shutting down a VM takes some time, since it performs the same operations as when you properly shut down a real computer).

Create a virtual hard drive (vDisk)

For this tutorial, we need to create a completely empty virtual disk (*vDisk*), and to mount it on a new instance of the virtual machine (*VM*) dedicated to ChIP-seq analysis.

In the dashboard, click on the Show vDisks button.
Click New vDisk, and set:
- Size: 10 (i.e. 10 GB)
- Name: tuto-chipseq-hdd.

Create an instance of the "EBA Galaxy ChIP-seq" virtual machine

Click on New Instance.
In the opened panel, select the following options:

Appliance: EBA15 Galaxy ChIP-Seq.
Name: tuto-chipseq-vm (vm stands for "Virtual Machine").
Type: c3.xlarge (4 CPUs, 16 GB).
Persistent disk: select the virtual disk created at the previous step: tuto-chipseq-hdd.
Press Run.

Note: it will take a few minutes to start the virtual machine and to initialize the Galaxy server. Click periodically on the View Instances button until you see a green circle in the row of your tuto-chipseq-vm instance.

Open the Galaxy server

Once the virtual machine has been created, right-click on the orange http link and open it in a separate tab. This lets you keep a tab open on the dashboard, which you will need after the tutorial in order to shut down your VM.
It may take a few minutes before the Galaxy server becomes available. If you see a message "Service temporarily unavailable", wait a minute or two and refresh the page.

First look at the data...

Goal: this first exercise is meant to give you a first look at the aligned reads.

Procedure:

Go to your Galaxy instance
At the top, click on Shared data > Data libraries
In the bam section, select the following datasets:
- SRR540191
- SRR540220
At the bottom, select Import into current history and click on Go
The datasets should appear in your current history
Go to your history
Click on the name of the datasets in the history
Click on the disk icon of each dataset and select Download Dataset to download the BAM file
Then, click again on the disk icon of each dataset and select Download bam_index to download the index of the BAM file
Load the bam files into IGV

Navigate randomly: do you spot enriched regions? Zoom into these regions and look at the reads.
Go to the TFF1 gene and have a look at both datasets

Retrieve datasets on Gene Expression Omnibus

Goal: this exercise is meant to demonstrate how one can typically retrieve published datasets from the Gene Expression Omnibus website for further analysis. However, to save time, we will then use datasets that have already been downloaded and pre-processed.

1 - About the dataset

For this tutorial we will use ChIP-seq datasets produced by Theodorou et al. The authors used ChIP-seq to systematically identify estrogen receptor (abbreviated as ER or ESR1) binding regions across the human genome. Importantly, they demonstrated that knock-down of GATA3 through siRNA strongly affects ESR1 binding sites. The abstract of the article is provided below.

Abstract

Estrogen receptor (ESR1) drives growth in the majority of human breast cancers by binding to regulatory elements and inducing transcription events that promote tumor growth. Differences in enhancer occupancy by ESR1 contribute to the diverse expression profiles and clinical outcome observed in breast cancer patients. GATA3 is an ESR1-cooperating transcription factor mutated in breast tumors; however, its genomic properties are not fully defined.
In order to investigate the composition of enhancers involved in estrogen-induced transcription and the potential role of GATA3, we performed extensive ChIP-sequencing in unstimulated breast cancer cells and following estrogen treatment. We find that GATA3 is pivotal in mediating enhancer accessibility at regulatory regions involved in ESR1-mediated transcription. GATA3 silencing resulted in a global redistribution of cofactors and active histone marks prior to estrogen stimulation. These global genomic changes altered the ESR1-binding profile that subsequently occurred following estrogen, with events exhibiting both loss and gain in binding affinity, implying a GATA3-mediated redistribution of ESR1 binding. The GATA3-mediated redistributed ESR1 profile correlated with changes in gene expression, suggestive of its functionality. Chromatin loops at the TFF locus involving ESR1-bound enhancers occurred independently of ESR1 when GATA3 was silenced, indicating that GATA3, when present on the chromatin, may serve as a licensing factor for estrogen-ESR1-mediated interactions between cis-regulatory elements. Together, these experiments suggest that GATA3 directly impacts ESR1 enhancer accessibility, and may potentially explain the contribution of mutant-GATA3 in the heterogeneity of ESR1+ breast cancer.

Within the article, a section gives the accession numbers of the datasets produced:

Data access

The microarray data and ChIP-seq data from this study have been deposited in the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) under accession nos. GSE39623 and GSE40129, respectively.

2 - Find the dataset on Gene Expression Omnibus

Gene Expression Omnibus (GEO) is a public repository that provides tools to submit, access and mine functional genomics data. Data may come from array- or sequence-based technologies. For HTS data, GEO provides both processed data (such as *.bam, *.bed, *.wig files) and links to raw data. Raw data are available from the Sequence Read Archive (SRA) database (covering platforms including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics). Both web sites offer search engines to query their databases.

Procedure:

Go to GEO web site.
Choose "Search" and paste GSE40129 (GSE stands for GEO Series Experiment). Click "GO" to get information about this experiment.
In the "sample section" (middle of the page), click on "More" to visualize all sample names.
Click on GSM986059 hyperlink (GSM stands for GEO SaMple) to get information about this sample.
In the "relations" section, select "SRX176856" hyperlink to open the SRA page corresponding to this sample.
Click on the SRR link (bottom left) to access the record of the run.
On the new page, click on the Reads tab to view the read sequence.

From there, you could also download the dataset as a .sra file, but we will not do so in this practical (beware: this would take time and occupy disk space, since SRA files typically weigh several hundred MB).

Questions:

What is the HTS platform used to sequence this sample?
Is this experiment single-end or paired-end sequencing?
What is the read length?
How many runs (i.e. lanes) are associated with this sample?
How many reads were produced (# of Spots)?
Select the SRR540188 hyperlink. What is the sequence of the first read?

At this point, you should be able to find a given dataset in GEO, and obtain the raw data (reads).
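The accession prefixes met in this procedure can also be told apart programmatically. The sketch below is a hypothetical helper, not part of any GEO or SRA tool: it classifies an accession by its three-letter prefix and builds the address of a GEO record, assuming the usual `acc.cgi?acc=` address pattern of the GEO website.

```python
# Hypothetical helper: function names and the URL pattern are assumptions
# made for illustration.
PREFIXES = {
    "GSE": "GEO series",
    "GSM": "GEO sample",
    "SRX": "SRA experiment",
    "SRR": "SRA run",
}

GEO_RECORD_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"


def describe_accession(accession):
    """Return the record type for an accession, based on its 3-letter prefix."""
    return PREFIXES.get(accession[:3].upper(), "unknown")


def geo_url(accession):
    """Build the GEO page address for a GSE or GSM accession."""
    if describe_accession(accession) not in ("GEO series", "GEO sample"):
        raise ValueError(f"{accession} is not a GEO accession")
    return GEO_RECORD_URL.format(acc=accession)


for acc in ("GSE40129", "GSM986059", "SRX176856", "SRR540188"):
    print(acc, "->", describe_accession(acc))
print(geo_url("GSE40129"))
```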

Analyze the mapped reads

Goal: Retrieve BAM files that contain alignment results, inspect and analyze the mapped reads.

1 - Getting the mapped reads (BAM) within Galaxy

To save time and to avoid repeating processing steps already covered in other tutorials, we have already performed the quality check of the reads and the mapping. The steps are described here. We encourage participants to check these steps later.

In the next section, we will search for ER (Estrogen Receptor) binding sites in control samples (ChIP Estrogen Receptor on MCF-7 cell line treated with E2). In this tutorial, we will focus on quality control of the aligned datasets, peak calling and differential binding analysis. Hence, the starting point will be bam files of aligned reads for the different datasets:

ChIP-seq on estrogen receptor (ER) in wild-type condition (siNT), after estrogen induction (E2) (3 replicates available, siNT_ER_E2_r1,2,3)
ChIP-seq on estrogen receptor (ER) after GATA3 knock-down (siGATA), after estrogen induction (E2) (3 replicates available, siGATA_ER_E2_r1,2,3)
ChIP-seq on the histone mark H3K4me1 in siNT (1 replicate available, siNT_H3K4me1_Veh_r1)
ChIP-seq on the histone mark H3K4me1 in siGATA (1 replicate available, siGATA_H3K4me1_Veh_r1)
input control in MCF7 cells (1 replicate available)

Procedure to import shared history:

Log into your Galaxy account.
Use Shared data > data libraries > ChIP-seq datasets
Click on Import into current history
The datasets should appear in a newly created history called Imported: ChIP-seq datasets
Rename this history

2 - Number of mapped reads

Before starting the peak calling analysis, it is interesting to determine the type of alignments enclosed in the BAM files, and the level of duplicates.

Procedure:

In the tool search box, search for the flagstat tool.
Run this tool on 2 BAM files you imported: siNT_ER_E2_r3 and siGATA_ER_E2_r3.
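The categories reported by flagstat come from the FLAG field stored with each alignment. As a minimal sketch, the code below decodes a few FLAG bits defined in the SAM specification; the function name and the example values are illustrative, and the code does not read BAM files.

```python
# A subset of the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x4: "unmapped",
    0x10: "reverse strand",
    0x100: "secondary alignment",
    0x400: "duplicate",
}


def decode_flag(flag):
    """List the properties encoded in a SAM FLAG value (subset of bits only)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Illustrative values: 0 = mapped on the forward strand, 4 = unmapped,
# 16 = reverse strand, 1040 = reverse strand and marked as duplicate (16 + 1024).
for flag in (0, 4, 16, 1040):
    print(flag, decode_flag(flag) or ["mapped, forward strand"])
```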

Question: How many reads do the BAM files contain?

The table below summarizes the results for all datasets (all flagstat files are accessible in the Galaxy shared history Herrmann_Flagstats).

BAM file	Nb reads
siNT_ER_E2_r1	21 377 808
siNT_ER_E2_r2	12 035 808
siNT_ER_E2_r3	26 826 609
siGATA_ER_E2_r1	14 177 394
siGATA_ER_E2_r2	11 588 402
siGATA_ER_E2_r3	27 429 291
siNT_H3K4me1_Veh_r1	22 205 385
siGATA_H3K4me1_Veh_r1	22 227 392
MCF_input_r3	19 361 330
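Sequencing depth differs by more than a factor of two between these files, which is why normalization matters in the following steps. The short sketch below parses the read counts of the table (copied as tab-separated text, with spaces as thousands separators) and reports the smallest and largest libraries; the function name is illustrative.

```python
# Read counts copied from the table above.
TABLE = """\
siNT_ER_E2_r1\t21 377 808
siNT_ER_E2_r2\t12 035 808
siNT_ER_E2_r3\t26 826 609
siGATA_ER_E2_r1\t14 177 394
siGATA_ER_E2_r2\t11 588 402
siGATA_ER_E2_r3\t27 429 291
siNT_H3K4me1_Veh_r1\t22 205 385
siGATA_H3K4me1_Veh_r1\t22 227 392
MCF_input_r3\t19 361 330
"""


def parse_counts(table):
    """Map each BAM file name to its number of reads."""
    counts = {}
    for line in table.strip().splitlines():
        name, value = line.split("\t")
        counts[name] = int(value.replace(" ", ""))
    return counts


counts = parse_counts(TABLE)
smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)
print(f"smallest: {smallest} ({counts[smallest]:,} reads)")
print(f"largest:  {largest} ({counts[largest]:,} reads)")
print(f"ratio largest/smallest: {counts[largest] / counts[smallest]:.2f}")
```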

3 - Coverage for individual BAM files

We will now convert a BAM file to a bigWig file, which we can then upload to IGV for visual inspection. We will do this separately for signal and input, and then produce a combined file in which the background noise has been subtracted from the signal.

Procedure:

Find the tool bamCoverage in the deepTools section
Select the BAM file for the signal, siNT_ER_E2_r3, and run the tool on chromosome 1 only to reduce computation time.
For Average size of fragment length, choose 150 bp. We will check later whether this estimate is correct.
Keep the default values for the other parameters
Execute, and rename the output (otherwise, it might be erased by another run)
Download the resulting file and open it in the IGV browser
In IGV, right click on the left panel : use set data range, and set Max Value to 100
Repeat the same operation for the BAM file corresponding to the H3K4me1, and open the resulting bigwig file under the previous one in IGV.
Repeat the same operation for the BAM file corresponding to the input, and open the resulting bigwig file under the previous one in IGV.
Download the BAM files and corresponding indexes (bai) for: siNT_ER_E2_r3, H3K4me1 and input. Load them using IGV. Go to KIAA1324 gene.
For each bam, click on the left panel of the corresponding track and set Color alignment by > read strand.
Zoom out and select regions displaying high signal based on the coverage tracks (i.e. bigWig).
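To make the idea of a coverage track concrete, here is a simplified sketch of what such a computation involves: each read is extended to the assumed fragment length (150 bp, as in the procedure above) in its 3' direction, and the number of extended fragments overlapping each bin is counted. This is not the deepTools implementation; the 50 bp bin size, the function name and the toy reads are assumptions made for illustration.

```python
def binned_coverage(reads, region_length, fragment_length=150, bin_size=50):
    """Count extended fragments overlapping each bin of a region.

    reads: list of (five_prime_position, strand), with 0-based positions
    and strand '+' or '-'. Each read is extended to fragment_length,
    rightwards for '+' reads and leftwards for '-' reads.
    """
    n_bins = (region_length + bin_size - 1) // bin_size
    coverage = [0] * n_bins
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_length
        else:
            start, end = pos - fragment_length + 1, pos + 1
        # Clip the fragment to the region boundaries.
        start, end = max(start, 0), min(end, region_length)
        if start >= end:
            continue
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            coverage[b] += 1
    return coverage


# Toy example: two forward and two reverse reads around a binding site.
toy_reads = [(100, "+"), (120, "+"), (349, "-"), (360, "-")]
print(binned_coverage(toy_reads, region_length=500))
```

In this toy example, the extended forward and reverse reads pile up in the same central bins, which is the pattern expected around a binding site.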

Questions:

Do you see regions that seem to be enriched in signal compared to background?
In the TFF1 gene, check the signal on plus and minus strands for the BAM track obtained from siNT_ER_E2_r3. Does it correspond to the expected signal?
Do you recognize any region where both tracks show enrichment? Could these correspond to copy-number alterations in the MCF7 cell line?
If you compare ER ChIP-seq with H3K4me1 ChIP-seq, do you see a difference in the shape of the data (sharper peaks or broader domains of enrichment)?

4 - Combined coverage file

We want to combine the treatment and input files into one signal file which should indicate the level of signal, taking into account the background noise estimated from the input file. In order to do this, we will use the bamCompare tool from the deepTools toolbox, and use various normalization strategies discussed during the presentation.

Procedure

Select the bamCompare tool
Select the siNT_ER_E2_r3 treatment BAM file and the input BAM file
Choose the SES normalization method in Method to use for scaling the largest sample to the smallest
Choose compute difference (subtract input from treatment) in How to compare the two files
Choose a particular chromosome (chr1)
Repeat the same operation on the H3K4me1 treatment BAM file and the input BAM file
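The principle of "scale, then subtract" can be sketched in a few lines. Note that the code below does not implement SES: it swaps in plain read-depth scaling (the deeper sample is scaled down to the shallower one) simply to show how a difference track is obtained. The per-bin counts are invented, and the two library sizes are those of siNT_ER_E2_r3 and MCF_input_r3 from the read count table.

```python
def scale_and_subtract(treatment, control, n_treatment, n_control):
    """Scale the deeper sample down to the shallower one, then subtract.

    treatment, control: per-bin read counts over the same bins.
    n_treatment, n_control: total number of reads in each library.
    NB: simple read-depth scaling, used here in place of SES.
    """
    smallest = min(n_treatment, n_control)
    t_factor = smallest / n_treatment
    c_factor = smallest / n_control
    return [t * t_factor - c * c_factor for t, c in zip(treatment, control)]


# Invented per-bin counts for four bins.
treatment_bins = [10, 40, 12, 8]
input_bins = [5, 6, 5, 4]
difference = scale_and_subtract(treatment_bins, input_bins,
                                n_treatment=26_826_609, n_control=19_361_330)
print([round(v, 2) for v in difference])
```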

Exercise: Compare the individual coverage files (treatment and input) and the combined one.

(all bamCompare result files are accessible in the Galaxy shared history Herrmann_bamCompare)

5 - Comparing replicates and distinct datasets

When we have replicates, an important check is to what extent they agree with each other. We can compute the correlation of the signal of two datasets in windows over the whole genome; this can also be used to compare distinct datasets and determine which ones are closest.
We will apply this to the 8 datasets we have (excluding the input dataset), using the bamCorrelate tool of the deepTools toolbox.

Procedure

Find the tool bamCorrelate in the list of tools
Supply the 8 BAM files corresponding to treatment (exclude input!)
Select Pearson correlation
Important: restrict to one chromosome (of your choice this time: take your favorite chromosome)
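The quantity behind this comparison is the Pearson correlation between the per-window read counts of two samples. A minimal sketch follows, written in plain Python; the sample names and window counts are invented, and the tool itself of course works on the BAM files directly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant signal (zero variance) would raise ZeroDivisionError here.
    return cov / (sd_x * sd_y)


# Invented read counts in five windows for three hypothetical samples.
samples = {
    "ER_rep1": [12, 80, 15, 9, 60],
    "ER_rep2": [10, 95, 12, 11, 70],
    "H3K4me1": [40, 45, 50, 38, 42],
}
names = list(samples)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: r = {pearson(samples[a], samples[b]):.2f}")
```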

Questions:

Which of the samples seem to cluster best?
What about the replicates?
Check with your neighbors what it looks like on a different chromosome.

Peak-calling using MACS on ER ChIP-Seq

Goal: Now that we are (hopefully) convinced that the dataset contains signal, we will perform peak calling for the ESR1 ChIP-seq datasets, using the input dataset as control to identify statistically enriched regions (a.k.a. peaks). Peak calling will be performed using MACS (version 1.4.1).

1 - Single replicate

Procedure

Select the MACS14 tool and fill in the form as follows:
- Experiment name: give a name for the MACS run (siNT_ESR1_r3_MACS).
- Paired end sequencing: MACS can handle single or paired-end data; here we will select single end.
- ChIP-seq tag file: select the BAM file containing the treatment (ChIP): siNT_ER_E2_r3
- ChIP-seq control file: select the BAM file for the input.
- Effective genome size: this is the mappable genome size; the default value corresponds to hg19
- Tag size: these are Illumina datasets with a read length of 36.
- Diagnosis report: select Produce a diagnosis report.
- All other options should be set to default.
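For readers who later want to run MACS outside Galaxy, the form above roughly corresponds to a macs14 command line. The sketch below only assembles and prints such a command; the BAM file names are hypothetical, and the exact command built by the Galaxy wrapper may differ.

```python
import shlex


def macs14_command(treatment, control, name, tag_size=36, genome_size="hs"):
    """Assemble a macs14 command line mirroring the Galaxy form.

    File names are placeholders; "hs" is the MACS shortcut for the
    human effective genome size.
    """
    return [
        "macs14",
        "-t", treatment,           # ChIP-seq tag file
        "-c", control,             # ChIP-seq control file
        "-f", "BAM",               # input format
        "-g", genome_size,         # effective genome size
        "-n", name,                # experiment name
        "--tsize", str(tag_size),  # tag size
        "--diag",                  # produce a diagnosis report
    ]


cmd = macs14_command("siNT_ER_E2_r3.bam", "input.bam", "siNT_ESR1_r3_MACS")
print(shlex.join(cmd))
```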

Question: What type of files does MACS return?

Running MACS with these options should generate 2 result files:

An HTML report describing the model built by MACS, with links to additional files
A bed file containing the peaks.

To save time, we have already run MACS on all the BAM files. The results are available in the Galaxy shared history EBA 2015 ChIP-seq in the folder Examples -> Peaks.

Peaks model
Look at the PDF file generated by MACS: what fragment length has been determined by MACS? Is this consistent across replicates/experiments?

Peaks

How many peaks have been called by MACS? Use the Line/Word/Character count tool in the toolbox to count the number of lines in the bed file.
Use the sort tool in the toolbox to sort MACS peaks according to the score.
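The same two operations (counting peaks and sorting them by score) are easy to reproduce on a downloaded bed file. The sketch below works on a small invented example with five columns (chromosome, start, end, name, score); the peak names and scores are made up and are not actual MACS output.

```python
# Invented peaks in BED format: chrom, start, end, name, score.
BED_TEXT = """\
chr1\t1000\t1200\tpeak_1\t85.3
chr1\t5000\t5400\tpeak_2\t310.7
chr2\t300\t650\tpeak_3\t52.1
"""


def read_bed(text):
    """Parse BED lines into (chrom, start, end, name, score) tuples."""
    peaks = []
    for line in text.splitlines():
        # Skip empty lines and optional header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score = line.split("\t")[:5]
        peaks.append((chrom, int(start), int(end), name, float(score)))
    return peaks


peaks = read_bed(BED_TEXT)
by_score = sorted(peaks, key=lambda peak: peak[4], reverse=True)
print(len(peaks), "peaks")
for peak in by_score:
    print(*peak, sep="\t")
```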

2 - Consensus set of peaks

Here, we have 3 replicates for each condition, and therefore 3 sets of peaks. We can build a consensus set by determining the peaks that are found in all 3 replicates. This very simple procedure is likely to reduce the number of false positive peaks (keep in mind however that we might also have an increased false negative rate, if one of the replicates departs largely from the others...).

Procedure

Use the tool intersectBed, and make the intersection of the files containing the peaks for siNT for replicates 1 and 2.
Intersect again the resulting file with the peaks of replicate 3
Repeat the same operation with siGATA replicates.
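The two successive intersections can be sketched as follows. The function returns the overlapping portion of each pair of overlapping peaks, which is how a plain interval intersection behaves; it is a naive quadratic loop written for clarity, not the intersectBed implementation, and the three toy replicates are invented.

```python
def intersect(a, b):
    """Return the overlapping portions between peaks of a and peaks of b.

    Peaks are (chrom, start, end) tuples with half-open coordinates,
    as in the BED format.
    """
    shared = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                shared.append((chrom_a, start, end))
    return sorted(shared)


# Invented peak sets for three replicates.
rep1 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
rep2 = [("chr1", 250, 400), ("chr1", 1050, 1300)]
rep3 = [("chr1", 1100, 1150), ("chr2", 60, 120)]

consensus = intersect(intersect(rep1, rep2), rep3)
print(consensus)
```

Only the region supported by all three toy replicates survives, which illustrates why this procedure is conservative.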

Question: How many consensus peaks do we have for each condition?

Differential analysis

Goal: Treatment of MCF-7 cells using siRNA to GATA3 is expected to induce a re-localization of ER binding sites. Hence, we want to compare the 2 consensus sets of peaks to determine common/specific peaks. We will compare a "naive" approach with a more quantitative approach.

1 - Simple approach

Procedure

Use the tool intersectBed, and make the intersection of the two consensus peak sets for siNT and siGATA

Questions:

How many common peaks do we have?
How many specific peaks do we have for siNT and siGATA?
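Counting common and specific peaks amounts to asking, for each peak of one set, whether it overlaps any peak of the other set. The sketch below shows this on invented peaks; it is comparable in spirit to the -u and -v options of intersectBed, but the function names and data are illustrative only.

```python
def overlaps(peak, others):
    """True if peak overlaps at least one peak in others (BED coordinates)."""
    chrom, start, end = peak
    return any(c == chrom and s < end and start < e for c, s, e in others)


def split_common_specific(a, b):
    """Split the peaks of a into those overlapping b and those that do not."""
    common = [peak for peak in a if overlaps(peak, b)]
    specific = [peak for peak in a if not overlaps(peak, b)]
    return common, specific


# Invented consensus peaks for the two conditions.
si_nt = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 90)]
si_gata = [("chr1", 150, 250), ("chr3", 5, 50)]

common_nt, specific_nt = split_common_specific(si_nt, si_gata)
common_gata, specific_gata = split_common_specific(si_gata, si_nt)
print("siNT: common", len(common_nt), "specific", len(specific_nt))
print("siGATA: common", len(common_gata), "specific", len(specific_gata))
```

Note that the number of common peaks can differ depending on which set is taken as reference, since one peak may overlap several peaks of the other set.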

2 - Quantitative differential analysis using DiffBind

Having replicates, we can perform a quantitative analysis to identify differentially bound regions. This method is based on the read counts in certain regions, and the identification of regions that show a significant difference in read counts between 2 conditions. This is very similar to the analysis of differential expression in RNA-seq, and indeed the underlying statistical models are often the same.
Several tools exist, but not all of them are available in Galaxy. We will work with the output file of a tool called DiffBind, which is part of the Bioconductor packages.
DiffBind works by focusing on peak regions shared between a certain number of samples (here: 2 or 3 samples). These regions are then analyzed for differential binding using either edgeR or DESeq2.
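The notion of "regions shared between a certain number of samples" can be illustrated with a small sweep over peak boundaries. This is only a sketch of the idea, not DiffBind's algorithm: it reports the stretches covered by peaks from at least `min_samples` peak sets, assuming that peaks within one sample do not overlap each other. The function name and the toy peaks are invented.

```python
from collections import defaultdict


def regions_shared_by(peak_sets, min_samples):
    """Return regions covered by peaks from at least min_samples peak sets.

    peak_sets: one list of (chrom, start, end) peaks per sample.
    """
    events = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            events[chrom].append((start, 1))   # a peak opens
            events[chrom].append((end, -1))    # a peak closes
    shared = []
    for chrom in sorted(events):
        depth = 0
        region_start = None
        # At equal positions, closings (-1) sort before openings (+1),
        # so peaks that merely touch are not counted as overlapping.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if depth >= min_samples and region_start is None:
                region_start = pos
            elif depth < min_samples and region_start is not None:
                shared.append((chrom, region_start, pos))
                region_start = None
    return shared


# Invented peaks for three samples.
toy_sets = [
    [("chr1", 100, 300)],
    [("chr1", 200, 400)],
    [("chr1", 250, 500)],
]
print(regions_shared_by(toy_sets, min_samples=2))
print(regions_shared_by(toy_sets, min_samples=3))
```

Requiring support from 3 samples instead of 2 yields fewer and narrower regions, which is worth keeping in mind when comparing the two result files used below.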

Procedure

The files are located in your Galaxy history as ER.m2.db.bed and ER.m3.db.bed
Determine the number of differential peaks in the 2 files ER.m2.db.bed and ER.m3.db.bed
Import the 2 bed files into the IGV session; your IGV session should contain
- the coverage files for the 3 siNT and siGATA replicates (bigwig, 6 tracks)
- the MACS output files for each replicate (bed, 6 tracks)
- the consensus peak files for both conditions (bed, 2 tracks)
- the differentially bound regions as determined by DiffBind (bed, 2 tracks)

Question

Zoom into several differentially bound (DB) regions, and look at the peaks determined by MACS.
Can you find cases in which
- a DB region has MACS peaks for both conditions?
- a DB region has peaks only for one condition?
- a region with MACS peaks in one condition and not the other is NOT differentially bound?
Take screenshots of these different situations and try to find an explanation for each of these cases.

Instructors

Datasets

Theodorou V, Stark R, Menon S, Carroll JS. GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility. Genome Research, 23(1):12-22. 2013 [Pubmed] [Article]
Data and accession: full list