Analysis of ChIP-seq data

For this tutorial we will use ChIP-Seq datasets produced by Theodorou et al. The authors used ChIP-Seq technology to systematically identify ESR1 binding regions across the human genome. Importantly, they demonstrated that knock-down of GATA3 through siRNA greatly affects ESR1 binding. The abstract of this article is provided below.

Abstract

Estrogen receptor (ESR1) drives growth in the majority of human breast cancers by binding to regulatory elements and inducing transcription events that promote tumor growth. Differences in enhancer occupancy by ESR1 contribute to the diverse expression profiles and clinical outcome observed in breast cancer patients. GATA3 is an ESR1-cooperating transcription factor mutated in breast tumors; however, its genomic properties are not fully defined.

In order to investigate the composition of enhancers involved in estrogen-induced transcription and the potential role of GATA3, we performed extensive ChIP-sequencing in unstimulated breast cancer cells and following estrogen treatment. We find that GATA3 is pivotal in mediating enhancer accessibility at regulatory regions involved in ESR1-mediated transcription. GATA3 silencing resulted in a global redistribution of cofactors and active histone marks prior to estrogen stimulation. These global genomic changes altered the ESR1-binding profile that subsequently occurred following estrogen, with events exhibiting both loss and gain in binding affinity, implying a GATA3-mediated redistribution of ESR1 binding. The GATA3-mediated redistributed ESR1 profile correlated with changes in gene expression, suggestive of its functionality. Chromatin loops at the TFF locus involving ESR1-bound enhancers occurred independently of ESR1 when GATA3 was silenced, indicating that GATA3, when present on the chromatin, may serve as a licensing factor for estrogen-ESR1-mediated interactions between cis-regulatory elements. Together, these experiments suggest that GATA3 directly impacts ESR1 enhancer accessibility, and may potentially explain the contribution of mutant-GATA3 in the heterogeneity of ESR1+ breast cancer.

Getting information about the experiment using the GEO and SRA websites

Gene Expression Omnibus (GEO) is a public repository that provides tools to submit, access and mine functional genomics data. Data may come from array- or sequence-based technologies. For HTS data, GEO provides both processed data (such as *.bam, *.bed, *.wig files) and links to raw data. Raw data are available from the Sequence Read Archive (SRA) database, which stores reads from many platforms (including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics). Both web sites offer search engines to query their databases.

Go to the GEO web site.
Choose "Search" and paste GSE40129 (GSE stands for GEO Series Experiment). Click "GO" to get information about this experiment.
In the "Samples" section (middle of the page), click on "More" to display all sample names.
Click on the GSM986059 hyperlink (GSM stands for GEO SaMple) to get information about this sample.
In the "Relations" section, select the "SRX176856" hyperlink to open the SRA page corresponding to this sample.
Click on the SRR link (bottom right) to access the record of the run.
On the new page, click on the Reads tab to view the read sequences (you can display the qualities by clicking on Customize).
From there, you could also download the dataset as a .sra file, but we will not do it in this practical (beware: this would take time and occupy disk space, since SRA files typically weigh several hundred MB!).

Which HTS platform was used to sequence this sample?
Is this experiment single-end or paired-end sequencing?
How many runs (i.e. lanes) are associated with this sample?
How many reads were produced (# of Spots)?
Select the SRR540192 hyperlink. What is the sequence of the first read?
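
The accession prefixes met in the steps above can be summarised in a few lines of Python. This is only an illustrative sketch: the function name describe_accession and the lookup table are hypothetical helpers written for this practical, not part of any GEO or SRA client.

```python
import re

# Accession prefixes used in this tutorial (GEO and SRA records).
PREFIXES = {
    "GSE": "GEO series",
    "GSM": "GEO sample",
    "SRX": "SRA experiment",
    "SRR": "SRA run",
}

def describe_accession(accession):
    """Return the record type for an accession such as 'GSM986059'."""
    match = re.fullmatch(r"([A-Z]{3})(\d+)", accession.strip())
    if match is None or match.group(1) not in PREFIXES:
        return "unknown"
    return PREFIXES[match.group(1)]

for acc in ["GSE40129", "GSM986059", "SRX176856", "SRR540192"]:
    print(acc, "->", describe_accession(acc))
```

The hierarchy reads from top to bottom: a series groups samples, each sample points to an SRA experiment, and each experiment contains one or more runs.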

Connecting to the Galaxy server

Open a connection to the pedagogix Galaxy server.
Log in (Login command in the User menu at the top of the Galaxy window). If this is your first connection, use the Register command.

Quality control of sequencing data

Loading fastq files in Galaxy

Analysis of the whole dataset can be time consuming. Thus, in order to illustrate the mapping procedure, the data were previously retrieved from SRA, converted to fastq using the SRA Toolkit (fastq-dump command) and mapped to the human genome. A subset of reads that aligned to chromosome 21 was extracted and will be used for this tutorial. Although the analysis can be performed programmatically (using a shell script for instance), here we will use the Galaxy framework. A subset of the run SRR540192 (ChIP Estrogen Receptor on MCF-7 treated with E2) is available for download (see below). The input sample will be processed in a later section.

In the upper left corner, click on Unnamed history and rename this workspace to 'ER_chr21_mapping'.
In the left menu, select Get Data > Upload File. Set File format to fastq, copy and paste the URL below in the URL/text area and set genome to hg19.

http://denis.puthier.perso.luminy.univ-amu.fr/COURSES/CHIP-SEQ/PRACTICAL/data/siNT_ER_E2_r3_SRX176860_chr21_0.6_Noise.fastq.gz

Select NGS TOOLS > NGS: QC and manipulation > FASTQ Groomer. Set File to groom to 'siNT_ER_E2_r3.SRR540192.chr21.andNoise.fastq' and press Execute.
Rename the output of the previous step to 'siNT_ER_E2_r3.chr21.groom' (use the pencil icon associated with each item of the history to edit its attributes).
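
To get a feeling for what a fastq file contains, the following minimal Python sketch reads 4-line fastq records and decodes the quality string. It assumes Sanger (Phred+33) encoded qualities; the two example reads are made up for illustration, and the function names are ours rather than those of any Galaxy tool.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or blank line): stop reading
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in " + header)
        yield header[1:], sequence, quality

def phred33(quality):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality]

# Hypothetical two-read example; a real file would be opened with gzip.open(path, "rt").
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGCA\n+\nII5#!\n")
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, phred33(quality))
```

Each read thus carries one quality score per base, which is the information used by the quality control and trimming steps that follow.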

Quality control with FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems you should be aware of before doing any further analysis. FastQC can be run in three ways: as a standalone interactive application for the immediate analysis of a small number of FastQ files; in a non-interactive mode (through shell commands), suitable for integration into a larger pipeline that systematically processes large numbers of files; or through the Galaxy framework.

It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should treat the summary evaluations therefore as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.

Use NGS TOOLS > NGS: QC and manipulation > FastQC:Read QC.
Select 'siNT_ER_E2_r3.chr21.groom' in Short read data from your current history dropdown list. Press execute.
Display the data for the corresponding result in your history (right panel).
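
One of the plots shown in the report summarises base quality along the reads. The idea can be sketched in Python as below. This is a simplified illustration, not FastQC's actual implementation: it assumes Phred+33 qualities, reports only the mean per position, and uses made-up quality strings.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                # First time this position is seen: extend both accumulators.
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]

# Hypothetical quality strings for three reads of different lengths.
qualities = ["IIII5", "II5#", "I5"]
profile = mean_quality_per_position(qualities)
for position, value in enumerate(profile, start=1):
    print(position, round(value, 2))
```

A profile that drops toward the right end of the reads, as in this toy example, is the typical pattern that motivates trimming.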

Read trimming and filtering

Read trimming is a pre-processing step in which read ends with poor quality values are cut (most generally the right end). This step is crucial when working with many aligners such as bowtie. Indeed, as bowtie does not perform "hard-clipping" (that is, clipping sequence NOT present in the reference), it may be unable to align a large fraction of the dataset when poor quality ends are kept. Several programs can be used to perform sequence trimming; here we will use sickle.

Search for the sickle tool using the Galaxy search engine (upper left corner). Select the sickle tool.
From the reads fastq file dropdown list, select 'siNT_ER_E2_r3.chr21.groom'. Set Quality Threshold to 20, Length Threshold to 25, min_len to 25 and press Execute.
Rename the output to 'siNT_ER_E2_r3.chr21.sickle'.
Perform a new FastQC analysis using the trimmed reads as input. The number of reads should be reduced.
Check the proportion of duplicate reads ('Sequence Duplication Levels'). A high level of PCR duplicates suggests that too little material was provided for sequencing (poor library complexity).
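
The effect of the two thresholds used above (quality 20, minimal length 25) can be illustrated with a short Python sketch. Note that sickle itself relies on a sliding-window approach; the simpler rule below, which cuts bases from the 3' end until one reaches the threshold, is a stand-in chosen for readability. Phred+33 qualities are assumed and the example read is hypothetical.

```python
def trim_3prime(sequence, quality, threshold=20, min_length=25, offset=33):
    """Cut low-quality bases from the 3' end; return None if the read becomes too short."""
    end = len(sequence)
    # Walk backwards while the last kept base is below the quality threshold.
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    if end < min_length:
        return None  # read discarded by the length filter
    return sequence[:end], quality[:end]

# Hypothetical 30-base read whose last 4 bases have quality 2 ('#').
kept = trim_3prime("A" * 30, "I" * 26 + "####")
print(len(kept[0]))

# Same read with 10 poor bases: only 20 bases remain, so it is discarded.
dropped = trim_3prime("A" * 30, "I" * 20 + "#" * 10)
print(dropped)
```

This shows why the number of reads decreases after trimming: reads that become shorter than the length threshold are removed altogether.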

Mapping reads with bowtie

Among the genome aligners, bowtie is one of the most popular, mostly because it can achieve fast alignment of millions of reads. Although the mapping strategy differs between versions 1 and 2, the overall pipeline is identical. Bowtie uses a "seed and extend" strategy, meaning that it will first try to find matches for the 5' ends of the reads (the seeds, whose length is controlled through the -l argument) in the reference genome (using an index built with the Burrows-Wheeler Transform algorithm). In the second step, it will try to extend these matches using dynamic programming.
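
The "seed and extend" idea can be sketched in a few lines of Python. This toy version is not how bowtie is implemented: a plain dictionary of substrings replaces the Burrows-Wheeler index, seeds must match exactly, and the extension simply counts mismatches instead of using dynamic programming. The reference and read sequences are made up.

```python
from collections import defaultdict

def build_index(reference, seed_length):
    """Map every substring of length seed_length to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - seed_length + 1):
        index[reference[start:start + seed_length]].append(start)
    return index

def align(read, reference, index, seed_length, max_mismatches=2):
    """Return (position, mismatches) for each seed hit that extends within the mismatch budget."""
    hits = []
    # Step 1: look up the 5' seed of the read in the index.
    for start in index.get(read[:seed_length], []):
        candidate = reference[start:start + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the reference
        # Step 2: extend the seed over the full read and count mismatches.
        mismatches = sum(1 for a, b in zip(read, candidate) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "GATTACAGATTACCGTAGGCTA"
index = build_index(reference, seed_length=5)
hits = align("GATTACCG", reference, index, seed_length=5)
print(hits)
```

Here the seed GATTA occurs twice in the reference, so the read gets two candidate positions; only the extension step tells them apart.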

Bowtie offers many parameters that can modify the way alignment is performed. In the case of ChIP-Seq analysis, one crucial issue is to control for multi-reads (reads that map to several positions in the reference genome), which may produce artificial peaks. This behaviour can be controlled through the -m argument. Here, we will instruct bowtie to discard multi-reads (although more advanced policies have been proposed).
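
The policy described above amounts to a simple filter, sketched here in Python. The dictionary of alignments and the function name keep_unique are hypothetical; in practice bowtie applies this rule internally during mapping.

```python
def keep_unique(alignments, max_hits=1):
    """Discard unaligned reads and every read with more than max_hits reportable alignments."""
    return {read: hits for read, hits in alignments.items() if 0 < len(hits) <= max_hits}

# Hypothetical alignments: read name -> list of (chromosome, position).
alignments = {
    "read1": [("chr21", 1000)],
    "read2": [("chr21", 2000), ("chr21", 5000)],  # multi-read
    "read3": [],                                  # unaligned
}
unique = keep_unique(alignments, max_hits=1)
print(sorted(unique))
```

With max_hits set to 1, only reads with a single reportable alignment survive, which is the setting used in this practical.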

From the tool panel, select Get data > Upload File. Fill the form as follows:

Set File Format to "fasta".
In the URL/Text area, copy and paste: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz
Set genome to hg19 and press Execute.

From the tool panel, select NGS TOOLS > NGS: Mapping > Map with Bowtie for Illumina. Fill the form as follows:
- Set "Will you select a reference genome from your history or use a built-in index" to "Use one from the history".
- Set "Select the reference genome" to chr21.fa.
- Set "Bowtie settings to use" to 'full parameter list'.
- Set "Suppress all alignments for a read if more than n reportable alignments exist" to '1'.
- Press Execute.

Once the mapping has finished, select the flagstat tool from the toolbox to compute some simple statistics about read mapping.
Download the result from bowtie (a sam file).
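
The kind of statistics reported by flagstat can be reproduced on a sam file with a short Python sketch. It reads the FLAG column (second field) of each alignment line, where bit 0x4 marks an unmapped read and bit 0x10 a read aligned to the reverse strand. The three sam lines below are made up for illustration, and the output is much simpler than the real flagstat report.

```python
def sam_stats(lines):
    """Count total, mapped and reverse-strand reads from SAM text lines."""
    stats = {"total": 0, "mapped": 0, "reverse": 0}
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        flag = int(line.split("\t")[1])
        stats["total"] += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            stats["mapped"] += 1
            if flag & 0x10:  # bit 0x10 set means reverse strand
                stats["reverse"] += 1
    return stats

# Hypothetical SAM content: one header line and three reads.
sam_lines = [
    "@HD\tVN:1.0",
    "r1\t0\tchr21\t100\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t16\tchr21\t250\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
stats = sam_stats(sam_lines)
print(stats)
```

On a real file, the lines would be read with open(path) instead of being listed by hand.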

Viewing the results with the Integrative Genomics Viewer (IGV)

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

Download IGV and launch it with 750 MB or 1.2 GB of memory, depending on your machine.
Select the hg19 genome and go to chromosome 21.
Select Tools > Run igvtools. Select command > sort, select input file and browse to the sam file. Press Run.
Select command > index, select input file and browse to the sorted sam file. Press Run.
Select command > count, select input file and browse to the sorted sam file (not the *.bai file!). Press Run.
Close the igvtools window.
Load the tdf and sorted sam files.
Go to the TFF1 gene.
Zoom out and select regions displaying high signal in the tdf track.

NB: The tdf file is an IGV-specific format that is close to the bigWig format (the compressed version of the wig format).
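
The signal displayed in such a coverage track can be approximated with the Python sketch below, which computes the average per-base coverage in fixed-size windows. This only illustrates the principle; it does not reproduce igvtools or the tdf format. The window size of 25, the fixed read length and the read positions are assumptions made for the example.

```python
from collections import Counter

def binned_coverage(read_starts, read_length, window=25):
    """Average per-base coverage in fixed windows, from 0-based read start positions."""
    covered = Counter()
    for start in read_starts:
        # Add one unit of coverage to the window of every base spanned by the read.
        for position in range(start, start + read_length):
            covered[position // window] += 1
    # Key each window by its start coordinate and divide by the window size.
    return {bin_index * window: count / window for bin_index, count in sorted(covered.items())}

# Hypothetical reads of 20 bases starting at positions 0, 10 and 40.
coverage = binned_coverage([0, 10, 40], read_length=20, window=25)
for window_start, value in coverage.items():
    print(window_start, value)
```

Regions bound by the factor accumulate many overlapping reads and therefore stand out as windows with high values.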

Mapping reads with bowtie (input)

Using the same procedure, align the reads obtained from the input sample to the hg19 genome. The input will be used to model the local genomic background in the peak-calling step.

Input reads are available from Shared data > data libraries > tp-mardi-chipseq-herrmann > fastq MCF-7_input_r3.SRR540220.chr21.andNoise.fastq.
