Quality filtering and read mapping


CHiP-Seq dataset description

For this tutorial we will use CHiP-Seq datasets produced by Theodorou et al. The authors used ChIP-Seq technology in order to systematically identify ESR1 binding regions across the human genome. Importantly, they demonstrated that knock-down of GATA3 through siRNA greatly affect ESR1 binding. The corresponding abstract of this article is provided below.


Estrogen receptor (ESR1) drives growth in the majority of human breast cancers by binding to regulatory elements and inducing transcription events that promote tumor growth. Differences in enhancer occupancy by ESR1 contribute to the diverse expression profiles and clinical outcome observed in breast cancer patients. GATA3 is an ESR1-cooperating transcription factor mutated in breast tumors; however, its genomic properties are not fully defined.

In order to investigate the composition of enhancers involved in estrogen-induced transcription and the potential role of GATA3, we performed extensive ChIP-sequencing in unstimulated breast cancer cells and following estrogen treatment. We find that GATA3 is pivotal in mediating enhancer accessibility at regulatory regions involved in ESR1-mediated transcription. GATA3 silencing resulted in a global redistribution of cofactors and active histone marks prior to estrogen stimulation. These global genomic changes altered the ESR1-binding profile that subsequently occurred following estrogen, with events exhibiting both loss and gain in binding affinity, implying a GATA3-mediated redistribution of ESR1 binding. The GATA3-mediated redistributed ESR1 profile correlated with changes in gene expression, suggestive of its functionality. Chromatin loops at the TFF locus involving ESR1-bound enhancers occurred independently of ESR1 when GATA3 was silenced, indicating that GATA3, when present on the chromatin, may serve as a licensing factor for estrogen-ESR1-mediated interactions between cis-regulatory elements. Together, these experiments suggest that GATA3 directly impacts ESR1 enhancer accessibility, and may potentially explain the contribution of mutant-GATA3 in the heterogeneity of ESR1+ breast cancer.

Getting informations about the experiment using the GEO and SRA websites

Gene Expression Omnibus (GEO) is a public repository that provide tools to submit, access and mine functional genomics data. Data may be related to array- or sequence-based technologies. For HTS data, GEO provides both processed data (such as *.bam, *.bed, *.wig files) and links to raw data. Raw data are available from the Sequence Read Archive (SRA) database (including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics). Both web sites propose search engines to query their databases.

Connecting to the Galaxy server

Quality control of sequencing data

Loading fastq files in galaxy

Analysis of the whole dataset can be time consuming. Thus, in order to illustrate the mapping procedure, data were previously retrieved from SRA, fastq-transformed using SRA toolkit (fastq-dump command) and mapped to the human genome. A subset of reads that aligned onto chromosome 21 was extracted and will be used for this tutorial. Although analysis can be performed programmatically (using a shell script for instance), here, we will use the Galaxy framework. A subset of the run SRR540192 (ChIP Estrogen Receptor on MCF-7 treated with E2) is available for download (see below). The input will be processed in the later sections.

Quality control with FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. FastQC can be run as a stand alone interactive application for the immediate analysis of small numbers of FastQ files, in a non-interactive mode (through shell commands) where it would be suitable for integrating into a larger analysis pipeline for the systematic processing of large numbers of files or through the Galaxy framework.

It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should treat the summary evaluations therefore as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.

Read trimming and filtering

Read trimming is a pre-processing step in which input read ends with poor quality values are cut (most generally the right end). However one should keep in mind that this step is crucial when working with numerous aligners such as bowtie. Indeed as bowtie does not perform "hard-clipping" (that is clip sequence NOT present in the reference) it may be unable to align a large fraction of the dataset when poor quality ends are kept. Several software may be used to perform sequence trimming :


Here we will use sickle.

Mapping reads with bowtie

Among the genome aligners, bowtie is one of a most popular mostly because it can achieve fast alignment of millions of reads. Although, the mapping strategy differs between version 1 and 2, the overall pipeline is identical. Bowtie uses a "seed and extend" strategy meaning that it will first try to find matches for 5' ends of the reads (the seeds, whose length is controlled through -l arguments) in the reference genome (using an index build using Burrows Wheeler Transform algorithm). In the second step, it will try to extend these matches using dynamic programming.

Bowtie offers many parameters that can modify the way alignment is performed. In the case of ChIP-Seq analysis, one crucial issue is to control for multi-reads (reads that map to several positions onto the reference genome) that may produce artificial peaks. This parameter may be controlled trough the -m arguments. Here, we will instruct bowtie to discard multihits (although more advanced policies have been proposed).

Viewing the results with Integrated Genome Browser (IGV).

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

NB:The tdf file is a IGV specific format that is closed to the bigWig format (the compressed version of wig format).

Mapping reads with bowtie (input)

Using the same procedure, align reads obtained from the input sample on hg19 genome. The input will be used to model the local genomic background in the Peak-Calling step.