Introduction to Unix for Bioinformatics

Connect to the computer using your login and password.
Open a terminal. You should find a terminal (terminal, konsole, terminator,...) under the Application Menu: Applications > Accessories > Terminal

Using this terminal you should now be able to communicate with the system through a set of shell commands (your default shell command interpreter should be BASH).

Creating a working directory

Using the mkdir command (make directory) create a working directory named td1_genomic_coordinates in your HOME folder.
Use the cd (change directory) command to enter the td1_genomic_coordinates directory

Solution

	cd ~             # or cd
	pwd              # check the current working directory
	mkdir td1_genomic_coordinates
	cd td1_genomic_coordinates # or cd ~/td1_genomic_coordinates

Downloading RefSeq transcript coordinates

RefSeq transcript

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences. Here we will download informations about RefSeq transcripts (that is known full-length transcript) related to Mus musculus genome (mm9 version). Refseq transcript can be obtained from UCSC genome browser.

Go to the UCSC web site
In the middle of the left menu select Downloads.
Select the Mouse genome.
In the section July 2007 (mm9), select Annotation database to enter the UCSC ftp web site.
Find the file refGene.txt.gz, and copy the link location (URL) by right-clicking on the link.
In a terminal use the wget command followed by this URL to retrieve the refGene.txt.gz file.
Uncompress the file using the command gunzip command.

Solution

	
	wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/refGene.txt.gz
	gunzip refGene.txt.gz

Go to the UCSC web site
Select Tables in the top menu
Select
- clade: Mammal,
- genome: Mouse,
- assembly: mm9,
- group: Genes and Gene Predictions,
- track: RefSeq Genes,
- table: refGene,
and click describe table schema.

The UCSC Web page describes the column content of the refGene table that we just downloaded. This table contains one line per "object" (gene), and one column per attribute. This description table is of course crucial to understand the content of the refGene.txt file that we downloaded in the previous step.

Use the wc command with -l arguments to count the number of lines in the refGene.txt file
Use the head command with -n arguments to see the first 5 lines of the refGene.txt file.
Use the tail command with -n arguments to see the last 5 lines of the refGene.txt file.
Have a look at the whole file using less (use up/down arrow keys and type "q" to quit))
Using head extract the first line and use a pipe to redirect the output to the od command with -c argument.

Solution

		   
	wc -l refGene.txt
	head -n 5 refGene.txt
	tail -n 5 refGene.txt
	less refGene.txt
	head -n 1 refGene.txt  | od -c

Some descriptive statistics on refSeq transcript using basic unix command

Using the cut command retrieve the 13th column.
Using the cut command retrieve the 13th column and redirect the output to the grep command to find genes whose symbol start with Bcl. Count them by redirecting the output to the wc -l command.
What is the gene that display the highest number of transcripts (use sort and uniq with -c argument). Check the results by typing the gene symbol in the UCSC genome browser (top menu > genomes)
What is the gene that display the highest number of exons.
Some transcrits may be coding (NM_...) or non-coding (NR_...). How many transcripts of each class are available in the refGene.txt file?
How many genes does each chromosome holds?

Solution

	cut -f13 refGene.txt
	cut -f13 refGene.txt | grep "^Bcl"
	cut -f13 refGene.txt | grep "^Bcl" | sort | uniq | wc -l		
	cut -f13 refGene.txt | sort | uniq -c | sort -n | tail -1
	cut -f13 refGene.txt | sort | uniq -c | sort -n | tail -1
	cut -f9,13 refGene.txt | sort -n | tail -1
	cut -f2 refGene.txt  | sed 's/_.*//' | sort | uniq -c
	cut -f3 refGene.txt  | sort | uniq -c | sort -n

The Bedtools suite

What is bedtools

Installing the bedtools suite

Check if the bedtools suite is already installed on your system (type intersectBed in a terminal). If bedtools is not available, follow the procedure below to perform installation.

Transforming refGene file into bed format

First we need to transform the refGene.txt file into a bed to use it as input for the bedtools suite. We will do it with awk.

Upload the resulting bed file into UCSC (top menu, genome > add custom track > choose file > submit). Go to the genome browser to visualize the new track.

Getting promoter regions of each transcript

To obtain promoter regions of each transcript we will first extract the TSS coordinates of each transcript. The TSS will correspond to the txStart or txEnd coordinates depending on wether genes is located on the positive or negative strand respectively ( see refSeq table definition).

Now we will use slopBed to increase the size of each feature (TSS) to define a promoter region around the TSS ([-2000,+500]). Also this can be done with a simple awk command, sploBed will restrict the resizing to the size of the chromosome (i.e. no start < 0 and no end > chromosome size). We thus need to download the length of each chromosome (here from UCSC ftp site).

Now we can run slopBed. We will also reorder features (TSS) according to chromosomes coordinates

Check the results by loading the refGene_Prom_2000_500.bed as a new user track in the UCSC genome browser.

NB: The RSAT web server also proposes advance tools for retrieving promoter regions (excluding coding regions, repeat elements,...). This tools can also be run using UNIX command lines.

Intersecting promoter regions with predicted transcription factor binding sites

The dataset that will be used thereafter was obtained from the following publication:

In this article, the authors scanned vertevrates alignments in search for highly conserved putative transcription factor binding sites (TFBS). The resulting file contains the coordinates of the predicted TFBS (mm9 genome version)

Download the conserved_predicted_TFBS_mm9.bed file
Use intersectBed to find for each promoter region its associated sets of predicted TFBS.
Using grep and sed, select the non-redondant list of targets for transcription factor E2F (represented by M00920:V$E2F_Q6_01 identifier), or ETS (represented by M00771:V$ETS_Q4 identifier), or MEF2A (represented by M00941:V$MEF2_Q6_01 identifier), or NFKB (represented by M00054:V$NFKAPPAB_01 identifier) or NRSF (represented by M01028:V$NRSF_Q4 identifier).

Solution

	wget http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/data/cis-reg/conserved_predicted_TFBS_mm9.bed
	head conserved_predicted_TFBS_mm9.bed
	intersectBed -wa -wb -a refGene_Prom_2000_500.bed -b conserved_predicted_TFBS_mm9.bed > conservedsites_prom_intersect.bed
	grep  -w M00920 conservedsites_prom_intersect.bed   | sed 's/.*|//' | cut -f1 | sort | uniq > M00920_targets.txt
	grep  -w M00771  conservedsites_prom_intersect.bed  | sed 's/.*|//' | cut -f1 | sort | uniq > M00771_targets.txt
	grep  -w M00941  conservedsites_prom_intersect.bed  | sed 's/.*|//' | cut -f1 | sort | uniq >  M00941_targets.txt
	grep  -w M00054  conservedsites_prom_intersect.bed  | sed 's/.*|//' | cut -f1 | sort | uniq >  M00054_targets.txt
	grep  -w M01028  conservedsites_prom_intersect.bed  | sed 's/.*|//' | cut -f1 | sort | uniq >  M01028_targets.txt

Functional annotation using Database for Annotation, Visualization and Integrated Discovery

We have seen that each transcription factor is associated with a set of putative targets. What can we learn from these targets? Is the transcription factor associated to genes involved in some particular biological functions?

To answer this question we will use the Database for Annotation, Visualization and Integrated Discovery (DAVID). DAVID will try to find whether the selected genes share some particular features, by analyzing their annotations from several sources including Gene Ontology, KEGG pathways, Reactome Pathways, genomic locations, ...

Go to the DAVID web site.
In the left menu select Functional annotation.
In the left menu select the upload tab. Paste the list of genes associated to one of the transcription factors.
Select step 2 > OFFICIAL_GENE_SYMBOL.
Select step 3 > Gene List.
Press Submit List.
The program now displays the "Gene list manager", with a list of species sorted by decreasing number of matching gene names. Select "Mus musculus" and click "Select species".
Optionally, you can click on the list under "List manager" and rename if (for example, name it "M00920 E2F mouse").
In the "Annotation Summary Results" page, click the button Functional Annotation Chart.
In the new window, select option and ask for Fold Enrichment, Bonferroni, Benjamini, FDR, Fisher Exact, LT,PH,PT

What can you guess from the results?

Which are the most significant functional classes?
How do you interpret their P-value?
Is the significance proportional to the fold enrichment?
Compare the P-values obtained after correction for multiple testing (Boneroni, Benjamini).
Evaluate the results at the bottom of the table: can these associations be considered as significant? Why?

Gene list

The list of E2F target genes predicted with the matrix M00920 is provided in the following file.

M00920_targets.txt

Interpretation

The figure below shows a snapshot of the top rows of functional annotation chart returned by David.

The most significant association for our list of genes is the KEGG pathway "Cell cycle". The second row indicates a significant intersection with another source of annotations: the Gene Ontology term "cell cycle". Although these two functional classes have the same name, they come from different databases, with different annotation strategies. To avoid redundancy, the interface of DAVID allows you to selectively display some subsets of the Gene Ontology annotations.

Let us analyze in detail the second result row (GOTERM_BP_FAT cell cycle).

The column "count" gives us the most directly interpretable data: 19 of the 80 genes submitted to David are associated to the term "cell cycle" in the GOTERM_BP_FAT (Gene Ontology, Biological Processes, "FAT" annotations) annotations. Let us denote this number by the symbol x=19.
The column PT indicates the total number of genes having some annotation in GOTERM_BP_FAT (N=13588). This does not cover the total set of human genes (~25,000). However, since the other genes have no annotation in GOTERM_BP_FAT, we would not be able to test if they do or not belong to the cell cycle annotations. We will thus consider that the "universe" of possibilities is restricted to the N genes having at least some annotation in GOTERM_BP_FAT.
The column LT indicates that, among the 80 genes submitted to DAVID, only k=59 have some annotation in GOTERM_BP_FAT. This is the maximal number of genes which could match some biological process in GO annotations.
The colum PH indicates the number of genes associated to the cell cycle in GOTERM_BP_FAT: m=611.

Introduction to Unix for Bioinformatics

Handling genomic coordinates

Contents

Opening a session.

Starting a terminal

Creating a working directory

Solution

Downloading RefSeq transcript coordinates

RefSeq transcript

Solution

Solution

Some descriptive statistics on refSeq transcript using basic unix command

Solution

The Bedtools suite

What is bedtools

Installing the bedtools suite

Transforming refGene file into bed format

Getting promoter regions of each transcript

Intersecting promoter regions with predicted transcription factor binding sites

Solution

Functional annotation using Database for Annotation, Visualization and Integrated Discovery

Gene list

Interpretation