Discovering the human genome with UNIX - Statistics on genes and transcripts

Bioinformatics is highly dependent on Linux-based computers and software. First, with the advent of new technologies (e.g. sequencing), analysis tend to rely more and more on powerful computing clusters. Second, Linux/UNIX systems come with pre-installed versions of most popular programing languages and thus tend to become the natural platform for newly developed software(some programs are exclusively developed for Linux-based systems). Thus, mastering the UNIX/Linux environment and interface is a critical step required to perform lots of bioinformatics analysis.

In the present session we will aim at computing some basic statistics on the human genome (number of transcripts, number of genes, number of genes per chromosomes, number of exons,...)

To this aim we will use information produced by The Reference Sequence Database (RefSeq). This project provides a comprehensive but not redundant list of known transcripts. Refseq transcript coordinates will be obtained from UCSC genome browser.

Connection

Starting a session.

Launching a terminal

Improving the terminal display

Before going any further...

The first time you open a terminal, it may be rather difficult to read the information displayed in the window. To improve readability we will configure the terminal by modifying the .bashrc file located in your home directory. This file can contain various information related to the behavior of the commands.

First, in order to make a backup, we will copy (cp) the .bashrc file (hiden file) that is located in your home directory (~) to .bashrc_back (hiden file) in your home directory.

NB: the # character indicates the beginning of a comment. The text following the comment char is ignored by the system. In the following examples, the actual commands are followed by comments to explain their usage.

At this step, the list should contain a .bash_rc file (if not, please ask the instructors).

Use the following command which tells the system to start gedit, a program that can be used to edit the content of the .bashrc file.

Tips

Add your settings at the end of the file, and do not delete the other lines.
The gedit text editor supports synctactic coloring, which is particularly useful for viewing and editing programming files. To highlight the syntactic elements of the bash file, activate the following option in gedit menu.
View -> Highlight mode -> Scripts -> sh
Do not worry if at this stage of the course, the following lines are not understandable for you. We perform this operation to improve the comfort for our future work, but these commands are already somewhat advanced seddings.

Save the file, close the terminal and open a new one (alternatively, you can stay in the existing terminal, and run the command source ~/.bashrc).

Creating a result directory

In your home directory, we will create a sub-directory called TD01_Bioinfo to store the results of the practicals. This will be used to store the datasets used in this practical session.

First, use the cd command to go into your home directory. Then use the pwd (print working directory) command that can be used to display the path of the current working directory (the one where you are currently located).

Now use the mkdir command to create a directory named TD01_Bioinfo.
Check that the directory has been created using the ls command.
Enter into the TD01_Bioinfo directory using the cd command.
Use the mkdir command to create a sub-directory named hg38_transcripts.
Enter into the hg38_transcripts directory using the cd command.
Check the current value of your working directory.

Solution

	
mkdir TD01_Bioinfo
ls 
cd TD01_Bioinfo
pwd
mkdir hg38_transcripts
cd hg38_transcripts
pwd

The result should be something like this:
/home/[your_login]/TD01_Bioinfo/hg38_transcripts

Retrieving RefSeq transcript coordinates

We will retrieve information on known human transcripts. This dataset is provided by the Reference Sequence Database (RefSeq) and available for download through the UCSC server.

Retrieving RefSeq annotations

Using your Web Browser, go to the UCSC web site.
In the middle of the left menu, select Downloads.
Select the Human genome.
In the "Dec. 2013 (hg38, GRCh38)" section, select "Annotation database". This will connect you to the UCSC FTP web site. After a rather long explanatory text, you will find an even longer list of available files.
Find the file refGene.txt.gz. Right-click on the link and select "Copy link adress" to get the URL.
In a terminal check that you are located in the hg38_transcripts directory.
Still in the terminal use the wget command followed by the URL to retrieve the refGene.txt.gz file.
Use the ls command to ensure that a file was created. You should see one file: refGene.txt.gz.
Get some help on the gunzip command. What is it for ?
Uncompress the file using the gunzip command.
Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
Rename (mv) the refGene.txt file to refGene_hg38.txt.
Use the ls command with -l argument to check that the file was rename. What are the permissions on refGene_hg38.txt file?

Solution

	
pwd                                                                         # check location
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz # download
ls                                                                          # get the list of files
man gunzip                                                                  # q to quit 
gunzip refGene.txt.gz                                                       # uncompress
ls                                                                          # no .gz extension in file name
mv refGene.txt refGene_hg38.txt                                             # rename file to refGene_hg38.txt
ls                                                                          # get the list of files

What does the RefGene_hg38.txt file contain?

Getting a description of each column

The file downloaded from UCSC (refGene_hg38.txt) does not contain any column name. An extensive description of these columns can be obtained through the Table Browser from the UCSC server.

Each transcript is associated to a set of annotations. For instance, the file will contain information about the chromosome and coordinates of the transcribed region (start and end positions), the strand (plus or minus), the number of exons...

Using your internet browser, go to the UCSC web site
Select Tables in the top menu
Select Clade: Mammal, Genome: Human, assembly: "Dec. 2013 (hg38, GRCh38)", group: Genes and Gene Prediction tracks, table: refGene. Click on describe table schema
Each line of the table describes a given column of the refGene file.
Leave this window open so that you can go back to it if needed.

Inspecting the file content

First lines, last lines...

Count the number of lines of the file refGene_hg38.txt.
Use the head command with -n arguments to see the first 5 lines of the file refGene_hg38.txt.
Use the tail command with -n arguments to see the last 5 lines of the file refGene_hg38.txt.
Have a look at the whole file using less (use up/down arrow keys and type "q" to quit))
Use the head command then the tail command (use a pipe) to select the 10th line of the file refGene_hg38.txt.
Extract the first line using head and use a pipe to redirect the output to the od command with -c argument. This will tell you the kind of character that are enclosed in the file. Based on the first line, does it seems to be a tabulated file ?

Solution

		   
wc -l refGene_hg38.txt
head -n 5 refGene_hg38.txt
tail -n 5 refGene_hg38.txt
less refGene_hg38.txt
head -n 10 refGene_hg38.txt | tail -n 1
head -n 1 refGene_hg38.txt  | od -c 	# columns seem to be separated by tabulations ("\t"). This is a tabulated file.

Some statistics on transcript and genes

Statistics on genes

Our knowledge of the human genome is evolving. New genes are still being discovered. In the file refGene_hg38.txt, each line corresponds to a transcript whose name/identifier is provided in the 2nd column. Note that a gene can correspond to one or more transcript. Indeed, some genes have several alternative promoters, or can be spliced differentially. The number of lines we counted in the previous exercise indicates the number of transcripts identified in the version 38 of the human genome (hg38).

The symbol/identifier of the gene associated to this transcript is annotated in the 13th column of the RefGenes file. What about our current knowledge of human transcripts and genes ?

What is the number of distinct genes in the version 38 of the human genome ? The result can be found by computing the number of non redondant gene symbols in the 13th column.

Statistics on transcripts

In a previous exercise, we used the wc command with the -l argument to count the number of lines (and thus the number of known transcripts in human) of the file refGene_hg38.txt.
Among them, how many coding transcripts have been identified in this version of the genome ? Using cut, grep and wc -l, select the 2nd column, select the lines starting with NM_ and count them (use pipe characters to concatenate the commands).
With almost the same command line, count the number of known non-coding transcripts.

Solution

wc -l refGene_hg38.txt                          # How many transcripts
cut -f2 refGene_hg38.txt | grep "^NM_" | wc -l  # How many coding transcripts
cut -f2 refGene_hg38.txt | grep "^NR_" | wc -l  # How many non-coding transcripts

Density of genes per chromosome

Chromosome sizes

Before looking at the density of genes per chromosome we will have a look at chromosome sizes. This information can be retrieved from the UCSC server (file chromInfo.txt.gz).

Using your internet browser, go to the UCSC web site
In the middle of the left menu select Downloads.
Select the Human genome.
In the "Dec. 2013 (hg38, GRCh38)" section select "Annotation database" to enter the UCSC ftp web site.
Find the file chromInfo.txt.gz. Right-click on the link and select "Copy link adress" to get the URL.
In a terminal check that you are located in the hg38_transcripts directory.
Still in the terminal use the wget command followed by the URL to retrieve the chromInfo.txt.gz file.
Use the ls command to ensure that a file was created. You should see the file chromInfo.txt.gz.
Uncompress the file using the gunzip command.
Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
Rename (mv) the chromInfo.txt file to chromInfo_hg38.txt
Using grep select lines containing informations regarding 'regular' chromosomes (chr1 to chr22, chrY and chrX) then use sort with -r, -n, and -k arguments to get the size of chromosomes in descending order. Use the manual (man) to get more information about the -n and -r and -k arguments of sort. Store the results in a file named chromInfo_hg38_regular.txt (use redirection operator >).
Inspect file content using the command less.

Do you see a relationship between chromosome names and sizes?

Solution

pwd
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromInfo.txt.gz 			# download chromosome informations
ls 																						# check that the file was downloaded
gunzip  chromInfo.txt.gz																# uncompress
mv chromInfo.txt.gz chromInfo_hg38.txt													# rename
grep -P "chr[0-9XY]+\t" chromInfo_hg38.txt| sort -rnk2 > chromInfo_hg38_regular.txt		# Select 'regular' chromosomes and perform a numerical sort (reverse order)
less chromInfo_hg38_regular.txt															# Inspect file content

Number of genes per chromosome

What is the number of genes per chromosome ? Take only 'regular' chromosomes into account.
Given this result and the size of chromosomes, what can you say ?

Solution

cut -f3,13 refGene_hg38.txt  | grep -P "chr[0-9XY]+\t" | sort | uniq | cut -f1 | uniq -c | sort -rn # number of genes per chromosome
    # Some chromosomes (e.g chr19) tend to be gene-rich

Some statistics on gene structures

Number of exons

What is the gene that contains the maximum number of exons ?
Using your Web browser, go to the UCSC web site. In the top menu, select Genomes. Select group: Mammal, genome: Human assembly:hg38. In the search term area paste the name of the gene. Select the first transcript of the list to visualize its genomic structure.

Writing a script

For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.

View solution| Hide solution

Solution

          #It's your turn to find the solution...

Providing arguments to the script.

For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.
The link for hg38 is http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
It could thus be written as http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME}/database/refGene.txt.gz

View solution| Hide solution

Solution

             #It's your turn to find the solution...

Discovering the human genome with UNIX - Basic statistics on genes and transcripts.

Contents

Introduction

Connection

Starting a session.

Launching a terminal

Improving the terminal display

Before going any further...

Tips

Creating a result directory

Solution

Retrieving RefSeq transcript coordinates

Retrieving RefSeq annotations

Solution

What does the RefGene_hg38.txt file contain?

Getting a description of each column

Inspecting the file content

First lines, last lines...

Solution

Some statistics on transcript and genes

Statistics on genes

Solution

Statistics on transcripts

Solution

Density of genes per chromosome

Chromosome sizes

Solution

Number of genes per chromosome

Solution

Some statistics on gene structures

Number of exons

Solution

Writing a script

Solution

Providing arguments to the script.

Solution

Looping

Solution