AMU :: Polytech Biotech 3A :: JGB53D :: 2014/2015 :: Jacques van Helden, Denis Puthier, Nicolas Terrapon

Discovering the human genome with UNIX - Basic statistics on genes and transcripts.

Contents


Introduction

Bioinformatics is highly dependent on Linux-based computers and software. First, with the advent of new technologies (e.g. sequencing), analysis tend to rely more and more on powerful computing clusters. Second, Linux/UNIX systems come with pre-installed versions of most popular programing languages and thus tend to become the natural platform for newly developed software(some programs are exclusively developed for Linux-based systems). Thus, mastering the UNIX/Linux environment and interface is a critical step required to perform lots of bioinformatics analysis.

In the present session we will aim at computing some basic statistics on the human genome (number of transcripts, number of genes, number of genes per chromosomes, number of exons,...)

To this aim we will use information produced by The Reference Sequence Database (RefSeq). This project provides a comprehensive but not redundant list of known transcripts. Refseq transcript coordinates will be obtained from UCSC genome browser.


Connection

Starting a session.

Connect to the computer using your login and password. Remember that the UNIX is case-sensitive. When you type your login and password (and more generally), be sure to use upper and lower case characters accurately.

Launching a terminal

Start a terminal using the following menu:

Applications > Accessoires > terminal

Improving the terminal display

Before going any further...

The first time you open a terminal, it may be rather difficult to read the information displayed in the window. To improve readability we will configure the terminal by modifying the .bashrc file located in your home directory. This file can contain various information related to the behavior of the commands.

First, in order to make a backup, we will copy (cp) the .bashrc file (hiden file) that is located in your home directory (~) to .bashrc_back (hiden file) in your home directory.

Type the following command in the terminal.

NB: the # character indicates the beginning of a comment. The text following the comment char is ignored by the system. In the following examples, the actual commands are followed by comments to explain their usage.

  ls -l ~/.bash* 					# Check that the file exists
  cp ~/.bashrc ~/.bashrc_backup		# Creating a backup. In case...

At this step, the list should contain a .bash_rc file (if not, please ask the instructors).

Use the following command which tells the system to start gedit, a program that can be used to edit the content of the .bashrc file.

  gedit ~/.bashrc

Copy and paste the following lines at the end of your your .bashrc file.

Tips

################################################################
## Personal configuration for the bash prompt

# We declare a new command 'll'. Equivalent to 'ls -l'
alias ll="ls -l" 	
  
# When using the 'ls' command, file and directory names will be colored.
alias ls="ls --color" 
  
# If the rm command is used the system will ask before... (but don't use it please)
alias rm="rm -i"	
  
# Changing the prompt display (don't worry it this line seems obscure to you)
export PS1="\[\033[01;34m\]\h\[\033[00m\]\[\033[01;32m\] \[\033[01;32m\]\u \[\033[00;33m\]\w\n\[\033[01;30m\e[0m\e[1;00m\]$ "

Save the file, close the terminal and open a new one (alternatively, you can stay in the existing terminal, and run the command source ~/.bashrc).

Your terminal is now ready.


Creating a result directory

In your home directory, we will create a sub-directory called TD01_Bioinfo to store the results of the practicals. This will be used to store the datasets used in this practical session.

First, use the cd command to go into your home directory. Then use the pwd (print working directory) command that can be used to display the path of the current working directory (the one where you are currently located).

  cd ~ ## or cd
  pwd
View solution| Hide solution

Retrieving RefSeq transcript coordinates

We will retrieve information on known human transcripts. This dataset is provided by the Reference Sequence Database (RefSeq) and available for download through the UCSC server.

The downloaded file should contain various information about these transcripts.

Retrieving RefSeq annotations

  1. Using your Web Browser, go to the UCSC web site.
  2. In the middle of the left menu, select Downloads.
  3. Select the Human genome.
  4. In the "Dec. 2013 (hg38, GRCh38)" section, select "Annotation database". This will connect you to the UCSC FTP web site. After a rather long explanatory text, you will find an even longer list of available files.
  5. Find the file refGene.txt.gz. Right-click on the link and select "Copy link adress" to get the URL.
  6. In a terminal check that you are located in the hg38_transcripts directory.
  7. Still in the terminal use the wget command followed by the URL to retrieve the refGene.txt.gz file.
  8. Use the ls command to ensure that a file was created. You should see one file: refGene.txt.gz.
  9. Get some help on the gunzip command. What is it for ?
  10. Uncompress the file using the gunzip command.
  11. Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
  12. Rename (mv) the refGene.txt file to refGene_hg38.txt.
  13. Use the ls command with -l argument to check that the file was rename. What are the permissions on refGene_hg38.txt file?
View solution| Hide solution

What does the RefGene_hg38.txt file contain?

Getting a description of each column

The file downloaded from UCSC (refGene_hg38.txt) does not contain any column name. An extensive description of these columns can be obtained through the Table Browser from the UCSC server.

Each transcript is associated to a set of annotations. For instance, the file will contain information about the chromosome and coordinates of the transcribed region (start and end positions), the strand (plus or minus), the number of exons...

  1. Using your internet browser, go to the UCSC web site
  2. Select Tables in the top menu
  3. Select Clade: Mammal, Genome: Human, assembly: "Dec. 2013 (hg38, GRCh38)", group: Genes and Gene Prediction tracks, table: refGene. Click on describe table schema
  4. Each line of the table describes a given column of the refGene file.
  5. Leave this window open so that you can go back to it if needed.

Inspecting the file content

First lines, last lines...

  1. Count the number of lines of the file refGene_hg38.txt.
  2. Use the head command with -n arguments to see the first 5 lines of the file refGene_hg38.txt.
  3. Use the tail command with -n arguments to see the last 5 lines of the file refGene_hg38.txt.
  4. Have a look at the whole file using less (use up/down arrow keys and type "q" to quit))
  5. Use the head command then the tail command (use a pipe) to select the 10th line of the file refGene_hg38.txt.
  6. Extract the first line using head and use a pipe to redirect the output to the od command with -c argument. This will tell you the kind of character that are enclosed in the file. Based on the first line, does it seems to be a tabulated file ?
View solution| Hide solution

Some statistics on transcript and genes

Statistics on genes

Our knowledge of the human genome is evolving. New genes are still being discovered. In the file refGene_hg38.txt, each line corresponds to a transcript whose name/identifier is provided in the 2nd column. Note that a gene can correspond to one or more transcript. Indeed, some genes have several alternative promoters, or can be spliced differentially. The number of lines we counted in the previous exercise indicates the number of transcripts identified in the version 38 of the human genome (hg38).

The symbol/identifier of the gene associated to this transcript is annotated in the 13th column of the RefGenes file. What about our current knowledge of human transcripts and genes ?

  1. What is the number of distinct genes in the version 38 of the human genome ? The result can be found by computing the number of non redondant gene symbols in the 13th column.

Use the commands sort and uniq.

View solution| Hide solution

Statistics on transcripts

  1. In a previous exercise, we used the wc command with the -l argument to count the number of lines (and thus the number of known transcripts in human) of the file refGene_hg38.txt.
  2. Among them, how many coding transcripts have been identified in this version of the genome ? Using cut, grep and wc -l, select the 2nd column, select the lines starting with NM_ and count them (use pipe characters to concatenate the commands).
  3. With almost the same command line, count the number of known non-coding transcripts.
View solution| Hide solution

Density of genes per chromosome

Chromosome sizes

Before looking at the density of genes per chromosome we will have a look at chromosome sizes. This information can be retrieved from the UCSC server (file chromInfo.txt.gz).

  1. Using your internet browser, go to the UCSC web site
  2. In the middle of the left menu select Downloads.
  3. Select the Human genome.
  4. In the "Dec. 2013 (hg38, GRCh38)" section select "Annotation database" to enter the UCSC ftp web site.
  5. Find the file chromInfo.txt.gz. Right-click on the link and select "Copy link adress" to get the URL.
  6. In a terminal check that you are located in the hg38_transcripts directory.
  7. Still in the terminal use the wget command followed by the URL to retrieve the chromInfo.txt.gz file.
  8. Use the ls command to ensure that a file was created. You should see the file chromInfo.txt.gz.
  9. Uncompress the file using the gunzip command.
  10. Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
  11. Rename (mv) the chromInfo.txt file to chromInfo_hg38.txt
  12. Using grep select lines containing informations regarding 'regular' chromosomes (chr1 to chr22, chrY and chrX) then use sort with -r, -n, and -k arguments to get the size of chromosomes in descending order. Use the manual (man) to get more information about the -n and -r and -k arguments of sort. Store the results in a file named chromInfo_hg38_regular.txt (use redirection operator >).
  13. Inspect file content using the command less.
  • Do you see a relationship between chromosome names and sizes?
  • View solution| Hide solution

    Number of genes per chromosome

    1. What is the number of genes per chromosome ? Take only 'regular' chromosomes into account.
    2. Given this result and the size of chromosomes, what can you say ?
    View solution| Hide solution

    Some statistics on gene structures

    Number of exons

    1. What is the gene that contains the maximum number of exons ?
    2. Using your Web browser, go to the UCSC web site. In the top menu, select Genomes. Select group: Mammal, genome: Human assembly:hg38. In the search term area paste the name of the gene. Select the first transcript of the list to visualize its genomic structure.
    View solution| Hide solution

    Writing a script

    A Shell/Bash script can be use to automate any task that can be written programmatically. The first step is to open a file (with ".sh" extension) and write a shebang as first line to indicate the program that must be used to interpret the code. The program can then be executed by asking the bash command to interpret the code.
            #!/bin/bash
    
            echo "I am you first bash script"
            echo "Do you like it ?"
        
    The script can be interpreted using:
            bash my_script_name.sh
            
    1. For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.
    View solution| Hide solution

    Providing arguments to the script.

    Your script can take a set of arguments. In unix: $0 is the command name, $1 is the first argument, $2 the second, $3 the third...
            #!/bin/bash
    
            echo "First arg: "$0
            echo "second arg: "$1
            echo "Third arg: "$2
            echo "..."
        
    The script can be interpreted using:
            bash my_script_name.sh toto blabla
            
    Now the arguments can be stored in any variables and used later in the code:
            #!/bin/bash
    
            GENOME=$0
            echo "You are working with version ${GENOME} of the genome"
    
            
    Note that any variable should be called using the "$" prefix or using "$" and brace (e.g. ${GENOME}. Note also that the variable names are in uppercase here. This is just a convention when variable are constants.
    1. For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.
    2. The link for hg38 is http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
    3. It could thus be written as http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME}/database/refGene.txt.gz
    View solution| Hide solution

    Looping

    You may loop through a set of values using shell.
                cd /tmp
                for i in 4 5 6; do echo $i; done
                for i in "hg38 mm10 mm9"; do echo $i; done
                for ((i=1; i <= 20; i++)); do touch $i.txt;done
                ls
                cd -
        
    1. Write a loop to compute the number of genes in hg38, mm10 and rn6.
    View solution| Hide solution