AMU :: Polytech Biotech 3A :: JGB53D :: 2014/2015 :: Jacques van Helden, Denis Puthier, Nicolas Terrapon
Discovering the human genome with UNIX - Basic statistics on genes and transcripts.
Contents
Introduction
Bioinformatics is highly dependent on Linux-based computers and
software. First, with the advent of new technologies
(e.g. sequencing), analysis tend to rely more and more on
powerful computing clusters. Second, Linux/UNIX systems come
with pre-installed versions of most popular programing languages
and thus tend to become the natural platform for newly developed
software(some programs are exclusively developed for Linux-based
systems). Thus, mastering the UNIX/Linux environment and
interface is a critical step required to perform lots of
bioinformatics analysis.
In the present session we will aim at computing some
basic statistics on the human genome (number of
transcripts, number of genes, number of genes per chromosomes,
number of exons,...)
To this aim we will use information produced by The Reference
Sequence Database (RefSeq). This
project provides a comprehensive but not redundant list of known
transcripts. Refseq transcript coordinates will be obtained
from UCSC
genome browser.
Connection
Starting a session.
Connect to the computer using your login and
password. Remember that the UNIX is case-sensitive. When you
type your login and password (and more generally), be sure to
use upper and lower case characters accurately.
Launching a terminal
Start a terminal using the following menu:
Applications > Accessoires > terminal
Improving the terminal display
Before going any further...
The first time you open a terminal, it may be rather difficult
to read the information displayed in the window. To improve
readability we will configure the terminal by modifying
the .bashrc file located in your home directory. This
file can contain various information related to the behavior of
the commands.
First, in order to make a backup,
we will copy (cp)
the .bashrc file (hiden file) that is located in your
home directory (~) to .bashrc_back (hiden file) in
your home directory.
Type the following command in the terminal.
NB: the # character indicates the
beginning of a comment. The text following the comment char is
ignored by the system. In the following examples, the actual
commands are followed by comments to explain their usage.
ls -l ~/.bash* # Check that the file exists
cp ~/.bashrc ~/.bashrc_backup # Creating a backup. In case...
At this step, the list should contain
a .bash_rc file (if not, please ask
the instructors).
Use the following command which tells the system to
start gedit, a program that can be
used to edit the content of
the .bashrc file.
gedit ~/.bashrc
Copy and paste the following lines at the end of your
your .bashrc file.
Tips
- Add your settings at the end of the file, and do not
delete the other lines.
- The gedit text editor
supports synctactic coloring, which is particularly useful for
viewing and editing programming files. To highlight the
syntactic elements of the bash file, activate the following
option in gedit menu.
View -> Highlight mode -> Scripts -> sh
- Do not worry if at this stage of the course, the following
lines are not understandable for you. We perform this
operation to improve the comfort for our future work, but
these commands are already somewhat advanced seddings.
################################################################
## Personal configuration for the bash prompt
# We declare a new command 'll'. Equivalent to 'ls -l'
alias ll="ls -l"
# When using the 'ls' command, file and directory names will be colored.
alias ls="ls --color"
# If the rm command is used the system will ask before... (but don't use it please)
alias rm="rm -i"
# Changing the prompt display (don't worry it this line seems obscure to you)
export PS1="\[\033[01;34m\]\h\[\033[00m\]\[\033[01;32m\] \[\033[01;32m\]\u \[\033[00;33m\]\w\n\[\033[01;30m\e[0m\e[1;00m\]$ "
Save the file, close the terminal and open a new one
(alternatively, you can stay in the existing terminal, and run
the command source ~/.bashrc).
Your terminal is now ready.
Creating a result directory
In your home directory, we will create a sub-directory
called TD01_Bioinfo to store the
results of the practicals. This will be used to store the
datasets used in this practical session.
First, use the cd command to go
into your home directory. Then use the pwd
(print working directory) command that can
be used to display the path of the current working directory
(the one where you are currently located).
cd ~ ## or cd
pwd
- In Unix, the character "~" (tilde) denotes your home
directory, by definition.
- When the command cd is used
without arguments, it automatically goes to your home
directory.
- Now use the mkdir command to create a directory named TD01_Bioinfo.
- Check that the directory has been created using the ls command.
- Enter into the TD01_Bioinfo directory using the cd command.
- Use the mkdir command to create a sub-directory named hg38_transcripts.
- Enter into the hg38_transcripts directory using the cd command.
- Check the current value of your working directory.
View solution|
Hide solution
Solution
mkdir TD01_Bioinfo
ls
cd TD01_Bioinfo
pwd
mkdir hg38_transcripts
cd hg38_transcripts
pwd
The result should be something like this:
/home/[your_login]/TD01_Bioinfo/hg38_transcripts
Retrieving RefSeq transcript coordinates
We will retrieve information on known human transcripts. This
dataset is provided by the Reference Sequence Database (RefSeq)
and available for download through the UCSC server.
The downloaded file should contain various information about these transcripts.
Retrieving RefSeq annotations
- Using your Web Browser, go to the UCSC web site.
- In the middle of the left menu, select Downloads.
- Select the Human genome.
- In the "Dec. 2013 (hg38, GRCh38)" section, select
"Annotation database". This will connect you to the UCSC FTP web
site. After a rather long explanatory text, you will find an
even longer list of available files.
- Find the file refGene.txt.gz. Right-click on the link
and select "Copy link adress" to get the URL.
- In a terminal check that you are located in the hg38_transcripts directory.
- Still in the terminal use the wget command followed by the URL to retrieve the refGene.txt.gz file.
- Use the ls command to ensure that a file was created. You should see one file: refGene.txt.gz.
- Get some help on the gunzip command. What is it for ?
- Uncompress the file using the gunzip command.
- Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
- Rename (mv) the refGene.txt file to refGene_hg38.txt.
- Use the ls command with -l argument to check that the file was rename. What are the permissions on refGene_hg38.txt file?
View solution|
Hide solution
Solution
pwd # check location
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz # download
ls # get the list of files
man gunzip # q to quit
gunzip refGene.txt.gz # uncompress
ls # no .gz extension in file name
mv refGene.txt refGene_hg38.txt # rename file to refGene_hg38.txt
ls # get the list of files
What does the RefGene_hg38.txt file contain?
Getting a description of each column
The file downloaded from UCSC (refGene_hg38.txt) does not
contain any column name. An extensive description of these columns
can be obtained through the Table Browser from the UCSC
server.
Each transcript is associated to a set of annotations. For
instance, the file will contain information about the
chromosome and coordinates of the transcribed region (start
and end positions), the strand (plus or minus), the number of
exons...
- Using your internet browser, go to the UCSC web site
- Select Tables in the top menu
- Select Clade: Mammal, Genome: Human, assembly: "Dec. 2013 (hg38, GRCh38)",
group: Genes and Gene Prediction tracks, table: refGene. Click on describe table schema
- Each line of the table describes a given column of the
refGene file.
- Leave this window open so that you can go back to it if
needed.
Inspecting the file content
First lines, last lines...
- Count the number of lines of the file refGene_hg38.txt.
- Use the head command with -n arguments to see the first 5 lines of the file refGene_hg38.txt.
- Use the tail command with -n arguments to see the last 5 lines of the file refGene_hg38.txt.
- Have a look at the whole file using less (use up/down arrow keys and type "q" to quit))
- Use the head command then the tail command (use a pipe) to select the 10th line of the file refGene_hg38.txt.
- Extract the first line using head and use a pipe to redirect the output to the od command with -c argument. This will tell you the kind of character that are enclosed in the file. Based on the first line, does it seems to be a tabulated file ?
View solution|
Hide solution
Solution
wc -l refGene_hg38.txt
head -n 5 refGene_hg38.txt
tail -n 5 refGene_hg38.txt
less refGene_hg38.txt
head -n 10 refGene_hg38.txt | tail -n 1
head -n 1 refGene_hg38.txt | od -c # columns seem to be separated by tabulations ("\t"). This is a tabulated file.
Some statistics on transcript and genes
Statistics on genes
Our knowledge of the human genome is evolving. New genes are
still being discovered. In the
file refGene_hg38.txt, each line
corresponds to a transcript whose name/identifier is
provided in the 2nd column. Note that a gene can
correspond to one or more transcript. Indeed, some genes have
several alternative promoters, or can be spliced
differentially. The number of lines we counted in the previous
exercise indicates the number of transcripts identified in the
version 38 of the human genome (hg38).
The symbol/identifier of the gene associated to this
transcript is annotated in the 13th column of the
RefGenes file. What about our current knowledge of human
transcripts and genes ?
- What is the number of distinct genes in the version 38 of
the human genome ? The result can be found by computing the
number of non redondant gene symbols in the 13th
column.
Use the
commands sort
and uniq.
View solution|
Hide solution
Solution
cut -f13 refGene_hg38.txt | sort | uniq | wc -l # The number of known genes
Statistics on transcripts
- In a previous exercise, we used the wc command with
the -l argument to count the number of lines (and
thus the number of known transcripts in human) of the
file refGene_hg38.txt.
- Among them, how many coding transcripts have been
identified in this version of the genome ? Using
cut, grep
and wc -l, select the 2nd
column, select the lines starting with NM_ and
count them (use pipe characters to concatenate the
commands).
- With almost the same command line, count the number of
known non-coding transcripts.
View solution|
Hide solution
Solution
wc -l refGene_hg38.txt # How many transcripts
cut -f2 refGene_hg38.txt | grep "^NM_" | wc -l # How many coding transcripts
cut -f2 refGene_hg38.txt | grep "^NR_" | wc -l # How many non-coding transcripts
Density of genes per chromosome
Chromosome sizes
Before looking at the density of genes per chromosome we will
have a look at chromosome sizes. This information can be
retrieved from the UCSC server
(file chromInfo.txt.gz).
- Using your internet browser, go to the UCSC web site
- In the middle of the left menu select Downloads.
- Select the Human genome.
- In the "Dec. 2013 (hg38, GRCh38)" section select "Annotation database" to enter the UCSC ftp web site.
- Find the file chromInfo.txt.gz. Right-click on the link and select "Copy link adress" to get the URL.
- In a terminal check that you are located in the hg38_transcripts directory.
- Still in the terminal use the wget command followed by the URL to retrieve the chromInfo.txt.gz file.
- Use the ls command to ensure that a file was created. You should see the file chromInfo.txt.gz.
- Uncompress the file using the gunzip command.
- Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
- Rename (mv) the chromInfo.txt file to chromInfo_hg38.txt
- Using grep select lines
containing informations regarding 'regular' chromosomes
(chr1 to chr22, chrY and chrX) then
use sort
with -r, -n, and -k arguments to get
the size of chromosomes in descending order. Use the manual
(man) to get more information about the -n
and -r and -k arguments of sort. Store
the results in a file
named chromInfo_hg38_regular.txt (use redirection
operator >).
- Inspect file content using the
command less.
Do you see a relationship between chromosome names and
sizes?
View solution|
Hide solution
Solution
pwd
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromInfo.txt.gz # download chromosome informations
ls # check that the file was downloaded
gunzip chromInfo.txt.gz # uncompress
mv chromInfo.txt.gz chromInfo_hg38.txt # rename
grep -P "chr[0-9XY]+\t" chromInfo_hg38.txt| sort -rnk2 > chromInfo_hg38_regular.txt # Select 'regular' chromosomes and perform a numerical sort (reverse order)
less chromInfo_hg38_regular.txt # Inspect file content
Number of genes per chromosome
- What is the number of genes per chromosome ? Take only 'regular' chromosomes into account.
- Given this result and the size of chromosomes, what can you say ?
View solution|
Hide solution
Solution
cut -f3,13 refGene_hg38.txt | grep -P "chr[0-9XY]+\t" | sort | uniq | cut -f1 | uniq -c | sort -rn # number of genes per chromosome
# Some chromosomes (e.g chr19) tend to be gene-rich
Some statistics on gene structures
Number of exons
- What is the gene that contains the maximum number of exons ?
- Using your Web browser, go to
the UCSC web site. In the
top menu, select Genomes. Select group:
Mammal, genome: Human assembly:hg38. In
the search term area paste the name of the gene. Select
the first transcript of the list to visualize its genomic
structure.
View solution|
Hide solution
Solution
cut -f9,13 refGene_hg38.txt | sort -n | tail -1 # The gene containing the maximum number of exons
Writing a script
A Shell/Bash script can be use to automate any task that can be written programmatically. The first step is to open a file (with ".sh" extension) and write a shebang as first line to indicate the program that must be used to interpret the code. The program can then be executed by asking the bash command to interpret the code.
#!/bin/bash
echo "I am you first bash script"
echo "Do you like it ?"
The script can be interpreted using:
bash my_script_name.sh
- For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.
View solution|
Hide solution
Solution
#It's your turn to find the solution...
Providing arguments to the script.
Your script can take a set of arguments. In unix: $0 is the command name, $1 is the first argument, $2 the second, $3 the third...
#!/bin/bash
echo "First arg: "$0
echo "second arg: "$1
echo "Third arg: "$2
echo "..."
The script can be interpreted using:
bash my_script_name.sh toto blabla
Now the arguments can be stored in any variables and used later in the code:
#!/bin/bash
GENOME=$0
echo "You are working with version ${GENOME} of the genome"
Note that any variable should be called using the "$" prefix or using "$" and brace (e.g. ${GENOME}.
Note also that the variable names are in uppercase here. This is just a convention when variable are constants.
- For instance we can write a code to compute the number of genes in hg38. For that you will need to download the refGene.txt file in an approriate folder (e.g. ~/genome/hg38), uncompress it, compute the number of genes and write it to a file.
- The link for hg38 is http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
- It could thus be written as http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME}/database/refGene.txt.gz
View solution|
Hide solution
Solution
#It's your turn to find the solution...
Looping
You may loop through a set of values using shell.
cd /tmp
for i in 4 5 6; do echo $i; done
for i in "hg38 mm10 mm9"; do echo $i; done
for ((i=1; i <= 20; i++)); do touch $i.txt;done
ls
cd -
- Write a loop to compute the number of genes in hg38, mm10 and rn6.
View solution|
Hide solution
Solution
#It's your turn to find the solution...