Discovering the human genome with UNIX - Handling genomic coordinates with bedtools



Genomic coordinates

Working with genomic data frequently involves dealing with biological sequences. Because these sequences (genes, exons, ...) are located on genomes, handling genomic coordinates is typically required. A basic task in genomic data analysis is to compare different sets of genomic features (transcripts, promoter regions, polymorphisms, conserved elements, ...). One question could be, for instance, "Which SNPs (Single Nucleotide Polymorphisms) are associated with a disease of interest and fall into exonic regions?" Such questions require dedicated tools to ease data analysis.

Bedtools has been developed to compare sets of genomic features and is now becoming a standard Linux tool for people working in the field. Bedtools largely relies on the BED file format (although it may also operate with GFF/GTF, VCF, and SAM/BAM files).

The BED file format is a very simple way to store information related to genomic features. A typical file in BED format will contain the following columns (the first three columns are mandatory):

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2...).
  2. chromStart - The starting position of the feature in the chromosome.
  3. chromEnd - The ending position of the feature in the chromosome.
  4. name - A name for the feature (e.g. gene name...).
  5. score - A score between 0 and 1000.
  6. strand - Defines the strand, either '+' or '-'.

Bed format

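Several conventions exist to describe genome coordinates. The BED file format is said to be "zero-based, half-open": the first base of a chromosome is numbered 0, chromStart is included in the feature, and chromEnd is not.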
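
To make the "zero-based, half-open" convention concrete, here is a minimal sketch on a made-up feature (the file and the feature name are hypothetical). Because the start is counted from 0 and the end is excluded, the length of a feature is simply chromEnd - chromStart, and the equivalent one-based, closed interval is obtained by adding 1 to the start.

```shell
# A made-up feature covering the first 100 bases of chr1 (toy data, not from the practical).
tmp=$(mktemp -d)
printf 'chr1\t0\t100\ttoy_feature\n' > "$tmp/toy.bed"

# Length in a zero-based, half-open system: end - start.
len=$(awk -F'\t' '{print $3 - $2}' "$tmp/toy.bed")

# Same feature in one-based, closed (inclusive) coordinates: start + 1, end unchanged.
one_based=$(awk -F'\t' '{print $1 ":" ($2 + 1) "-" $3}' "$tmp/toy.bed")

echo "length: $len"
echo "one-based, closed: $one_based"
```

The same subtraction is used later in the practical to compute feature lengths.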

Opening a session

Starting a Unix terminal

  1. Connect to the computer using your login and password.
  2. Open a terminal. You should find a terminal (terminal, konsole, terminator,...) in the Application Menu.

Creating a result directory for this practical

  1. Using the mkdir command (make directory), create a working directory named TD02_Bioinfo in your HOME directory.
  2. Use the cd (change directory) command to enter the TD02_Bioinfo directory, and check that you are located in this directory.

The Bedtools suite: "a swiss army knife for genome arithmetic"

Installing the bedtools suite

We will install the latest Bedtools suite. To this end, we will retrieve a small script containing simple Bash commands that perform the installation steps for us.

    The script will automatically:
  1. Use wget to retrieve the source code of the BedTools program.
  2. Uncompress the zip file (unzip). This will create a folder named bedtools2-master.
  3. Run the command mkdir to create a new folder in your home directory (~/soft), where the BedTools software will be stored (and which may be used later to store other software tools).
  4. Move (mv) the directory bedtools2-master to the directory ~/soft.
  5. Change the shell working directory to ~/soft/bedtools2-master (cd).
  6. Compile Bedtools from sources (make).
  7. Add a line to your ~/.bashrc file so that the folder ~/soft/bedtools2-master/bin will be added to the PATH variable (PATH is an environment variable specifying a set of directories where executable programs are located).

The installation script can be downloaded here (right-click on the link and select "Copy link address" to get the URL).

    1. Go to the /tmp directory (cd).
    2. Using the wget command, download the installation script.
    3. Use the ls command with the -l argument to check that the file is present. Have a look at the access permissions.
    4. Check the content of the script with less.
    5. Use the chmod command to give yourself (the User) eXecute permission on the file.
    6. Using the ls command with the -l argument, check the access permissions again.
    7. Execute the script.
    8. Instruct the terminal to reload the ~/.bashrc file (source).
    9. Go back to the ~/TD02_Bioinfo directory (cd).
    10. Type bedtools -h
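
The permission-related steps above can be rehearsed on a stand-in script before touching the real one. In this sketch, demo_install.sh is a hypothetical placeholder created locally (the URL of the actual installation script is not reproduced here), so nothing is downloaded or installed.

```shell
tmp=$(mktemp -d)
script="$tmp/demo_install.sh"        # hypothetical stand-in for the downloaded script

# Create a tiny script that only prints a message.
printf '#!/bin/sh\necho "installation steps would run here"\n' > "$script"

ls -l "$script"                      # inspect permissions: no execute bit yet
chmod u+x "$script"                  # give the User (owner) eXecute permission
ls -l "$script"                      # the owner triplet now contains an x

out=$("$script")                     # execute the script and capture its output
echo "$out"
```

For the real script, the same chmod and execution steps apply, followed by source ~/.bashrc so that the updated PATH is taken into account.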

    Which fraction of the human genome is covered by exons?

    In the section below we will try to answer the following question: "Which fraction of the genome is covered by exons?".

    Downloading genomic coordinates of exons.

    First, we will retrieve feature coordinates (exons) from the UCSC server.

    1. Using your Web browser, go to the UCSC web site.
    2. Select Tables in the top menu.
    3. Select the following parameters:
      • Clade: Mammal,
      • Genome: Human,
      • assembly: "Dec. 2013 (hg38, GRCh38)",
      • group: Genes and Gene Prediction tracks,
      • track: RefSeq Genes,
      • table: refGene,
      • region: genome,
      • Output format: BED.
    4. Set output file to "RefGene_hg38_exons.bed" and click on Get output.
    5. In the next window, select Exons (plus 0 bases at each end). Leave all other options unchanged and click the get BED button.
        Note that with the same protocol we could also select the coordinates of all transcripts or of intronic regions.
    1. Use the mv command to move the file RefGene_hg38_exons.bed from ~/Téléchargements (or ~/Downloads) to your result directory TD02_Bioinfo.
    2. Look at the first 6 lines of the file. Does it look like BED format?
    3. Is it a tab-delimited file?
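
One possible way to check the format, shown here on three made-up lines rather than on the downloaded file (names and coordinates are hypothetical): count the tab-separated fields per line with awk. A BED file has at least 3 fields on every line, whereas a file that is not tab-delimited would show a single field per line.

```shell
tmp=$(mktemp -d)
# Three made-up BED6 lines (hypothetical names and coordinates).
printf 'chr1\t1000\t1200\ttoy_exon_0\t0\t+\n'  > "$tmp/toy_exons.bed"
printf 'chr1\t1500\t1650\ttoy_exon_1\t0\t+\n' >> "$tmp/toy_exons.bed"
printf 'chr1\t2000\t2400\ttoy_exon_2\t0\t+\n' >> "$tmp/toy_exons.bed"

# Look at the first lines.
head -n 6 "$tmp/toy_exons.bed"

# Distribution of the number of tab-separated fields per line.
fields=$(awk -F'\t' '{print NF}' "$tmp/toy_exons.bed" | sort -n | uniq -c \
         | awk '{print $2 " fields on " $1 " lines"}')
echo "$fields"
```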

    Merging overlapping regions

    To answer our simple question, we must first keep in mind that several exons may overlap, due to various phenomena (alternative splicing, multiple promoters or terminators, mutually overlapping genes). To avoid counting the same region several times, our first task will thus be to merge these overlapping regions. The mergeBed command from the Bedtools suite combines overlapping features in an interval file into a single feature which spans all of the combined features. The image below illustrates this.

    Beware: mergeBed requires the genomic coordinates to be sorted (see below).
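
The merging idea can be illustrated on a handful of made-up intervals using only sort and awk. This is a sketch of the logic, not of how mergeBed is implemented; with bedtools installed, the corresponding calls would be sortBed -i and mergeBed -i, shown here as comments. Like mergeBed with its default settings, the sketch also joins features that touch end to start.

```shell
tmp=$(mktemp -d)
# Toy intervals (chrom, start, end), deliberately unsorted.
printf 'chr1\t100\t200\nchr1\t150\t300\nchr2\t50\t80\nchr1\t400\t500\nchr1\t300\t350\n' > "$tmp/toy.bed"

# Sort by chromosome, then numerically by start.
#   bedtools equivalent: sortBed -i toy.bed > toy_sorted.bed
sort -k1,1 -k2,2n "$tmp/toy.bed" > "$tmp/toy_sorted.bed"

# Single pass over the sorted file: extend the current interval while the next
# feature starts at or before its end, otherwise print it and start a new one.
#   bedtools equivalent: mergeBed -i toy_sorted.bed > toy_merged.bed
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { c = $1; s = $2 + 0; e = $3 + 0; next }
     $1 == c && $2 + 0 <= e { if ($3 + 0 > e) e = $3 + 0; next }
     { print c, s, e; c = $1; s = $2 + 0; e = $3 + 0 }
     END { if (NR > 0) print c, s, e }' "$tmp/toy_sorted.bed" > "$tmp/toy_merged.bed"

cat "$tmp/toy_merged.bed"
```

The five toy features collapse into three: the first three chr1 intervals form one region, while the remaining chr1 interval and the chr2 interval stay separate. Without the sorting step, the single-pass logic would miss overlaps between non-adjacent lines.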

    We will first discard genes located on "non-regular" chromosomes. For this, we consider as "regular" the chromosome names starting with "chr" followed by one or more digits (chr1, chr17, ...) or by the letter "X" or "Y" (chrX, chrY). We will first select the features from the BED file that match these regular names, and then count among them (i) the total number of exons and (ii) the number of exons per chromosome.

    grep -P "^chr[0-9XY]+\t" RefGene_hg38_exons.bed >  RefGene_hg38_exons_reg.bed    # keep only 'regular' chromosomes
    wc -l RefGene_hg38_exons_reg.bed                                                 # total number of exons
    cut -f1 RefGene_hg38_exons_reg.bed | sort | uniq -c | sort -rn                   # number of exons per chromosome
    1. Get some help about the sortBed command using the -h (help) argument.
    2. Use the sortBed command to sort exons by coordinates and store the results in RefGene_hg38_exons_reg_sort.bed.
    3. Get some help about the mergeBed command using the -h (help) argument.
    4. Use mergeBed with RefGene_hg38_exons_reg_sort.bed as input to combine overlapping exons into single features and store the results into a file named mergedExons.bed.
    5. Have a look at the first lines with the head command.
    6. Count the number of lines in all *.bed files using wc. Is the result as expected?
    7. The length of one genomic feature can simply be obtained by computing column_3-column_2. Use an awk command to compute the sum of the lengths of all features.
    8. Compute the total length of the genome using the file ~/TD01_Bioinfo/hg38_transcripts/chromInfo.txt (see TD01). As an alternative, download the file here using wget.
    9. Now, what is the fraction of the genome that is covered by exons or genes?
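
The arithmetic of the last three steps can be sketched on toy files. Here merged.bed and sizes.txt are hypothetical stand-ins for mergedExons.bed and chromInfo.txt, and the sketch assumes that the chromosome size sits in the second column of the size file. Sums are printed with "%.0f" because, on real genome-scale totals, some awk implementations would otherwise switch to scientific notation or truncate large integers.

```shell
tmp=$(mktemp -d)
# Toy merged features and toy chromosome sizes (made-up numbers).
printf 'chr1\t100\t350\nchr1\t400\t500\nchr2\t50\t80\n' > "$tmp/merged.bed"
printf 'chr1\t1000\nchr2\t500\n' > "$tmp/sizes.txt"

# Sum of feature lengths: column 3 minus column 2, accumulated over all lines.
covered=$(awk -F'\t' '{ sum += $3 - $2 } END { printf "%.0f\n", sum }' "$tmp/merged.bed")

# Total genome length: sum of the size column.
genome=$(awk -F'\t' '{ sum += $2 } END { printf "%.0f\n", sum }' "$tmp/sizes.txt")

# Fraction of the genome covered by the features.
fraction=$(awk -v c="$covered" -v g="$genome" 'BEGIN { printf "%.4f\n", c / g }')

echo "covered=$covered genome=$genome fraction=$fraction"
```

On the real data, the chromosome size file should be restricted to the same "regular" chromosomes as the exon file before summing, otherwise the denominator includes sequences that were excluded from the numerator.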

    Genomic locations of SNPs associated with prostate cancer

    Genome-Wide Association Studies (GWAS) are used in epidemiology to search for common genetic variants associated with a given disease. GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like diseases. Prostate cancer (PrCa) is the most frequently diagnosed cancer in males in developed countries. To identify common PrCa susceptibility alleles, Eeles RA et al. conducted a GWAS whose results are available through GWAS Central (study HGVST512). The top 50 associations (dataset: HGVRS986) were retrieved from GWAS Central and converted to BED format (build hg38).

    Which of these SNPs fall into exonic regions?

    The intersectBed program can be used to answer such questions.

    1. Get some help about the intersectBed program (argument -h).
    2. Download the SNP list in BED format here.
    3. Use the intersectBed command with -a, -b, -wa and -wb arguments to find SNPs falling into exonic regions.
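
To show what an intersection with -wa and -wb reports, here is an awk-only sketch on toy files (the SNP and exon names and all coordinates are made up). It is an illustration of the overlap test, not of intersectBed's implementation: in half-open coordinates, two features on the same chromosome overlap when each one starts before the other ends. The corresponding bedtools call is given as a comment.

```shell
tmp=$(mktemp -d)
# Toy inputs (hypothetical SNP and exon names, made-up coordinates).
printf 'chr1\t120\t121\trsToyA\nchr1\t360\t361\trsToyB\nchr2\t79\t80\trsToyC\n' > "$tmp/snps.bed"
printf 'chr1\t100\t350\texonA\nchr1\t400\t500\texonB\nchr2\t50\t80\texonC\n' > "$tmp/exons.bed"

# With bedtools installed, the corresponding call would be:
#   intersectBed -a snps.bed -b exons.bed -wa -wb

# Load the exons (first file) in memory, then test every SNP (second file)
# against every exon and print the SNP line followed by the exon line.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { n++; bchr[n] = $1; bstart[n] = $2 + 0; bend[n] = $3 + 0; bline[n] = $0; next }
     { for (i = 1; i <= n; i++)
           if ($1 == bchr[i] && $2 + 0 < bend[i] && bstart[i] < $3 + 0)
               print $0, bline[i] }' "$tmp/exons.bed" "$tmp/snps.bed" > "$tmp/hits.txt"

cat "$tmp/hits.txt"
```

rsToyA and rsToyC are reported together with the exon they fall into; rsToyB lies between two exons and is absent from the output. The all-against-all loop is fine for a toy example but far too slow for genome-scale files, which is one reason to use a dedicated tool.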

    Which of these SNPs fall into intronic regions?

    1. Download intronic regions in BED format here.
    2. Use the intersectBed command with -a, -b, -wa and -wb arguments to find SNPs falling into intronic regions.

    Which of these SNPs fall into promoter regions?

    As you can see, many of these SNPs are located in intergenic regions (i.e. outside known genes). One additional question is whether some of them fall into promoter regions. As promoter regions are difficult to define without additional information (e.g. epigenetic marks), we will define them here as the regions ranging from the transcription start site (TSS) to 500 bp upstream of the TSS. To answer this question, we need to extract those regions.

    1. Download the coordinates of the whole transcripts here (note that you can get it also from the table browser).
    2. Use the following awk one-liner to extract promoter region coordinates:
          gunzip RefGene_hg38_wg_reg.bed.gz
          cut -f6 RefGene_hg38_wg_reg.bed | sort | uniq -c   # ensure that all transcript strands are defined
          # The strand is in column 6: on '+' the TSS is chromStart, on '-' it is chromEnd.
          awk 'BEGIN{FS=OFS="\t"}{if($6=="+"){print $1,$2-500,$2,$4,$5,$6}else{print $1,$3,$3+500,$4,$5,$6}}'  RefGene_hg38_wg_reg.bed > RefGene_hg38_prom_reg.bed # get the promoter regions
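
As a quick sanity check of the strand logic, the same kind of awk rule can be run on two made-up transcripts, one per strand (names and coordinates are hypothetical). The strand is read from column 6: for a '+' transcript the promoter lies just before chromStart, for a '-' transcript just after chromEnd.

```shell
tmp=$(mktemp -d)
# Two made-up transcripts in BED6 format, one on each strand.
printf 'chr1\t1000\t5000\ttoy_tx_plus\t0\t+\n'   > "$tmp/toy_tx.bed"
printf 'chr1\t8000\t9000\ttoy_tx_minus\t0\t-\n' >> "$tmp/toy_tx.bed"

# 500 bp upstream of the TSS, taking the strand (column 6) into account.
awk 'BEGIN { FS = OFS = "\t" }
     { if ($6 == "+") print $1, $2 - 500, $2, $4, $5, $6
       else           print $1, $3, $3 + 500, $4, $5, $6 }' "$tmp/toy_tx.bed" > "$tmp/toy_prom.bed"

cat "$tmp/toy_prom.bed"

# Extract the two promoter intervals for inspection.
plus=$(awk -F'\t' '$6 == "+" { print $2 "-" $3 }' "$tmp/toy_prom.bed")
minus=$(awk -F'\t' '$6 == "-" { print $2 "-" $3 }' "$tmp/toy_prom.bed")
echo "plus strand promoter: $plus, minus strand promoter: $minus"
```

Note that a transcript starting less than 500 bp from the beginning of a chromosome would yield a negative start with this rule; such cases would need to be clamped to 0 before passing the file to bedtools.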

    One remaining problem is that the promoter region of one transcript may overlap with exonic regions of another transcript, so we cannot strictly declare such regions regulatory. We can therefore discard these overlapping regions. This can be done with the subtractBed command from the Bedtools suite.

    1. Use the subtractBed command to remove any promoter region overlapping exons (create a file named RefGene_hg38_prom_reg_noExons.bed).
      • To ensure that this step was effective, go to UCSC.
        1. From the top menu, select Genomes.
        2. Select group: Mammal, genome: Human, assembly: "Dec. 2013 (hg38, GRCh38)".
        3. Click on manage custom tracks > add custom track > Choose File (browse to RefGene_hg38_prom_reg_noExons.bed) and click Submit.
        4. Click on go to genome browser. Enter position chr1:201504430-201512040 (for instance) to check the result.
    2. Use intersectBed to find SNPs overlapping promoter regions.
    3. Get information relative to NM_138634 and its associated gene here. Is there any link with PrCa?