Abbreviations

Abbreviation Description
ALL Acute Lymphoblastic Leukaemia
A/M/P Absent / Marginal / Present calls

Goal of this practical

The goal of this practical is to understand the principles of Student t-test by exploring step-by-step the different components of the \(t_{obs}\) statistics. To this purpose, we will apply a test to detect differentially expressed genes (DEG) between two cancer types in microarray transcriptomic profiles.


Study case

As study case, we will analyse microarray transcriptomic profiles from (M. L. Den Boer et al. 2009).

In this experiment, the authors characterized the transcriptome of 190 patients suffering from Acute Lymphoblastic Leukaemia (ALL). Lymphoblastic leukaemia is characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation. The abnormal proliferation can result from various mutations, some of which are recurring in various patients. The DenBoer dataset includes information about the genetic problem associated with each leukaemia.

In their original article, Den Boer and co-workers defined a transcriptomic signature, i.e. a selection of genes whose transcription profile is specifically affected in one or more subtypes of leukaemia. The expression profiles of these genes can then be used to classify new samples from acute lymphoblastic leukaemia (ALL), and predict the type of associated genetic perturbation.

An advantage of this dataset is that we dispose of relatively large samples (>30 individuals per group), relative to many other publications of transcriptome analysis.

Pre-requesite: installing Bioconductor

We will need to use some Bioconductor libraries in the subsequent steps (e.g the qvalue library). Bioconductor is a project offering open source libraries for R environnement dedicated to analysis of high-throughput genomic data. If Bioconductor is not already installed on your computer, copy and paste the following instructions in R:

## Install specifically the qvalue library
if (!require("qvalue")) {
  ## Install base libraries for Bioconductor
  ## If you are asked to create a personnal library answer "YES".
  source("https://bioconductor.org/biocLite.R")
  biocLite()
  biocLite("qvalue")
}

Pre-processing of the data

The full dataset is available in the the Gene Expression Omnibus (GEO) database, as a series with ID GSE13425. However we do not need to download the raw dataset. Instead, we will provide on this web site the normalized matrix expression profiles in the form of tab-delimited files.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform are available on GEO website under identifier GPL96. More information related to Affymetrix microarrays normalization can be obtained in the section Normalization of Affymetrix DNA chip (slides) and Handling and normalizing affymetrix data with bioconductor (practical)).


Tutorial: loading data tables into R

Summary

  • Start R.

  • Have a look at the description of the read.table() function: in the R console, type help(read.table) and read the help page.

  • In the next sections, will load three data tables into R using the R function read.table(). This function allows us to directly read tab-delimited text files located either on our own computer or on a remote server (using an http URL). We will successively load 3 files providing complementary information.

    • the expression matrix (GSE13425_Norm_Whole.txt)
      • Contains genes as rows and samples as columns.
      • Data were previously normalized using RMA algorithm (they are thus transformed in logarithm base 2).
    • the A/P/M matrix (GSE13425_AMP_Whole.txt)
      • Indicates whether a gene was called Absent, Present or Marginal.
    • Phenotypic data (GSE13425_phenoData.txt)
      • The GSE13425_phenoData.txt file contains phenotypic information about samples.

Creating an output directory on your own computer

We will define and create a directory to store the results on our computer.

## Define the output directory. You can adapt this to your local configuration.
dir.output <- "~/STAT2_practicals/GSE13425"

## Create the output directory (if it does not exist yet)
dir.create(dir.output, showWarnings=FALSE, recurs=TRUE)

## Change directory to dir.output
setwd(dir.output)

## Print a message with output directory
message("Output directory for this practical: ", dir.output)

## Check it this directory already contains some files.
list.files(dir.output)
[1] "data"

Definition of the course URL and data path

We will load the data from this course supporting Web site. For this we need to define two variables.

  1. The general URL of this course, which we store in a variable url.course.
  2. The path to the microarray datasets, which is defined by appending the folder (data/marrays) to the URL of the course. For this, we use the R function file.path(), which automatically concatenates the pieces of a file or folder path.
## Define the URL of the example data sets
url.course <- "http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
microarray.data.url <- file.path(url.course, "data", "marrays")

## Check the results
print(paste("Course URL: ", url.course))
[1] "Course URL:  http://pedagogix-tagc.univ-mrs.fr/courses/ASG1"
print(paste("URL for the microarray data: ", microarray.data.url))
[1] "URL for the microarray data:  http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/data/marrays"

Loading the expression matrix

We will now load the expression matrix, which is stored in a file named GSE13425_Norm_Whole.txt in the microarray data folder of the server. This file contains a table with one row per probeset (~gene) and one column per sample. Each sample was taken from a patient suffering from a particular type of Acute Lymphoblastic Leukemia (ALL).

We will once use the following functions:

  • file.path() to build the full URL to the expression table
  • download.file() to download the file from the URL and obtain a local copy. This will avoid us to download the whole file each time we restart the analysis.
  • read.table() to read the table from this URL. Note that this functions allows you to load a table either from a local folder (on your computer), or from any Web site.
## Get some help about the read.table fonction
# help(read.table)

## URL of the expression table on the course server
expr.url <- file.path(microarray.data.url, "GSE13425_Norm_Whole.txt") ## Build the full URL of the matrix

## Load expression values
dir.data <- file.path(dir.output, "data")
dir.create(dir.data, recursive = TRUE, showWarnings = FALSE)
expr.file <- file.path(dir.data, "DenBoer_2009_GSE13425_Norm_Whole.txt")

if (file.exists(expr.file)) {
  message("Expression file already downloaded: ", expr.file)
} else {
  message("Downloading matrix table from ", expr.url)
  download.file(url = expr.url, destfile = expr.file)
  message("Local expression file: ", expr.file)
}

## Load the expression table in memory
message("Loading expression table from ", expr.file)
expr.matrix <-  read.table(expr.file,sep="\t", head=T, row=1)

Questions

After having loaded the expression matrix with the R function read.table(), we should have a variable of type data.frame containing one row per gene and one column per sample. Before going further, we will perform some basic tests to check the content of this data frame.

  1. Count the number of genes and samples in the expression table.
  2. Display the first 10 rows and 5 columns of the matrix
  3. Draw an histogram with the values of the \(26^{th}\) column (sample) of the expression matrix.

Solution

## Check the type of the variable "expr.matrix" (should return data.frame)
class(expr.matrix)
[1] "data.frame"
## Get the dimensions of the expresion table
dim(expr.matrix)
[1] 22283   190
## Number of probesets (rows)
nrow(expr.matrix)
[1] 22283
## Number of samples (columns)
ncol(expr.matrix)
[1] 190
## Display the first 10 rows and 5 columns of the matrix
print(expr.matrix[1:10,1:5])
                 GSM338666 GSM338667 GSM338668 GSM338669 GSM338670
DDR1|1007_s_at        4.38      4.38      4.23      3.98      5.12
RFC2|1053_at          3.44      3.33      3.73      3.58      3.40
HSPA6|117_at          2.51      2.62      2.37      2.43      2.49
PAX8|121_at           6.24      6.21      6.10      6.09      6.53
GUCA1A|1255_g_at      2.29      2.23      2.17      2.26      2.27
UBA7|1294_at          5.57      5.83      5.46      5.20      6.83
THRA|1316_at          3.09      3.34      3.10      3.20      3.16
PTPN21|1320_at        2.35      2.46      2.30      2.34      2.21
CCL5|1405_i_at        2.45      2.52      2.48      2.44      2.22
CYP2E1|1431_at        2.33      2.42      2.18      2.15      2.27

Since we are generating this report with an R markdown document with RStudio, we can also use the kable() function of the R package knitr to generate a nicely formatted table.

## Generate a nicely formatted table. 
library(knitr) ## Load the R library knitr 
kable(expr.matrix[1:10,1:5]) ## Generate a nicely formatted table
GSM338666 GSM338667 GSM338668 GSM338669 GSM338670
DDR1|1007_s_at 4.38 4.38 4.23 3.98 5.12
RFC2|1053_at 3.44 3.33 3.73 3.58 3.40
HSPA6|117_at 2.51 2.62 2.37 2.43 2.49
PAX8|121_at 6.24 6.21 6.10 6.09 6.53
GUCA1A|1255_g_at 2.29 2.23 2.17 2.26 2.27
UBA7|1294_at 5.57 5.83 5.46 5.20 6.83
THRA|1316_at 3.09 3.34 3.10 3.20 3.16
PTPN21|1320_at 2.35 2.46 2.30 2.34 2.21
CCL5|1405_i_at 2.45 2.52 2.48 2.44 2.22
CYP2E1|1431_at 2.33 2.42 2.18 2.15 2.27

Distribution of expression values

Before going any further, we will draw an histogram with the distribution of all the expression values for all genes and all samples.

The R function hist() enables to draw an histogram, give a vector of values. To get more information about this function, you can type help(hist) in the R console.

Since the expression matrix is stored in and R object of class data.frame, we will first need to convert it to a vector. For this, we can use the R function unlist().

## Histogram of all expression values
hist(unlist(expr.matrix), breaks=100, 
     main="Den Boer expression values",
     xlab="RMA-normalized expression value (log2-transformed)",
     ylab="Number of measures",
     )