# Exercise

Li and coworkers (2014) used RNA-seq to characterize the transcriptome of 92 skin samples from people suffering from psoriasis and 82 control samples. We pre-processed the raw sequences (NGS reads) to obtain a data table containing the read counts per gene (rows) for each sample (columns).

## Goals of this practical

1. Detection and interpretation of differentially expressed genes: run a statistical analysis of the read counts to detect differentially expressed genes (DEG) between psoriasis and control samples (Li et al. 2014), study the functional associations of these genes, and provide a biological interpretation of the results.

2. Methodological evaluation:

    1. Robustness analysis: analyse the impact of individual particularities of the patients/controls, and measure the impact of the sample size by applying a sub-sampling approach.

    2. Negative control: check the reliability of the DEG detection method (DESeq2) by running the same analysis with two subsets of samples belonging to the same group. For example, select 10 control samples and 10 other control samples, and run the differential analysis between them. In principle, the program should return a negative result, i.e. declare that no gene is differentially expressed, since the samples come from the same group. Consequently, any gene declared differentially expressed in the negative control should be considered a false positive. The expected outcome is thus either no DEG at all or, if some genes are declared differentially expressed, only barely significant p-values. If the negative control returns many DEG and/or very low p-values, there is a problem either with the method or with the dataset (for example, it does not comply with the assumptions underlying the test). More precisely, the negative control is an empirical way to check whether the actual rate of false positives corresponds to the expected rate (i.e. whether the p-value, or derived statistics, can be considered reliable indicators of significance).

3. Use of good practices. An important goal of this course, and of the report, is to learn how to apply good practices to the analysis of NGS data. This includes:

    1. Traceability: you and other people should be able to track the origin of all your results. For this, you need to keep a trace of each step of each analysis.

    2. Reproducibility: other people should be able not only to trace the origin of your results, but also to reproduce them by themselves. NGS data analysis lends itself particularly well to reproducibility, since everything is done on computers and managed via software (tools and scripts).

    3. Portability: the analysis done on your computer should be reproducible on other computers as well. For this you need to ensure, for instance, that all paths are defined as relative rather than absolute paths, and that the path definitions rely on platform-independent methods.
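
As a minimal sketch of the portability point, the snippet below builds file paths relative to a project root with `pathlib`, instead of hard-coding an absolute, platform-specific path. Python is used here only to illustrate the idea (in an R Markdown report, base R's `file.path()` plays a similar role). The directory and file names (`data`, `results`, `counts.tsv`) and the helper `project_path` are hypothetical.

```python
from pathlib import Path


def project_path(*parts, root="."):
    """Join path components under a project root, without hard-coded separators."""
    return Path(root).joinpath(*parts)


# Hypothetical project layout: adapt the names to your own project.
counts_file = project_path("data", "counts.tsv")
results_dir = project_path("results", "deg")

print(counts_file.as_posix())     # data/counts.tsv
print(counts_file.is_absolute())  # False
```

Because every path is derived from the project root, moving the project folder to another computer only requires running the code from that folder.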

## Dataset

### Processing

Table 1. Small piece of the count table.

| Gene | GSM1315790_M9132_Psoriasis_skin_SRX451477 | GSM1315789_M9081_Psoriasis_skin_SRX451476 | GSM1315788_M9049_Psoriasis_skin_SRX451475 |
|:-------|-----:|-----:|-----:|
| ABHD5  | 2126 | 1519 | 1423 |
| ABHD6  | 434  | 263  | 407  |
| ABHD8  | 319  | 418  | 598  |
| ABI1   | 3585 | 3004 | 2610 |
| ABI2   | 652  | 518  | 521  |
| ABI3   | 250  | 151  | 299  |
| ABI3BP | 1956 | 850  | 1299 |
| ABL1   | 2769 | 3475 | 2889 |

The count table contains 23 368 rows (one per gene) and 174 columns (one per sample).
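
To illustrate the kind of per-sample description that such a count table calls for, here is a minimal sketch that log2-transforms the counts of the first sample shown in Table 1 and computes a few summary statistics. The epsilon value (1) and the function names are assumptions chosen for the illustration, and Python is used only to show the logic; it is not the implementation expected in the report.

```python
import math
import statistics


def log2_transform(counts, epsilon=1.0):
    """Return log2(count + epsilon); epsilon avoids log2(0) for genes with zero reads."""
    return [math.log2(c + epsilon) for c in counts]


def summarize(values):
    """Summary statistics for one sample (one column of the count table)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "zeros": sum(1 for v in values if v == 0),
    }


# Counts of sample GSM1315790 for the eight genes listed in Table 1.
raw_counts = [2126, 434, 319, 3585, 652, 250, 1956, 2769]
log_counts = log2_transform(raw_counts)

raw_summary = summarize(raw_counts)
log_summary = summarize(log_counts)
print(raw_summary)
print(log_summary)
```

With an epsilon of 1, a gene with zero reads maps to a log2 value of 0, so zero counts remain easy to spot after the transformation.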

The pheno table contains 174 rows (one per sample) and 5 columns (one per sample attribute).

Table 2. First rows of the pheno table. Each row describes one sample.

| GSM_ID     | M_ID  | Group     | tissue | SRX_ID    |
|:-----------|:------|:----------|:-------|:----------|
| GSM1315790 | M9132 | Psoriasis | skin   | SRX451477 |
| GSM1315789 | M9081 | Psoriasis | skin   | SRX451476 |
| GSM1315788 | M9049 | Psoriasis | skin   | SRX451475 |
| GSM1315787 | M9027 | normal    | skin   | SRX451474 |
| GSM1315786 | M9004 | normal    | skin   | SRX451473 |
| GSM1315785 | M9000 | normal    | skin   | SRX451472 |

Table 3. Number of samples per group.

| Group     | Number of samples |
|:----------|------------------:|
| normal    | 82 |
| Psoriasis | 92 |
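
The column names of the count table appear to concatenate the five pheno attributes with underscores (compare Table 1 and Table 2). Assuming this convention holds for every sample, the sketch below splits the three column names shown in Table 1 into pheno records and counts the samples per group. The function name `parse_sample_name` is hypothetical, and Python is used only to illustrate the logic.

```python
from collections import Counter

PHENO_FIELDS = ("GSM_ID", "M_ID", "Group", "tissue", "SRX_ID")


def parse_sample_name(name):
    """Split a count-table column name into the five pheno attributes."""
    parts = name.split("_")
    if len(parts) != len(PHENO_FIELDS):
        raise ValueError(f"unexpected sample name format: {name!r}")
    return dict(zip(PHENO_FIELDS, parts))


# The three column names visible in Table 1.
sample_names = [
    "GSM1315790_M9132_Psoriasis_skin_SRX451477",
    "GSM1315789_M9081_Psoriasis_skin_SRX451476",
    "GSM1315788_M9049_Psoriasis_skin_SRX451475",
]

pheno = [parse_sample_name(name) for name in sample_names]
group_sizes = Counter(row["Group"] for row in pheno)
print(pheno[0])
print(group_sizes)
```

Checking that the parsed attributes agree with the pheno table is a cheap way to verify that the columns of the count table and the rows of the pheno table are in the same order.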

## Analyses

1. Differential analysis. Analyse the full dataset to detect differentially expressed genes. This analysis should include the following steps.

    1. Log2-transformation of the counts (with an epsilon).
    2. Graphical description of the data (histograms, barplots, boxplots, …).
    3. Computation of summary statistics per sample (min, max, mean, median, quartiles, number of zeros, …) for raw counts and log2-transformed counts.
    4. Detection of differentially expressed genes (including graphical representations: volcano plot, p-value histogram, …).
    5. Functional enrichment of the differentially expressed genes.
    6. … any other type of analysis, figure or table that you might find useful to interpret the data.

2. Sub-sampling. Run the same analysis on randomly selected subsets of the samples, with various subset sizes (n=2,3,4,5,10,20,40).

3. Negative control. Run the same analysis with two subsets of samples belonging to the same group (psoriasis versus psoriasis, control versus control).
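
The sample selection behind the sub-sampling and negative-control analyses can be sketched as follows. This is only an illustration of the design, written in Python with the standard library: the sample labels (`N1`…`N82`, `P1`…`P92`) are placeholders rather than real sample IDs, the function names and the seed are arbitrary choices, the sketch assumes that n samples are drawn from each group, and the differential analysis itself (DESeq2) is not shown.

```python
import random


def subsample_groups(groups, n, rng):
    """Draw n samples per group, without replacement."""
    return {group: rng.sample(samples, n) for group, samples in groups.items()}


def negative_control_split(samples, n, rng):
    """Draw two disjoint subsets of n samples from the same group."""
    if 2 * n > len(samples):
        raise ValueError("not enough samples for two disjoint subsets")
    picked = rng.sample(samples, 2 * n)
    return picked[:n], picked[n:]


# Placeholder labels matching the group sizes of Table 3.
groups = {
    "normal": [f"N{i}" for i in range(1, 83)],
    "Psoriasis": [f"P{i}" for i in range(1, 93)],
}

rng = random.Random(42)  # fixed seed, so that the random draws can be reproduced

# Sub-sampling: one random draw per subset size.
sizes = (2, 3, 4, 5, 10, 20, 40)
subsets = {n: subsample_groups(groups, n, rng) for n in sizes}

# Negative control: two pseudo-groups drawn from the same real group.
pseudo_a, pseudo_b = negative_control_split(groups["normal"], 10, rng)
print(pseudo_a)
print(pseudo_b)
```

Fixing the seed keeps the random selection traceable and reproducible. Repeating the draw with several seeds would additionally show how much the results depend on the particular samples selected.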

# Report

Reports can be written in either English or French.

## Format of the report

1. Source document in Rmd. The primary report is an R markdown document (.Rmd extension), which must contain all the code used to run the analyses, as well as the main tables and figures produced by the analysis, and a text structured according to the common practice for scientific articles.

    - The R code should be compliant with the following guidelines: https://google.github.io/styleguide/Rguide.xml
    - The R code should be properly documented.
    - This document should enable us to reproduce your analysis on our own computer (avoid absolute paths).

2. Compiled report (html or pdf). This report should look like a small scientific article (see structure hereafter) with figures, tables, and interpretation of the results. The R code should not be displayed in the compiled report (set the knitr option echo=FALSE when generating the final version). Think of your report as a document written for a biologist who wants to understand the approach and the results, but is not interested in the technical details of the R programming.

## Structure of the report

In total, the report should not exceed 5-6 pages, including figures, but without counting the bibliographic references and appendices (for which there is no limit).

1. Introduction: a brief summary (5-10 lines) of the biological context (the disease, the transcriptome), the biological questions addressed in the report, and the general approach envisaged to answer these questions.

2. Material and Methods: a summary of the bioinformatics / statistical methods and libraries used for the analysis (1/2 to 1 page), with a brief explanation of which tool was used for which task, and links to the official web page or publication of each tool. If required, additional details about the methods and parameters can be provided in an appendix.

3. Data description: a brief description of the data source (with link to the GEO record), its content (how many samples, how many groups, …) and the data type (paired-end or single-end, …).

4. Results and discussion: the results should be presented and discussed together in this section. This section contains the main figures and tables that are used to interpret the results. Additional figures and tables can be provided as appendices, or as separate files (for example full result tables with all the genes, …).

5. Conclusion and perspectives (~1/2 page): summarize the results, state to what extent they did, or did not, enable you to answer the questions raised in the introduction, and add some perspectives about possible future extensions of the work presented here.

6. Appendices: any additional information or result that might be helpful to consult in order to get a deeper understanding of your results.

Format: the report can be submitted in either html or pdf format. The original Rmd file that was used to generate the report must be submitted together with the report. This file should allow the teachers to reproduce the analysis on their computers.

## Evaluation criteria

The evaluation of your report will be based on multiple criteria.