About plotnine

When doing descriptive statistics, we frequently need to partition graphics according to categorical (i.e. qualitative) or ordinal variables. Producing such graphics can be particularly difficult with classical Python graphical libraries (e.g. matplotlib). R benefits from a very nice library for this task, the ggplot2 package developed by Hadley Wickham (Wickham 2016). This package quickly became popular in the bioinformatics field, where categories may be genes, groups of genes, species, signaling pathways or epigenetic marks, and an ordinal variable may be, for instance, a discretized expression level. The ggplot2 R package is an implementation of the graphical model proposed by Leland Wilkinson in his book The Grammar of Graphics (Wilkinson 2016). In this model, a graph is viewed as an entity composed of data, layers, scales, a coordinate system and facets. One creates a graphic and then adds the various components using the + operator. Although the syntax may appear a little tricky to beginners, the benefit of such an approach quickly becomes clear when composing complex diagrams.

Several projects have ported ggplot2 to Python. The plotnine library is one of them and offers a rather stable and exhaustive port. In the following tutorial, we will use the chickwts dataset, which is available in the R datasets library. We will provide this dataset as a file that can be downloaded from a URL. The information we have about the chickwts dataset is the following:

Chicken Weights By Feed Type: An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens.

Downloading the dataset

We will download the dataset (a tabulated flat file) and load it into a pandas DataFrame. The DataFrame object is modeled on a popular object from the R world (data.frame). A DataFrame can be viewed as a matrix whose columns may be of various types (object, int64, float64…). It provides various methods to perform operations on the dataset.

##     weight       feed
## 0      179  horsebean
## 1      160  horsebean
## 2      136  horsebean
## 3      227  horsebean
## 4      217  horsebean
## 5      168  horsebean
## 6      108  horsebean
## 7      124  horsebean
## 8      143  horsebean
## 9      140  horsebean
## 10     309    linseed
## 11     229    linseed
## 12     181    linseed
## 13     141    linseed
## 14     260    linseed
## 15     203    linseed
## 16     148    linseed
## 17     169    linseed
## 18     213    linseed
## 19     257    linseed
## 20     244    linseed
## 21     271    linseed
## 22     243    soybean
## 23     230    soybean
## 24     248    soybean
## 25     327    soybean
## 26     329    soybean
## 27     250    soybean
## 28     193    soybean
## 29     271    soybean
## ..     ...        ...
## 41     226  sunflower
## 42     320  sunflower
## 43     295  sunflower
## 44     334  sunflower
## 45     322  sunflower
## 46     297  sunflower
## 47     318  sunflower
## 48     325   meatmeal
## 49     257   meatmeal
## 50     303   meatmeal
## 51     315   meatmeal
## 52     380   meatmeal
## 53     153   meatmeal
## 54     263   meatmeal
## 55     242   meatmeal
## 56     206   meatmeal
## 57     344   meatmeal
## 58     258   meatmeal
## 59     368     casein
## 60     390     casein
## 61     379     casein
## 62     260     casein
## 63     404     casein
## 64     318     casein
## 65     352     casein
## 66     359     casein
## 67     216     casein
## 68     222     casein
## 69     283     casein
## 70     332     casein
## 
## [71 rows x 2 columns]
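The loading step can be sketched as follows. The download URL is not reproduced here, so this sketch reads a short inline excerpt of the table above through io.StringIO. It assumes a tab-separated file with a header line, and the variable name chickwts is chosen for illustration. With the real file, the URL would be passed to pandas.read_csv() in place of the buffer.

```python
import io

import pandas as pd

# Inline excerpt standing in for the downloaded file (assumption: the real
# file is tab-separated with a header line; its URL is not reproduced here).
excerpt = (
    "weight\tfeed\n"
    "179\thorsebean\n"
    "160\thorsebean\n"
    "309\tlinseed\n"
    "229\tlinseed\n"
    "243\tsoybean\n"
    "368\tcasein\n"
)

# pandas.read_csv() accepts a URL, a path or a file-like object.
chickwts = pd.read_csv(io.StringIO(excerpt), sep="\t")
print(chickwts)
```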

What is the type of the object returned by pandas.read_csv()? What are the types of the columns?

## <class 'pandas.core.frame.DataFrame'>
## weight     int64
## feed      object
## dtype: object
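A minimal sketch of how such an output can be obtained, using a three-row stand-in for the dataset (the name chickwts is illustrative). The dtype reported for text columns may differ between pandas versions.

```python
import pandas as pd

# Three rows taken from the table above, as a stand-in for the full dataset.
chickwts = pd.DataFrame(
    {"weight": [179, 160, 309], "feed": ["horsebean", "horsebean", "linseed"]}
)

# type() gives the class of the object, .dtypes the type of each column.
print(type(chickwts))
print(chickwts.dtypes)
```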

Creating a basic diagram

Changing the global diagram theme

The diagram theme can be changed by calling plotnine functions whose names start with theme_.

  • Using completion, discover the various functions providing built-in themes for the diagrams.
  • Test some of them to change the global graphic rendering.

Theming your diagram

The diagrams can be tweaked more deeply by passing some arguments to the theme() function. Several aspects of the diagram are thus themeable. The list of themeable elements is provided here. The themeables are objects of several classes:

For instance, the following code changes various elements of the theme:

  • The axis text.
  • The axis title.
  • The axis ticks.
  • The panel grid.
## <ggplot: (-9223372029296779204)>

  • Use the themeable elements to create your own boxplot or violin diagram. Think about using HTML hexadecimal colors such as those proposed on the ColorBrewer web site.

More on aesthetics/mapping

So far, for a boxplot or violin plot, we have defined the categories that should appear on the x axis and the values whose distributions should appear on the y axis. We may also assign additional aesthetics, such as ‘fill’ (the color of the boxes in the boxplot) or ‘color’ (the color of the box borders). Other aesthetics are available depending on the geom function.

If ‘fill’ and ‘color’ are passed directly to geom_boxplot(), the same colors are applied to all boxes.

## <ggplot: (-9223372029292431374)>

Another solution is to pass ‘fill’ and ‘color’ to the aes() function. This means that we want the colors to change according to the categories found in ‘feed’. It also means that we need a way to tell plotnine which colors to apply; otherwise it will just use a set of default colors.

## <ggplot: (7563094101)>

To change the colors we now need to use the scale_color_manual() and scale_fill_manual() functions to which we can pass a dictionary containing the classes to colors mapping. Here the classes are the following:

## Index(['casein', 'horsebean', 'linseed', 'meatmeal', 'soybean', 'sunflower'], dtype='object')

So we just need to create a dictionary mapping the classes (i.e. the values of ‘feed’) to the associated colors.

## <ggplot: (7563644238)>

We may improve this plot by changing the legend attributes.

## <ggplot: (7563644238)>

  • Create your own violin plot using your own theme.

Other diagrams

There are about 40 different graphics currently available in plotnine. The names of the associated functions start with ‘geom_’ (e.g. geom_boxplot, geom_tile, geom_text, geom_smooth, geom_rug, geom_histogram, geom_bar…).

Histograms and densities

In the case of histograms, the x axis corresponds to intervals (‘bins’) and the y axis to the number of values falling in the intervals. Thus, there is only one value to pass to aes().

## <ggplot: (7564864681)>

Probability density can be displayed using geom_density(). Unfortunately, the number of observations in each class is clearly too limited here. We will see a better example later.

## <ggplot: (-9223372029289950404)>

Overlaying diagrams

One can overlay several diagrams of various types very easily, just using the ‘+’ operator. For instance one can first create a simple scatterplot using geom_point().

## <ggplot: (7563095048)>

However, as there are some ties, it is advisable to use the geom_jitter() function, which adds some randomness to the x values (these are categorical here but can be viewed as 1, 2, 3…).

## <ggplot: (7563095048)>

As diagrams can be viewed as layers, we may also add a geom_rug() layer.

## <ggplot: (7563095048)>

Partitioning graphics using facets

Partitioning the diagram based on a given factor/variable allows one to explore the dataset in depth. For the following example, we will create a matrix containing the results of an artificial ELISA experiment in which several measurements, made at two different times and by four different researchers, are recorded.

Loading the dataset

The column names of the DataFrame are the following:

The number of elements:

These are ELISA plates so each of them contains 96 elements.
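The original loading code and its outputs are not shown. As a stand-in, the sketch below simulates a table with the same overall layout: two days, four users and 96 wells per plate. Only the ‘values’, ‘user’ and ‘days’ column names come from the text; the ‘row’ and ‘col’ names, the labels and the random values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# One line per well: 2 days x 4 users x 8 plate rows x 12 plate columns.
index = pd.MultiIndex.from_product(
    [["day1", "day2"],
     ["user1", "user2", "user3", "user4"],
     list("ABCDEFGH"),
     range(1, 13)],
    names=["days", "user", "row", "col"],
)
elisa = index.to_frame(index=False)

# Simulated measurements (arbitrary distribution, fixed seed).
rng = np.random.default_rng(0)
elisa["values"] = rng.normal(loc=1.0, scale=0.2, size=len(elisa))

print(list(elisa.columns))                       # column names
print(len(elisa))                                # number of elements
print(elisa.groupby(["days", "user"]).size())    # elements per plate
```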

Faceting

Using plotnine or ggplot2 syntax, it becomes very easy to assess the distribution of the results (available in the ‘values’ column) depending on users:

## <ggplot: (-9223372029288677169)>

Interestingly, one can also easily create facets based on two variables, here user and days:

## <ggplot: (-9223372029286675175)>

Or alternatively…

## <ggplot: (7569214079)>

Or even

## <ggplot: (7569844400)>

Creating heatmaps

As we are working with an artificial ELISA dataset, it can be interesting to reproduce a color-coded image of the ‘original’ ELISA plates. The geom_tile() function can be used to create heatmaps.

## <ggplot: (-9223372029288673287)>

Practical 1 (About processed pseudogenes)

The dataset

Here, our dataset contains several pieces of information related to transcripts in the human genome. They were computed using pygtftk (v0.9.8) from a GTF file downloaded from Ensembl (genome version GRCh38, release 92). First we will load the dataset, set the row names to the transcript ids and inspect the column names.

  • What is the type of tx_info?
  • How many lines does the dataset contain?
  • How many columns?
  • What are the column names?
  • How many chromosomes are defined? You can access columns using tx_info['column_name'].
  • How many different transcripts are there? The row names can be found in the ‘.index’ attribute of the DataFrame.
  • How many genes can be found in the dataset? Use the gene_id column.
  • What are the possible values for ‘gene_biotype’?
## <class 'pandas.core.frame.DataFrame'>
## 178654
## 31
## ['ccds_id', 'convergent', 'dist_prom_overlap_tss_from_other_gene', 'dist_to_convergent', 'dist_to_divergent', 'divergent', 'end', 'exon_sizes', 'feature', 'gene_biotype', 'gene_id', 'gene_name', 'gene_source', 'gene_version', 'intron_size', 'mature_rna_size', 'nb_exons', 'phase', 'prom_overlap_tss_from_other_gene', 'score', 'seqid', 'source', 'start', 'strand', 'tag', 'transcript_biotype', 'transcript_name', 'transcript_source', 'transcript_support_level', 'transcript_version', 'tx_genomic_size']

Subsetting the dataset

In the subsequent analysis we will only focus on transcript classes (‘gene_biotype’) for which at least 500 transcripts are found.
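This kind of filter can be sketched with value_counts() and isin(). The helper name keep_frequent_biotypes and the toy table are hypothetical; on the real tx_info table the threshold would be 500.

```python
import pandas as pd


def keep_frequent_biotypes(tx_info, min_count=500):
    """Keep the rows whose 'gene_biotype' occurs at least min_count times."""
    counts = tx_info["gene_biotype"].value_counts()
    frequent = counts[counts >= min_count].index
    return tx_info[tx_info["gene_biotype"].isin(frequent)]


# Toy table with made-up transcript ids, used only to show the behavior.
toy = pd.DataFrame(
    {"gene_biotype": ["protein_coding", "protein_coding", "protein_coding",
                      "lincRNA", "lincRNA", "processed_pseudogene"]},
    index=["tx1", "tx2", "tx3", "tx4", "tx5", "tx6"],
)

# With a threshold of 2, the single 'processed_pseudogene' row is dropped.
subset = keep_frequent_biotypes(toy, min_count=2)
print(subset["gene_biotype"].value_counts())
```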

Looking at transcript length and the case of processed pseudogene

  • Explore the transcript genomic length variable (‘tx_genomic_size’, the size of exons plus introns). What can you say about the distributions of transcript genomic length with respect to chromosome and gene_biotype? You can use geom_density() or geom_histogram() for graphical display.
  • What about the size of processed pseudogenes? How might one explain this difference?
## <ggplot: (-9223372029273113533)>

## <ggplot: (-9223372029275049184)>

## <ggplot: (-9223372029280328637)>

## <ggplot: (-9223372029280332968)>

  • Now, what can you say about the number of exons per transcript depending on ‘gene_biotype’? Use an appropriate graphic to display this information.
  • What can we say about processed pseudogenes? Is that expected?
## <ggplot: (-9223372029281954602)>

For the next diagrams we will order the chromosomes from 1 to 22, then X and Y. Plotnine doesn’t know about chromosome order. By default it will just display them the way they were encountered. However, we can define the order of a qualitative variable. To do so, we use pd.Categorical(), which turns a DataFrame column into a categorical variable that can also be ordered (ordinal variable):
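A sketch of the conversion on a toy table (the transcript ids are made up). The category list assumes the chr1…chr22, chrX, chrY naming visible in the output below.

```python
import pandas as pd

# Toy stand-in for tx_info, with chromosomes in arbitrary order.
tx_info = pd.DataFrame(
    {"seqid": ["chr10", "chr2", "chrX", "chr1"]},
    index=["tx1", "tx2", "tx3", "tx4"],
)

# Desired order: chr1 to chr22, then chrX and chrY (24 categories).
chrom_order = ["chr" + str(i) for i in range(1, 23)] + ["chrX", "chrY"]

# ordered=True makes the column ordinal: sorting and plotting follow chrom_order.
tx_info["seqid"] = pd.Categorical(
    tx_info["seqid"], categories=chrom_order, ordered=True
)
print(tx_info["seqid"].head())
```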

## transcript_id
## ENST00000456328    chr1
## ENST00000450305    chr1
## ENST00000488147    chr1
## ENST00000473358    chr1
## ENST00000469289    chr1
## Name: seqid, dtype: category
## Categories (24, object): [chr1 < chr2 < chr3 < chr4 ... chr21 < chr22 < chrX < chrY]
  • What can you say about the distribution of each gene biotype across chromosomes (i.e. the number of transcripts of each category per chromosome)? Draw a barplot (geom_bar) showing the number of transcripts from each gene_biotype class on each chromosome. Use geom_bar with the position argument set to ‘stack’, ‘dodge’ or ‘fill’. Depending on this argument, do you get the same impression of how gene biotypes are distributed across the chromosomes? What are the pros and cons of each representation?
  • What can we say about ‘processed pseudogene’?
  • What is another intriguing feature of the Y chromosome?
## <ggplot: (-9223372029278027797)>

## <ggplot: (7573341802)>

## <ggplot: (7577656247)>

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wilkinson, Leland. 2016. The Grammar of Graphics (Statistics and Computing) 2nd Edition. Springer.