When doing descriptive statistics we frequently need to partition the graphics based on categorical (i.e. qualitative) or ordinal variables. Doing such graphics may be particularly difficult when using classical Python graphical libraries (e.g matplotlib). The R software benefits from a very nice library for such a task, the ggplot2 package developed by Hadley Wickham (Wickham 2016) This package has been quickly became really popular in the bioinformatic field (here categories may be gene, groups of genes, species, signaling pathways, epigenetic marks…) and ordinal variables a discretized level of expression, for instance. The ggplot2 R package is an implementation of the graphical model proposed by Leland Wilkison in his book: The Grammar of Graphics (Wilkinson 2016). In this model, the graph is viewed as an entity composed of data, layers, scales, coordinate system and facets. One can create a graphic then add the various component using the + operator. Although the syntax may appear a little bit tricky for beginners, one can quickly understand the benefit of such an approach when composing complexe diagrams.
Several projects have proposed a port of ggplot2 under Python. The plotnine library is one of these projects that proposes a rather stable and exhaustive port of ggplot2 under Python. In the subsequent tutorial, we will use the chickwts dataset that is available in the R datasets library. We will propose this dataset through an URL available for download. The information we have about the chickwts dataset are the following:
Chicken Weights By Feed Type: An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens.
We will download the dataset (a tabulated flat file) and load it into a pandas DataFrame. The DataFrame¨ object is also a port from a popular object in the R world (data.frame). This DataFrame can be viewed as a matrix whose columns may be of various types (Objects, int64, floats…). The DataFrame object contains various functions to perform operations on the dataset.
import pandas as pd
## The URL of the dataset
GITHUB_REPO = "https://raw.githubusercontent.com/dputhier/jgb53d-bd-prog_github/master/"
PATH="practicals/intro_ggplot/chickwts.txt"
## Read the dataset into a
chickwts = pd.read_csv(filepath_or_buffer=GITHUB_REPO+PATH, header=0, sep="\t")
print(chickwts)
## weight feed
## 0 179 horsebean
## 1 160 horsebean
## 2 136 horsebean
## 3 227 horsebean
## 4 217 horsebean
## 5 168 horsebean
## 6 108 horsebean
## 7 124 horsebean
## 8 143 horsebean
## 9 140 horsebean
## 10 309 linseed
## 11 229 linseed
## 12 181 linseed
## 13 141 linseed
## 14 260 linseed
## 15 203 linseed
## 16 148 linseed
## 17 169 linseed
## 18 213 linseed
## 19 257 linseed
## 20 244 linseed
## 21 271 linseed
## 22 243 soybean
## 23 230 soybean
## 24 248 soybean
## 25 327 soybean
## 26 329 soybean
## 27 250 soybean
## 28 193 soybean
## 29 271 soybean
## .. ... ...
## 41 226 sunflower
## 42 320 sunflower
## 43 295 sunflower
## 44 334 sunflower
## 45 322 sunflower
## 46 297 sunflower
## 47 318 sunflower
## 48 325 meatmeal
## 49 257 meatmeal
## 50 303 meatmeal
## 51 315 meatmeal
## 52 380 meatmeal
## 53 153 meatmeal
## 54 263 meatmeal
## 55 242 meatmeal
## 56 206 meatmeal
## 57 344 meatmeal
## 58 258 meatmeal
## 59 368 casein
## 60 390 casein
## 61 379 casein
## 62 260 casein
## 63 404 casein
## 64 318 casein
## 65 352 casein
## 66 359 casein
## 67 216 casein
## 68 222 casein
## 69 283 casein
## 70 332 casein
##
## [71 rows x 2 columns]
What about the type of object returned by pandas.read_csv() ? What about the types of the columns ?
## <class 'pandas.core.frame.DataFrame'>
## weight int64
## feed object
## dtype: object
The boxplot and violin plots can be used to display the distributions of the underlying variables:
## loading the plotnine package
## Here we load all object enclosed in the package.
## This is useful for analysis but may be harmful
## in a development context
from plotnine import *
## Then we declare a new graphics and associate
## a dataset. Here the aes (aesthetic) argument is set
## to feed and weight.
p = ggplot(data=chickwts, mapping=aes( x='feed', y='weight'))
## We have to indicate the type of requested graphics.
## Here, a boxplot.
print(p + geom_boxplot())
## <ggplot: (7557755824)>
## <ggplot: (7557762785)>
The diagram theme can be changed using call to functions from plotnine starting with theme_.
## <ggplot: (-9223372029294182200)>
## <ggplot: (-9223372029296720864)>
The diagrams can be tweaked more deeply by passing some arguments to the theme() function. Several aspects of the diagram are thus themeable. The list of themeable elements is provided here. The themeables are objects of several classes:
For instance, the following code changes various elements of the theme:
my_plot = p + geom_boxplot() + theme(axis_text=element_text(color='red',size=10,rotation=45),
axis_title=element_text(color='green',size=10),
axis_ticks=element_line(color='violet',size=20),
panel_grid_major=element_line(color='lightblue',size=1),
panel_grid_minor=element_line(color='orange',size=1),
)
print(my_plot)
## <ggplot: (-9223372029296779204)>
# example answer
my_plot = p + geom_boxplot() + theme(axis_text=element_text(color='#e41a1c',size=10,rotation=45),
axis_title=element_text(color='#377eb8',size=10),
axis_ticks=element_line(color='#4daf4a',size=20),
panel_grid_major=element_line(color='#984ea3',size=1),
panel_grid_minor=element_line(color='#ff7f00',size=1),
plot_background=element_rect(color='#a6cee3',size=2, fill='#a6cee3')
)
print(my_plot)
## <ggplot: (-9223372029292887373)>
At the moment, we have defined, for a boxplot or violin plot, the categories that should appear on the x axis and the values whose distributions should appear on the y axis. We may also assign additional aesthetics. These additional aesthetics may be ‘fill’ (for the color of the boxes in the boxplot) or the ‘color’ for the colors of the box borders (there are also additional aesthetic depending on the geom function).
If ‘fill’ and ‘color’ are passed to the geom_boxplot() this mean that colors should be the same for all boxes.
## Then we declare a new graphics and associate
## a dataset. Here the aes (aesthetic) argument is set
## to feed and weight.
my_plot = ggplot(data=chickwts,
mapping=aes( x='feed', y='weight')) + geom_boxplot(color='#984ea3', fill='#ff7f00')
print(my_plot)
## <ggplot: (-9223372029292431374)>
Now, another solution is to path ‘fill’ and ‘colors’ to the aes() function. In this case it means that we want to change the colors according to the categories found in ‘feed’. This also mean that we need a way to tell plotnine which colors we want to apply as it will just use a set of default colors.
## Then we declare a new graphics and associate
## a dataset. Here the aes (aesthetic) argument is set
## to feed and weight.
my_plot = ggplot(data=chickwts,
mapping=aes( x='feed', y='weight', color='feed', fill='feed')) + geom_boxplot()
print(my_plot)
## <ggplot: (7563094101)>
To change the colors we now need to use the scale_color_manual() and scale_fill_manual() functions to which we can pass a dictionary containing the classes to colors mapping. Here the classes are the following:
## Index(['casein', 'horsebean', 'linseed', 'meatmeal', 'soybean', 'sunflower'], dtype='object')
So we just need to create a dictionnary containing the classes (i.e ‘feed’) and the associated colors.
color_dict = {'casein': 'red',
'horsebean': 'darkblue',
'linseed': 'violet',
'meatmeal': 'orange',
'soybean': 'yellow',
'sunflower':'pink'}
my_plot = ggplot(data=chickwts,
mapping=aes( x='feed', y='weight', color='feed', fill='feed')) + geom_boxplot()
my_plot += scale_color_manual(values=color_dict)
my_plot += scale_fill_manual(values=color_dict)
print(my_plot)
## <ggplot: (7563644238)>
We may improve this plot by changing the legend attributes.
my_plot += theme(legend_position="bottom",
legend_text=element_text(size=8),
legend_key=element_rect(colour="white", fill="white"),
legend_title=element_blank())
print(my_plot)
## <ggplot: (7563644238)>
color_dict = {'casein': '#80FF80',
'horsebean': '#BF0000',
'linseed': '#BFFF40',
'meatmeal': '#00FFFF',
'soybean': '#0000FF',
'sunflower':'#0000BF'}
my_plot = ggplot(data=chickwts,
mapping=aes( x='feed', y='weight', fill='feed')) + geom_violin(color='black')
my_plot += scale_fill_manual(values=color_dict)
my_plot += theme(legend_position="bottom",
legend_text=element_text(size=8),
legend_key=element_rect(colour="white", fill="white"),
legend_title=element_blank())
print(my_plot)
## <ggplot: (7564757368)>
There are about 40 different graphics currently available in plotnine. The names of the associated functions start with ’geom_’ (e.g.* geom_boxplot, geom_tile, geom_text, geom_smooth, geom_rug, geom_hist, geom_bar…).
In the case of histograms, the x axis corresponds to intervals (‘bins’) and the y axis to the number of values falling in the intervals. Thus, there is only one value to pass to aes().
## Here the code for an histogram
my_plot = ggplot(data=chickwts, mapping=aes(x='weight')) + geom_histogram(color='white', bins=10)
print(my_plot)
## <ggplot: (7564864681)>
Probability density can be displayed using geom_density(). Unfortunatly, the number of data in each class is clearly too limited here. We will see a better example later.
my_plot = ggplot(data=chickwts, mapping=aes(x='weight', fill='feed')) + geom_density(adjust = 1/4, alpha=0.5)
print(my_plot)
## <ggplot: (-9223372029289950404)>
One can overlay several diagrams of various types very easily, just using the ‘+’ operator. For instance one can first create a simple scatterplot using geom_point().
## We can, for instance, show the values associated to each feed
my_plot = ggplot(data=chickwts,
mapping=aes( x='feed', y='weight', fill='feed', color='feed')) + geom_point(size=4)
print(my_plot)
## <ggplot: (7563095048)>
However as they are some ties it may be advised to use the geom_jitter() function that will add some randomness to the value of the x axis (that here are categorical but can be viewed as 1, 2, 3…)
## <ggplot: (7563095048)>
As diagrams can be viewed as layers, we may also add a geom_rug() layer.
## <ggplot: (7563095048)>
Partitioning the diagram based on a given factor/variable allows one to explore deeply the dataset. For the following example, we will create a matrix containing the results of an artificial ELISA experiment in which several measures done at two different times and by four different researchers are recorded.
URL = "https://tinyurl.com/ycdhmof8"
elisa = pd.read_csv(filepath_or_buffer=URL, header=0, sep="\t")
elisa.head()
The column names of the DataFrame are the following:
The number of elements:
These are ELISA plates so each of them contains 96 elements.
Using plotnine or ggplot2 syntax, it becomes very easy to assess the distribution of the results (available in the ‘values’ column) depending on users:
my_plot = ggplot(data = elisa, mapping = aes(x='value'))
my_plot += geom_histogram(fill="blue", color="white", bins=20)
my_plot += theme_bw()
my_plot += theme(strip_background = element_rect(colour = "white", fill="orange"))
print(my_plot + facet_grid(facets = '~ user'))
## <ggplot: (-9223372029288677169)>
Interestingly, one can also easily create facets based on two variables, here user and days:
## <ggplot: (-9223372029286675175)>
Or alternatively…
## <ggplot: (7569214079)>
Or even
## <ggplot: (7569844400)>
As we are working with an artificial ELISA dataset, it can be interesting to reproduce a color-coded image of the ‘original’ ELISA plates. The geom_tile() function can be used to create heatmaps.
my_plot = ggplot(data = elisa, mapping = aes(x='columns', y='rows', fill='value'))
my_plot += geom_tile()
my_plot += theme(axis_text_x = element_text(size = 10),
axis_text_y = element_text(size = 10),
panel_background = element_rect(fill="white"),
panel_grid_major = element_line(colour = "white"),
strip_background = element_rect(colour = "orange", fill="orange"))
my_plot += scale_fill_gradientn(colors = ["#0000BF", "#0000FF",
"#0080FF", "#00FFFF",
"#40FFBF", "#80FF80",
"#BFFF40", "#FFFF00",
"#FF8000", "#FF0000",
"#BF0000"])
print(my_plot + facet_grid(facets = 'day ~ user'))
## <ggplot: (-9223372029288673287)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
Here, our dataset contains several informations related to transcripts in the human genome. They were computed using pygtftk (v0.9.8) from a GTF file downloaded from ensembl (genome version GRCh38, release 92). First we will load the dataset, set the row names to the transcript ids and inspect the column names.
## The URL pointing to the dataset
URL = "https://tinyurl.com/hg38-tx-info-txt"
## Read the dataset into a pandas DataFrame
## The transcript names are located in column 11 (10 in zero-based)
## The first line (0 in zero-based) is the header
tx_info = pd.read_csv(filepath_or_buffer=URL, header=0, sep="\t", index_col=10)
## <class 'pandas.core.frame.DataFrame'>
## 178654
## 31
# What are the columns names ?
# or in a more readable way: sorted(tx_info.columns)
print(sorted(tx_info.columns.tolist()))
## ['ccds_id', 'convergent', 'dist_prom_overlap_tss_from_other_gene', 'dist_to_convergent', 'dist_to_divergent', 'divergent', 'end', 'exon_sizes', 'feature', 'gene_biotype', 'gene_id', 'gene_name', 'gene_source', 'gene_version', 'intron_size', 'mature_rna_size', 'nb_exons', 'phase', 'prom_overlap_tss_from_other_gene', 'score', 'seqid', 'source', 'start', 'strand', 'tag', 'transcript_biotype', 'transcript_name', 'transcript_source', 'transcript_support_level', 'transcript_version', 'tx_genomic_size']
In the subsequent analysis we will only focus on transcript classes (‘gene_biotype’) for which at least 500 transcripts are found.
## The global distribution
my_plot = ggplot(data=tx_info, mapping=aes( x='tx_genomic_size'))
my_plot += geom_histogram(bins=100, fill="#E69F00", color="white")
my_plot += scale_x_continuous(trans='log10')
my_plot += theme_bw()
print(my_plot)
## <ggplot: (-9223372029273113533)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## The distribution faceted by chromosomes
## There seems not to be any particular association between
## Chromosomes and transcript length
print(my_plot + facet_grid('seqid~', scales="free_y") + theme(strip_text=element_text(rotation=0, size=5), panel_grid=element_blank(), axis_text_y=element_text(size=4), panel_spacing=0))
## <ggplot: (-9223372029275049184)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## The distribution faceted by chromosomes
## There seems not to be any particular association between
## Chromosomes and transcript length
print(my_plot + facet_grid('gene_biotype~', scales="free_y") + theme(strip_text=element_text(rotation=0, size=5), panel_grid=element_blank(), axis_text_y=element_text(size=4), panel_spacing=0.2))
## <ggplot: (-9223372029280328637)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## The distribution faceted by chromosomes
## can be viewed alternatively using a density
my_plot = ggplot(data=tx_info, mapping=aes( x='tx_genomic_size', fill='gene_biotype'))
my_plot += geom_density(alpha=0.5)
my_plot += scale_x_continuous(trans='log10')
my_plot += theme_bw()
my_plot += theme(strip_text=element_text(rotation=0, size=5), panel_grid=element_blank(), axis_text_y=element_text(size=4), panel_spacing=0.2)
my_plot += facet_grid('gene_biotype~', scales="free_y")
print(my_plot)
## <ggplot: (-9223372029280332968)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## The global distribution
my_plot = ggplot(data=tx_info, mapping=aes( x='gene_biotype', y='nb_exons', fill='gene_biotype', color='gene_biotype'))
my_plot += geom_boxplot()
my_plot += scale_y_continuous(trans='log2')
my_plot += theme_bw()
my_plot += ylab('log2(number of exons)')
my_plot += theme(axis_text_x=element_text(size=6, rotation=45),
panel_grid_minor=element_line(size=1))
my_plot += geom_abline(intercept = 1, slope=0)
print(my_plot)
## <ggplot: (-9223372029281954602)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
For the next diagrams we will order the chromosome from 1 to 22 then X, Y. Plotnine doesn’t know about the chromosomes order. By default it will just display them the way they were encountered. However, we can define the order of a qualitative variable. To this aim we need to use pd.Categorical() that can change a dataframe column so that it becomes a categorial variable that can even be ordered (ordinal variable):
## Define an ordinal variable
chrom_list = ["chr" + str(x) for x in range(1,23) ]
chrom_list += ['chrX', 'chrY']
tx_info.seqid = pd.Categorical(tx_info.seqid.tolist(), categories=chrom_list, ordered=True)
## The categories are know part
## of the column definition
print(tx_info.seqid.head())
## transcript_id
## ENST00000456328 chr1
## ENST00000450305 chr1
## ENST00000488147 chr1
## ENST00000473358 chr1
## ENST00000469289 chr1
## Name: seqid, dtype: category
## Categories (24, object): [chr1 < chr2 < chr3 < chr4 ... chr21 < chr22 < chrX < chrY]
# geom bar with position "stack"
my_plot = ggplot(tx_info, aes(x='seqid', fill='gene_biotype', label='stat(count)'))
my_plot += theme(axis_text_x = element_text(angle = 45, hjust=1),
strip_text_x = element_text(size = 7, colour = 'black', angle = 90),
legend_position='top')
print(my_plot + geom_bar(color="white", position='stack'))
## <ggplot: (-9223372029278027797)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## <ggplot: (7573341802)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
## <ggplot: (7577656247)>
##
## /Users/puthier/miniconda3/envs/pygtftk/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
## max_open_warning, RuntimeWarning)
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Wilkinson, Leland. 2016. The Grammar of Graphics (Statistics and Computing) 2nd Edition. Springer.