Commands from section ‘ologram’¶
In the examples of this section, we will need the following example files:
$ gtftk get_example -q -d simple -f '*'
$ gtftk get_example -q -d mini_real -f '*'
$ gtftk get_example -q -d hg38_chr1 -f '*'
$ gtftk get_example -q -d ologram_1 -f '*'
$ gtftk get_example -q -d simple_07 -f '*'
$ gtftk get_example -q -d ologram_2 -f '*'
For more information about OLOGRAM and OLOGRAM-MODL, please see the appropriately titled papers in the Citing section.
More examples can be found at <https://github.com/qferre/ologram_supp_mat> and, more recently, at <https://github.com/qferre/ologram-modl_supp_mat>. These contain example Snakemake workflows that can be reused, or from which commands can be extracted.
Most of the commands presented in this section are demonstrated in the ologram-modl_supp_mat repository, along with certain perspectives.
Note for contributors: To contribute to OLOGRAM, begin at pygtftk/plugins/ologram.py and unwrap function calls from there to get a sense of how they interact. We have detailed comments to explain the role of every function. A detailed table with the role of each file is presented at the end of this document.
ologram¶
Description: OLOGRAM – OverLap Of Genomic Regions Analysis using Monte Carlo. OLOGRAM annotates peaks (in BED format) using (i) genomic features extracted from a GTF file (e.g. promoter, tts, gene body, UTR…), (ii) genomic regions tagged with particular keys/values in a GTF file (e.g. gene_biotype “protein_coding”, gene_biotype “LncRNA”…), or (iii) a BED file (e.g. user-defined regions). Each pair (peak file, region set) is randomly shuffled across the genome (inter-region lengths are preserved). The probability of intersection under the null hypothesis (the peaks and this feature are independent) is then deduced from this Monte Carlo approach. The program returns statistics for both the number of intersections and the total length (in basepairs) of all intersections.
Note
The null hypothesis of the statistical test is:
- H0: The regions of the query (--peak-file) are located independently of the reference (--inputfile or --more-bed) with respect to overlap.
- H1: The regions of the query (--peak-file) tend to overlap the reference (--inputfile or --more-bed).
Warning
The ologram examples below use 8 CPUs. Please adapt the number of threads.
Example: Perform a basic annotation. We are searching whether H3K4me3 peaks tend to be enriched in some specific genomic elements. The bars in the bar plot will be ordered according to ‘summed_bp_overlaps_pvalue’.
$ gtftk ologram -i hg38_chr1.gtf.gz -p ENCFF112BHN_H3K4me3_chr1.bed -c hg38_chr1.genome -u 1500 -d 1500 -D -pf example_pa_01.pdf -k 8 -j summed_bp_overlaps_pvalue
|-- 10:52-WARNING-ologram : Using only 8 threads, but 16 cores are available. Consider changing the --nb-threads parameter.
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
|-- 10:53-WARNING-ologram : Computing log(p-val) for a Neg Binom with mean >= var ; var was set to mean+1 (start_codon)
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
Example: We are now using the gene_biotype key (note that a list of keys can be provided). This will tell us whether H3K4me3 tends to be located in particular transcripts (protein coding, LncRNAs…). The --no-basic-feature argument tells ologram not to test basic genomic elements (genes, transcripts…).
$ gtftk select_by_key -i mini_real.gtf.gz -k gene_biotype -v protein_coding,lincRNA,antisense,processed_transcript | gtftk ologram -m gene_biotype -p ENCFF112BHN_H3K4me3_K562_sub.bed -c hg38 -D -n -pf example_pa_02.pdf -k 8 -j summed_bp_overlaps_pvalue
|-- 10:53-WARNING-ologram : Using only 8 threads, but 16 cores are available. Consider changing the --nb-threads parameter.
Warning
It may be important to consider the quality of the fit, which is an indicator of the reliability of the p-value. This value is available in the tsv table produced by ologram. The fit quality may also be displayed on the diagram using the -y/--display-fit-quality argument.
Example: A more complex example where the key is created on the fly. Expression data are loaded as a novel key using the join_attr command and associated with gene features. This novel key (exprs) is then discretized to create 6 classes of genes with increasing expression (based on percentiles, -p), which are tested for enrichment in H3K36me3.
$ gtftk join_attr -i mini_real.gtf.gz -H -j mini_real_counts_ENCFF630HEX.tsv -k gene_name -n exprs -t exon | gtftk discretize_key -k exprs -p -d exprs_class -n 6 -u | gtftk ologram -p ENCFF119BYM_H3K36me3_K562_sub.bed -c hg38 -D -n -m exprs_class -pf example_pa_03.pdf -k 8 -j summed_bp_overlaps_pvalue
|-- 10:54-WARNING-ologram : Using only 8 threads, but 16 cores are available. Consider changing the --nb-threads parameter.
|-- 10:54-INFO-discretize_key : Categories: ['[0.0_183.0]', '(183.0_549.0]', '(549.0_1018.0]', '(1018.0_1631.0]', '(1631.0_3139.0]', '(3139.0_41703.0]']
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:225: RuntimeWarning: invalid value encountered in sqrt
/Users/puthier/anaconda3/envs/pygtftk_py3.9_dev/lib/python3.9/site-packages/pygtftk-1.6.2-py3.9-macosx-10.9-x86_64.egg/pygtftk/stats/beta.py:237: RuntimeWarning: invalid value encountered in sqrt
Example: Using the add_exon_nb command, we add the exon number transcript-wise (numbering from 5’ to 3’) and discretize this novel key into 5 classes, which are tested for enrichment.
$ gtftk add_exon_nb -k exon_nbr -i mini_real.gtf.gz | gtftk discretize_key -p -d exon_nbr_cat -n 5 -k exon_nbr | gtftk ologram -p ENCFF112BHN_H3K4me3_K562_sub.bed -c hg38 -D -n -m exon_nbr_cat -pf example_pa_04.pdf -k 8 -j summed_bp_overlaps_pvalue
|-- 10:55-WARNING-ologram : Using only 8 threads, but 16 cores are available. Consider changing the --nb-threads parameter.
|-- 10:55-INFO-discretize_key : Categories: ['[1.0_2.0]', '(2.0_4.0]', '(4.0_6.0]', '(6.0_12.0]', '(12.0_107.0]']
Example: When not supplying a GTF, you can use --more-bed. The following example will look for pairwise enrichment of the input file (-p, here query.bed) with each of the regions defined in --more-bed: query with A.bed, then query with B.bed, then query with C.bed.
gtftk ologram -ms 40 -mn 10 -p query.bed --more-bed A.bed B.bed C.bed -z -c hg38 -V 3 --force-chrom-peak --force-chrom-more-bed
ologram (multiple overlaps)¶
It is also possible to use the OLOGRAM-MODL (Multiple Overlap Dictionary Learning) plugin to find enrichment in multiple overlaps (ie. between n>=2 sets, such as Query+A+B, Query+A+C, …), in order to highlight combinations of genomic regions, such as Transcriptional Regulator complexes.
Note
The null hypothesis of the statistical test is:
- H0: Considering the genomic regions in the query set (--peak-file) and in the reference sets (--more-bed), the regions in one set are located independently of the regions in any other set. They are not assumed to be uniformly distributed; inter-region lengths are kept.
This analysis is performed only on custom regions supplied as BED files with the --more-bed argument. In most cases you may use the --no-gtf argument and only pass the regions of interest.
For statistical reasons, we recommend shuffling across a relevant subsection of the genome only (ie. enhancers only) using --bed-excl or --bed-incl. This ensures the longer combinations have a reasonable chance of being randomly encountered in the shuffles. Conversely, if you do not filter the combinations, keep in mind that the longer ones may be enriched even though they are present only on a few base pairs, because at random they would be even rarer. As such, we recommend focusing comparisons on combinations of similar order (number of sets).
Exact combinations: By default, OLOGRAM will compute “inexact” combinations, meaning that an overlap of [Query + A + B + C] will still count as an observation of [Query + A + B + …] (meaning “Query + A + B plus any other set”). For exact intersections (ie. [Query + A + B + nothing else]), add the --exact flag. You will know the combinations were computed as inexact by the ‘…’ in their names in the result file.
In any case, only intersections with the query are counted. ie. Query+A+B is counted, but A+B+C is not.
With inexact combinations, if A+B is very enriched and C is depleted, A+B+C will still appear enriched; it is more informative to look at C’s contribution to the enrichment. Relatedly, longer combinations are usually more enriched since they involve more theoretically independent sets, so you should compare the enrichments of combinations of similar order (number of sets in the combination), as longer combinations tend to be more enriched under H0.
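To make the counting rule concrete, here is a minimal, illustrative Python sketch (not OLOGRAM’s internal code) showing how a single combination is counted inexactly versus exactly on a toy flag matrix:
import numpy as np
# Illustrative only -- not OLOGRAM's internal counting code.
# One row per overlapping region, one column per set: [Query, A, B, C]
observed_flags = np.array([[1, 1, 1, 0],   # Query + A + B
                           [1, 1, 1, 1],   # Query + A + B + C
                           [1, 0, 1, 1]])  # Query + B + C
combi = np.array([1, 1, 1, 0])  # the combination [Query + A + B]
# Inexact counting ([Query + A + B + ...]): every set of the combination must be present,
# other sets may or may not be present.
inexact_hits = np.all(observed_flags[:, combi == 1] == 1, axis=1)
# Exact counting ([Query + A + B + nothing else]): the row must match the combination exactly.
exact_hits = np.all(observed_flags == combi, axis=1)
print(inexact_hits.sum())  # 2 regions
print(exact_hits.sum())    # 1 region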
Simple example:
Comparing the query (-p) against two other BED files, analyzing multiple overlaps.
$ gtftk ologram -z -w -q -c simple_07.chromInfo -p simple_07_peaks.bed --more-bed simple_07_peaks.1.bed simple_07_peaks.2.bed --more-bed-multiple-overlap
|-- 10:55-WARNING : Converting to bed6 format (simple_07_peaks.bed).
|-- 10:55-WARNING : Converting to bed6 format (simple_07_peaks.1.bed).
|-- 10:55-WARNING : Converting to bed6 format (simple_07_peaks.2.bed).
|-- 10:55-WARNING-ologram : Using only 1 threads, but 16 cores are available. Consider changing the --nb-threads parameter.
|-- 10:55-WARNING-ologram : --more-bed-labels was not set, automatically defaulting to --more-bed file names.
|-- 10:55-WARNING-ologram : [Query + simple_07_peaks_1 + ... ]: there may be a poor fit for this feature. Check fit quality in the results. This is likely due to there being too few regions.
|-- 10:55-WARNING-ologram : [Query + simple_07_peaks_1 + simple_07_peaks_2 + ... ]: there may be a poor fit for this feature. Check fit quality in the results. This is likely due to there being too few regions.
|-- 10:55-WARNING-ologram : Computing log(p-val) for a Neg Binom with mean >= var ; var was set to mean+1 ([Query + simple_07_peaks_1 + simple_07_peaks_2 + ... ])
Detailed example:
gtftk ologram -z -c simple_07.chromInfo -p simple_07_peaks.bed # The query (-p) is the file to compare against.
--more-bed simple_07_peaks.1.bed simple_07_peaks.2.bed # List of BED files giving the region sets to compare with. TIP: You can use --more-bed `ls -d ./data/*` if all your files are in the 'data' subdirectory
-o results --force-chrom-peak --force-chrom-more-bed
-V 3 -k 8 -mn 10 -ms 10 # Verbosity, threads, number and size of minibatches
--more-bed-multiple-overlap # Toggle the computation of multiple overlaps on the --more-bed
--exact # OPTIONAL ARGUMENT. If present, an observation of A+B+C will not count as an observation of A+B.
--multiple-overlap-max-number-of-combinations 10 # OPTIONAL ARGUMENT. Use MODL to restrict to this many combinations.
--multiple-overlap-target-combi-size 3 # OPTIONAL ARGUMENT. Combis mined longer than this size will not be shown.
--multiple-overlap-custom-combis test_combis.txt # OPTIONAL ARGUMENT. Will bypass the selection by the previous two arguments and work only on the combinations defined in this file.
--keep-intact-in-shuffling 0,1 # BETA - OPTIONAL ARGUMENT. Gives the positions of the files in --more-bed that will be kept fixed in shuffling.
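As an illustration, a hypothetical test_combis.txt for the command above could restrict the analysis to [Query + simple_07_peaks_1] and [Query + simple_07_peaks_1 + simple_07_peaks_2]. Following the format described in the notes of the help below (one combination per line, whitespace-separated flags: query first, then the --more-bed files in order), it would contain:
1 1 0
1 1 1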
See the result of gtftk ologram -h below for more detailed information about the arguments’ formats.
As the computation of multiple overlaps can be RAM-intensive, if you have a very large number of candidate genomic feature sets (hundreds), we recommend first narrowing them down by running a pairwise analysis.
MODL itemset mining algorithm: By default, OLOGRAM-MODL will compute the enrichment of all n-wise combinations that are encountered in the real data it was passed. This, however, can add up to 2**N combinations and make the result hard to read. Furthermore, noise is a real problem in biological data and can obscure the relevant combinations. As such, we also give the option to use a custom itemset mining algorithm on the true overlaps to identify interesting combinations. Another possibility is to instead manually pass a text file containing the combinations you want to study.
Itemset mining details¶
In broad strokes, the custom itemset mining algorithm MODL (Multiple Overlap Dictionary Learning) performs many matrix factorizations on the matrix of true overlaps to identify relevant correlation groups of genomic regions. Then a greedy algorithm, based on how much each word improves the reconstruction, selects the best words. MODL is only used to filter the output of OLOGRAM: once it returns a list of interesting combinations, OLOGRAM will compute their enrichment as usual, but for them only. Each combination is of the form [Query + A + B + C] where A, B and C are BED files given as --more-bed. You can also manually specify the combinations to be studied, with the format defined in the OLOGRAM notes (below).
Unlike classical association rule mining algorithms, this focuses on mining relevant biological complexes/clusters and correlation groups (item sets). As such, we do not recommend asking for more than 20-50 combinations, to keep the running time reasonable and the returned combinations relevant.
As a matrix factorization based algorithm, it is designed to be resistant to noise, which is a known problem in biological data. Its goal is to extract meaningful frequent combinations from noisy data. As a side effect, however, it is biased in favor of the most abundant combinations in the data, and may return correlation groups if you ask for too few words (ie. if AB, BC and AC are complexes, ABC might be returned).
This itemset mining algorithm is a work in progress, and optional. Whether you use MODL will not change the results for each combination; it only changes which combinations are displayed. If you want the enrichment of all combinations, ignore it. To use MODL, use the --multiple-overlap-max-number-of-combinations argument.
MODL is mostly needed when the list of --more-bed is very long and you do not want to filter the results manually, and when you are working with noisy data which could obfuscate the interesting combinations. It is also possible to bypass it and provide a custom list of combinations to be considered.
MODL algorithm API: MODL can also be used independently as a combination mining algorithm.
This can work on any type of data, biological or not, that respects the conventional formatting for lists of transactions: the data needs to be a matrix with one line per transaction and one column per element. For example, if you have three possible elements A, B and C, a line of [1,0,1] means a transaction containing A and C.
For a factor allowance of k and n final queried words, the matrix will be rebuilt with k*n words in step 1. MODL will discard combinations rarer than 1/10000 occurrences to reduce computing times. It will also reduce the abundance of each unique line in the matrix to its square root, to reduce the emphasis on the most frequent elements. However, the latter can magnify the impact of noise as well, and can be disabled when using the manual API. To de-emphasize longer words, which can help in this case, words are normalized by their summed squares in step 2.
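As an illustration of these two normalizations, here is a simplified sketch (not MODL’s exact implementation) on a small transaction matrix:
import numpy as np
# Simplified illustration -- not MODL's actual implementation.
# Toy transaction matrix: one line per transaction, one column per element (A, B, C)
flags = np.array([[1, 0, 1]] * 9 + [[1, 1, 0]] * 4 + [[0, 1, 1]])
# "Smothering": reduce the abundance of each unique line to (roughly) its square root
lines, counts = np.unique(flags, axis=0, return_counts=True)
smothered = np.repeat(lines, np.ceil(np.sqrt(counts)).astype(int), axis=0)
# The [1, 0, 1] line now appears 3 times instead of 9, de-emphasizing the most frequent pattern
# Step 2 word normalization: divide each candidate word (dictionary atom) by its summed square,
# which de-emphasizes longer words
dictionary = np.array([[0.9, 0.0, 0.8],
                       [0.7, 0.6, 0.5]])
normalized_words = dictionary / (dictionary ** 2).sum(axis=1, keepdims=True)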
If you are passing a custom error function, it must have this signature: error_function(X_true, X_rebuilt, encoded, dictionary). X_true is the real data and X_rebuilt is the reconstruction to evaluate. encoded is the encoded version (U), which in our case is used to assess sparsity, while dictionary (V) is the matrix with one atom of the dictionary per row (not used by default). Note that the dictionary is passed before MODL performs any normalization on it. All are NumPy matrices.
Note
An example of custom loss we recommend is: selecting the combinations (of reference sets) that best predict the query set using a Naive Bayes classifier. This is not yet implemented, but a fully functional example is available at <https://github.com/qferre/ologram-modl_supp_mat/blob/master/scripts/modl_perspective.py> as a Python script. To use it, simply replace the filepaths at the beginning with the paths to your own files, and run the script. The order in the selection will be the same as the order you gave in the script, not alphabetical. You can then run OLOGRAM without MODL, or pass the custom selection you just computed.
For more details, see code comments.
Here is an example:
import numpy as np
from pygtftk.stats.intersect.modl.dict_learning import Modl, test_data_for_modl
from pygtftk import utils
utils.VERBOSITY = 2 # Ensure DEBUG messages are shown
# Generate a toy transaction matrix with known correlation groups
flags_matrix = test_data_for_modl(nflags = 1000, number_of_sets = 6, noise = 0.1, cor_groups = [(0,1),(0,1,2,3),(4,5)])
# Simple example of a custom error function
def my_error_function(X_true, X_rebuilt, encoded, dictionary):
    return np.sum(X_rebuilt - X_true)
combi_miner = Modl(flags_matrix,
multiple_overlap_target_combi_size = -1, # Limit the size of the combinations
multiple_overlap_max_number_of_combinations = 3, # How many words to find ?
nb_threads = 1,
step_1_factor_allowance = 2, # (Defaults to 2) How many words to ask for in each step 1 rebuilding, as a multiplier of multiple_overlap_max_number_of_combinations.
error_function = None, # (OPTIONAL) Custom error function in step 2
smother = True, # (Defaults to True) Should the smothering (quadratic reduction of abundance) be applied ?
normalize_words = True, # (Defaults to True) Normalize words by their summed squared in step 2 ?
step_2_alpha = None, # (OPTIONAL) Override the alpha (sparsity control) used in step 2.
discretization_threshold = 0 # (Defaults to 1) Discretization threshold D : in each atom, elements below D*maximum_for_this_atom will be discarded.
step_1_alphas = None) # (OPTIONAL) Override the list of alphas used in step 1 (should be a list)
interesting_combis = combi_miner.find_interesting_combinations()
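If you then want to pass the mined combinations back to OLOGRAM through --multiple-overlap-custom-combis, a small follow-up sketch could write them to a text file in the expected whitespace-separated format. This assumes interesting_combis is an iterable of 0/1 flag vectors, query first; check what your pygtftk version actually returns.
# Assumes interesting_combis is an iterable of 0/1 flag vectors (check your pygtftk version)
with open("selected_combis.txt", "w") as out:
    for combi in interesting_combis:
        out.write(" ".join(str(int(flag)) for flag in combi) + "\n")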
For more details about usage and implementation, please read the notes below.
Arguments:
$ gtftk ologram -h
Usage: gtftk ologram [-i GTF] [-c TXT] -p BED [-b [more_bed ...]] [-l more_bed_labels] [-kis keep_intact_in_shuffling] [-e BED] [-bi BED] [-u upstream] [-d downstream] [-m more_keys] [-n] [-mo] [-mocs multiple_overlap_target_combi_size] [-monc multiple_overlap_max_number_of_combinations] [-moc multiple_overlap_custom_combis] [-ex] [-k nb_threads] [-s seed] [-mn minibatch_nb] [-ms minibatch_size] [-ma] [-o DIR] [-pw pdf_width] [-ph pdf_height] [-pf pdf_file_alt] [-x] [-y] [-tp tsv_file_path] [-r] [-j {None,nb_intersections_expectation_shuffled,nb_intersections_variance_shuffled,nb_intersections_negbinom_fit_quality,nb_intersections_log2_fold_change,nb_intersections_true,nb_intersections_pvalue,summed_bp_overlaps_expectation_shuffled,summed_bp_overlaps_variance_shuffled,summed_bp_overlaps_negbinom_fit_quality,summed_bp_overlaps_log2_fold_change,summed_bp_overlaps_true,summed_bp_overlaps_pvalue}]
[-a {None,nb_intersections_expectation_shuffled,nb_intersections_variance_shuffled,nb_intersections_negbinom_fit_quality,nb_intersections_log2_fold_change,nb_intersections_true,nb_intersections_pvalue,summed_bp_overlaps_expectation_shuffled,summed_bp_overlaps_variance_shuffled,summed_bp_overlaps_negbinom_fit_quality,summed_bp_overlaps_log2_fold_change,summed_bp_overlaps_true,summed_bp_overlaps_pvalue}] [-g pval_threshold] [-z] [-f] [-w] [-q] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
OLOGRAM -- OverLap Of Genomic Regions Analysis using Monte Carlo.
Ologram annotates peaks (in bed format) using (i) genomic features extracted from a GTF file (e.g.
promoter, gene body, UTR...), (ii) genomic regions tagged with particular keys/values in a GTF
(e.g. gene_biotype "protein_coding", gene_biotype "LncRNA", ...) or (iii) from a BED file
(user-defined regions).
Each set of regions is randomly shuffled independently across the chromosomes (inter-region
lengths are considered). Then the probability of intersection under the null hypothesis is
deduced with our Monte Carlo approach.
The null hypothesis is that the regions of the query (--peak-file) are located independently of
the reference (--inputfile or --more-bed) and of each other, and do not overlap more than by
chance. We return statistics for both the number of intersections and the total lengths (in
basepairs) of all intersections.
For more information, please see the full documentation.
OLOGRAM can also calculate enrichment for n-wise combinations (e.g. [Query + A + B] or [Query +
B + C]) on sets of regions defined by the user (--more-bed argument). Here is an example of
command line to compute the enrichments of the overlaps of the sets given in --more-bed, with
the query set (-p) and with each other:
`gtftk ologram -z -w -q -c hg38 -p query.bed --more-bed A.bed B.bed C.bed --more-bed-multiple-
overlap`
Author : Quentin FERRE <quentin.q.ferre@gmail.com>
Co-authors : Guillaume CHARBONNIER <guillaume.charbonnier@outlook.com> and Denis PUTHIER
<denis.puthier@univ-amu.fr>.
Notes:
* chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding size of conventional chromosomes are used. To get the size of the
chromosome in ensembl format (without chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
* OLOGRAM is multithreaded, notably processing one batch of shuffles per core. This can be
RAM-intensive. If needed, use more minibatches and/or merge them with the ologram_merge_runs
command (not to be confused with ologram_merge_stats, which is simply a visual plugin).
* The program produces a pdf file and a tsv file ('_stats_') containing intersection
statistics for the shuffled BEDs under H0, giving the number of intersections (N) and total
number of overlapping base pairs (S). It gives for N and S mean and standard deviation (error
bars) in the shuffles compared to the actual values, as well as the p-value. It also gives the
goodness of fit of each statistic under (H0).
* You can exclude regions from the shuffling with --bed-incl. This is done by shuffling
across a concatenated "sub-genome" obtained by removing the excluded regions. The same ones
will be excluded from the peak file and the GTF/more-bed files.
* Use --no-gtf if you want to perform enrichment analysis on custom, focused annotations
only (--more-bed). Relatedly, use --no-basic-feature if you want to perform stats on GTF
keys (--more-key) but not on basic features (genes, transcripts, exons...).
* Support for multiple overlaps is available. If the --more-bed-multiple-overlap argument is
used, the query peak file will be compared with the custom regions passed to the --more-bed
argument, *and with them only*. For example, you can put as query the binding sites of the
Transcription Factor A, in --more-bed the factors B, C and D, and see whether A+B+D is an
enriched combination.
* By default, multiple overlap intersections are counted as "inexact", meaning an overlap of
[A + B + C] will be counted when looking for [A + B + ...]. Add the --exact argument to change
that.
* If you work on multiple overlaps, we recommend the ologram_modl_treeify plugin for
visualizations.
* Combinations of longer order (containing more sets) will usually be rarer and have lower
p-values; as such we recommend comparing p-values between combinations of the same order.
* P-values of -1 or NaN mean the Neg. Binom. fitting was poor, but that does not mean they
must always be discarded: in practice, this mostly happens for high order combinations which
are so unlikely that they were not encountered in the shuffles even once. In this case, this
would represent a very large enrichment! Tangentially, other potential causes of poor fit are:
the combination is too rare (too few/small regions) and is rarely encountered in the shuffles,
there are too few regions in the set (<200) or the shuffling was restricted to a too small
region and the variance is lower than mean.
* The Negative Binomial is an approximation, but differences only appear for very low
p-values, and even then order is conserved (if one feature is more enriched than another,
their p-values will be in the correct order, just slightly inflated). An ad-hoc beta fitted
p-value has been added instead if you wish, but it will only be more accurate than Neg Binom
if you do >10,000 shuffles at least. Empirical p-val is also accessible.
* Our model rests upon certain assumptions. The null hypothesis can be rejected if any
assumption is rejected. The fitting test is the key for that: if the fitting was good, we can
assume the combination is indeed enriched. Admittedly, the fitting test does not test the
tails, but shows if the general shape is close enough.
* Relatedly, in the output combinations are sorted by their true number of base pairs by
default, since combinations that are very rare even in the true data will likely have high
enrichment, but be less representative.
* Furthermore, you may use our MODL algorithm to find biological complexes of interest on the
intersections on the true data. This is done with the --multiple-overlap-max-number-of-
combinations argument. This will not change the enrichment result, but will restrict the
displayed combinations.
* If you manually specify the combinations to be studied with --multiple-overlap-custom-
combis, use the following format for the text file: The order is the same as --more-beds (ie.
if --more-bed is "A.bed B.bed C.bed", "1 0 1 1" means Query + B + C). Elements should be
whitespace separated, with one combination per line.
Arguments:
-i, --inputfile Path to the GTF file. Defaults to STDIN (default: <stdin>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1, sizes as column 2 (default: None)
-p, --peak-file The file containing the peaks/regions to be annotated. (bed format). (default: None)
-b, --more-bed A list of bed files to be considered as additional genomic annotations. (default: None)
-l, --more-bed-labels A comma separated list of labels (see --more-bed). Optional. (default: None)
-kis, --keep-intact-in-shuffling BETA - A comma separated list of sets number (starts at 0) to be kept intact (fixed) when shuffling, in the order given in --more-bed. (default: None)
-e, --bed-excl Exclusion file. The chromosomes will be shortened by this much for the shuffles of peaks and features. (bed format). (default: None)
-bi, --bed-incl Opposite of --bed-excl, will perform the same operation but keep only those regions. (default: None)
-u, --upstream Extend the TSS and TTS of in 5' by a given value. (default: 1000)
-d, --downstream Extend the TSS and TTS of in 3' by a given value. (default: 1000)
-m, --more-keys A comma separated list of key used for labeling the genome. See Notes. (default: None)
-n, --no-basic-feature No statistics for basic features of GTF. Concentrates on --more-bed and --more-keys. (default: False)
-mo, --more-bed-multiple-overlap The more-beds specified will be considered all at once for multiple overlaps. (default: False)
-mocs, --multiple-overlap-target-combi-size
Maximum number of sets in the output combinations. Default to -1 meaning no max number. (default: -1)
-monc, --multiple-overlap-max-number-of-combinations
Maximum number of combinations to consider by applying the MODL algorithm to the matrix of full overlaps. Defaults to -1, which means MODL is NOT applied and all combinations are returned (default: -1)
-moc, --multiple-overlap-custom-combis
Path to a text ('*.txt') file that will be read as a NumPy matrix, overriding the combinations to be studied. See notes for the format of the text file. (default: None)
-ex, --exact Determines whether an observations of A+B+C counts as an observation for A+B. If False (default), this does count. If True (the argument is present in the command line), we consider only exact matches. (default: False)
-k, --nb-threads Number of threads for multiprocessing. (default: 1)
-s, --seed Numpy random seed. (default: 42)
-mn, --minibatch-nb Number of minibatches of shuffles. (default: 10)
-ms, --minibatch-size Size of each minibatch, in number of shuffles. (default: 20)
-ma, --use-markov-shuffling Whether to use Markov model realisations (order 2) instead of independant shuffles. See notes. (default: False)
-o, --outputdir Output directory name. (default: ologram_output)
-pw, --pdf-width Output pdf file width (inches). (default: None)
-ph, --pdf-height Output pdf file height (inches). (default: None)
-pf, --pdf-file-alt Provide an alternative path for the main image. (default: None)
-x, --no-pdf Do not produce any image file. (default: False)
-y, --display-fit-quality Display the negative binomial fit quality on the diagrams. Also draws temporary file histograms for each combination (may take some time). (default: False)
-tp, --tsv-file-path Provide an alternative path for text output file. (default: None)
-r, --coord-flip The horizontal axis becomes vertical, and vertical becomes horizontal. (default: False)
-j, --sort-features Whether to sort features in diagrams according to a computed statistic. Default to sorting by total number of basepairs for this combination in the true data. (default: summed_bp_overlaps_true)
-a, --hide-undef Do not display combinations if this column has undef value (typically summed_bp_overlaps_pvalue). (default: None)
-g, --pval-threshold Hide combinations for which summed_bp_overlaps_pvalue is not lower or equal to --pval-threshold. (default: None)
-z, --no-gtf No GTF file is provided as input. (default: False)
-f, --force-chrom-gtf Discard silently, from GTF, genes outside chromosomes defined in --chrom-info. (default: False)
-w, --force-chrom-peak Discard silently, from --peak-file, peaks outside chromosomes defined in --chrom-info. (default: False)
-q, --force-chrom-more-bed Discard silently, from --more-bed files, regions outside chromosomes defined in --chrom-info. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
Manual intersection computing: To manually compute an overlap matrix between any number of BED files, the following Python code can be used.
import pybedtools
import numpy as np
# exclude_concatenate is provided by the read_bed_as_list module (see the file table below)
from pygtftk.stats.intersect.read_bed import read_bed_as_list as read_bed
from pygtftk.stats.intersect.overlap_stats_compute import compute_true_intersection
# Some example paths
QUERY_PATH = "./input/query.bed"
MORE_BED_PATHS = ["./input/A.bed", "./input/B.bed", "./input/C.bed"]
EXCL_PATH = "./exclusion.bed"
# Register the BED files as pybedtools.BedTool objects
bedA = pybedtools.BedTool(QUERY_PATH)
bedsB = [pybedtools.BedTool(bedfilepath).sort().merge() for bedfilepath in MORE_BED_PATHS] # Sort and merge for the bedsB
# OPTIONAL - Exclude some regions from the BEDs
bed_excl = pybedtools.BedTool(EXCL_PATH)
bedA = read_bed.exclude_concatenate(bedA, bed_excl)
bedsB = [read_bed.exclude_concatenate(bedB, bed_excl) for bedB in bedsB]
# Use our custom intersection computing algorithm to get the matrix of overlaps
true_intersection = compute_true_intersection(bedA, bedsB)
flags_matrix = np.array([i[3] for i in true_intersection])
# If desired, run MODL or any other algorithm on this
my_algorithm.process(flags_matrix)
# See code block above for a MODL example
The resulting flags_matrix is a NumPy array that can be edited, and on which MODL can be run. It is also possible to run any itemset miner you wish on this matrix. An implementation of apriori is provided in the pygtftk.stats.intersect.modl.apriori.Apriori class.
Note that, by definition, this intersection matrix only contains regions where at least two sets overlap; regions where a single set was open will not be present. If you want a matrix of all contiguous elements where at least one set is open, and not just intersections, you may instead use as “query” (bedA) a BED file covering all the chromosomes of the genome (e.g. if your genome has only 2 chromosomes of length 100 each, this “query” file would contain “chr1 0 100” and “chr2 0 100”). To get predictable, length-based binning in the final matrix instead of one line per intersection, you may also subdivide each fake “query” region into bins of, say, 10 bp (“chr1 0 10”, “chr1 10 20”, …).
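As a sketch of that approach (with hypothetical file names), the following builds such a binned whole-genome “query” BED from a chromosome size file of the same two-column format as --chrom-info:
# Hypothetical file names -- adapt to your own paths
BIN_SIZE = 10
with open("my_chrom_sizes.genome") as chrom_sizes, open("binned_query.bed", "w") as out:
    for line in chrom_sizes:
        chrom, size = line.split()[0], int(line.split()[1])
        # Emit consecutive BIN_SIZE bp bins covering the whole chromosome
        for start in range(0, size, BIN_SIZE):
            out.write(chrom + "\t" + str(start) + "\t" + str(min(start + BIN_SIZE, size)) + "\n")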
Since the results of MODL only depend on the true intersections and not on the shuffles, you can also run MODL with 1 shuffle, or on a manually computed matrix as above, to pre-select interesting combinations, and then run the full analysis with many shuffles. We then recommend selecting the combinations that interest you in the resulting tsv file, using MODL’s selection as a starting point and adding or removing combinations based on your own needs (eg. adding all the highest fold changes, or all combinations containing the Transcription Factor X that you are studying).
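For example, a hedged pandas sketch for such a manual selection on the stats tsv: the p-value column name comes from the -j/-a choices above, while the column holding the combination name is assumed here to be ‘feature_type’ (check the header of your own file), and the factor name is hypothetical.
import pandas as pd
stats = pd.read_csv("00_ologram_stats.tsv", sep="\t")
# Keep significantly enriched combinations containing a (hypothetical) factor of interest.
# 'feature_type' is an assumed column name -- adapt it to your tsv header.
selection = stats[(stats["summed_bp_overlaps_pvalue"] <= 0.05)
                  & (stats["feature_type"].str.contains("my_factor_of_interest"))]
selection.to_csv("selected_combis_stats.tsv", sep="\t", index=False)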
ologram_merge_stats¶
Description: Several tsv files resulting from OLOGRAM analyses can be merged into a single diagram report using the ologram_merge_stats command.
Example: For this example we will use the results obtained for 3 epigenetic marks on human chromosome 1.
$ gtftk ologram_merge_stats H3K4me3_ologram_stats.tsv H3K36me3_ologram_stats.tsv H3K79me2_ologram_stats.tsv -o merge_ologram_stats_01.pdf --labels H3K4me3,H3K36me3,H3K79me2
This also works with OLOGRAM-MODL results, since they follow the same basic format of one element/combination per line.
Cases without a p-value diamond mean it was NaN, which usually means the combination was too rare to be encountered in the shuffles.
An example use case for this tool is to compare different cell lines, or to slop (extend) your query regions by different lengths and compare the enrichments, to find at which distance from each other several sets lie on average.
Arguments:
$ gtftk ologram_merge_stats -h
Usage: gtftk ologram_merge_stats [-pw pdf_width] [-ph pdf_height] -o output [-l labels] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file] inputfiles [inputfiles ...]
Description:
Merge a set of OLOGRAM outputs into a single output. Build a heatmap from the results.
See the /pygtftk/plugins/ologram.py file, as well as the documentation, for more information about
OLOGRAM.
Notes:
* By default, labels in the diagram are derived from the name of the enclosing folder. E.g. if
file is a/b/c/00_ologram_stats.tsv, 'c' will be used as label.
* Otherwise use --labels to set the labels.
* Squares without a diamond mean the p-value was NaN due to poor fitting. This is mostly the
case for higher-order combis in multiple overlaps that were so rare that they are not
encountered in the shuffles.
Arguments:
inputfiles Complete paths to the OLOGRAM output files
-pw, --pdf-width Output pdf file width (inches). (default: None)
-ph, --pdf-height Output pdf file height (inches). (default: None)
-o, --output Pdf file name. (default: None)
-l, --labels A comma separated list of labels. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
ologram_modl_treeify¶
Description: Visualize n-wise enrichment results (OLOGRAM-MODL) as a tree of combinations. Works on the result (tsv file) of an OLOGRAM analysis called with --more-bed-multiple-overlap. On the graph, S designates the total number of basepairs over which this combination is encountered in the real data. Fold change gives the ratio with the number of basepairs in the shuffles, along with the associated Negative Binomial p-value.
This recommended representation is useful to find master regulators, by showing which additions to a combination increase its enrichment, and by allowing one to see whether overlaps that contain the element X also contain the element Y (by looking at how much a child combination accounts for the S of its parent in inexact counting).
P-values of NaN (-1 in the original tsv) are due to poor fitting. They are mostly present in high-order combinations that were so rare they were never encountered in the shuffles. We also recommend discarding the rarest combinations, found on such a small number of basepairs that they are unlikely to be biologically significant. This is mostly relevant when you have many sets (k >= 5), since longer combinations will often be enriched through sheer unlikelihood. To that effect, there is a parameter to display only the combinations with the highest S.
The tsv result file can be edited before passing it to the command, for example by keeping only the combinations you are interested in. You can either (1) run OLOGRAM-MODL with no filtering and get a tree of all combinations, (2) use MODL to get a pre-selection that can be tailored, or (3) take the run with all combinations from option (1) and use the -t argument to keep only the most frequent ones.
$ gtftk ologram_modl_treeify -i multiple_overlap_trivial_ologram_stats.tsv -o treeified.pdf -l ThisWasTheNameOfTheQuery
$ gtftk ologram_modl_treeify -h
Usage: gtftk ologram_modl_treeify -i inputfile -o output [-l query_label] [-t top_s] [-mh min_inheritance] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Turns a result of OLOGRAM-MODL multiple overlap (tsv file) in a tree for easier visualisation
See the /pygtftk/plugins/ologram.py file, as well as the documentation, for more information about
OLOGRAM.
Notes:
* Turns a result of OLOGRAM-MODL multiple overlap (tsv file) in a tree for easier
visualisation.
* This is the preferred representation for OLOGRAM-MODL results. Each node represents a
combination, with its number of overlapping basepairs in true data (S) and the corresponding
fold change and p-value compared to the shuffles.
* Result tsv files can be manually edited (ie. removing combinations) before passing them to
this plugin
* For a quick filtering, it is possible to show only the top T combinations sorted by total
basepairs in real data.
Arguments:
-i, --inputfile Complete path to the OLOGRAM output file (default: None)
-o, --output Pdf file name (default: None)
-l, --query-label Name of the query for display (default: Query)
-t, --top-s Optional. Only the top t combinations sorted by total basepairs in real data will be displayed. (default: -1)
-mh, --min_inheritance Optional. A combination will be added to a shorter parent in the tree only if it represents at least a proportion MH of its true base pairs (between 0 and 1). (default: -1)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
ologram_merge_runs¶
Description: Merge several runs of OLOGRAM into a single run, by treating each as a “superbatch” of shuffles.
OLOGRAM remembers all intersections occurring in all minibatches in order to calculate statistics. If you are using a large number of shuffles and/or very large files, this may cost a lot of RAM. In practice, you will seldom need more than 100-200 shuffles. Optionally, however, if you require increased precision, you can run OLOGRAM several times, treat each run as a “batch of batches”, and merge and recalculate statistics on the merged superbatch automatically using this command.
Around 100-200 shuffles is usually enough to robustly fit a Negative Binomial distribution. In terms of precision, a Negative Binomial mean under 1/100 (meaning this combination was not seen at least once in 100 shuffles) would not mean much anyways.
# Make several OLOGRAM runs
N_RUNS=100
for i in $(seq 1 $N_RUNS)
do
    gtftk ologram ... # Replace this with a complete OLOGRAM command
done
# Merge those runs
gtftk ologram_merge_runs --inputfiles `ls ./results/*.tsv` -o ./merged_batches_result.tsv -V 3
Other commands such as ologram_modl_treeify can now be called on the resulting tsv, which respects the OLOGRAM format.
$ gtftk ologram_merge_runs -h
Usage: gtftk ologram_merge_runs [-i inputfiles [inputfiles ...]] [-os ori_shuffles] -o output [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Merge a set of OLOGRAM runs into a single run and recalculates statistics based on it.
This treats each run as a "superbatch". The command takes as input the list of paths of all the
results' TSV you wish to merge. It also takes as input the number of shuffles originally
performed in the individual runs (by default, is assumed to be 200).
Example of command line: gtftk ologram_merge_runs --inputfiles `ls output/ologram_results/*.tsv`
--ori-shuffles 200 -o final_result.tsv
Notes:
* This implicitly assumes you are combining runs with the *same number* of shuffles in each.
* The fit quality for the Neg. Binoms. will be indicated as "-1" since it cannot be evaluated
here.
* On the technical side, statistics are recalculated by conflating the distributions with a
weighting based on the number of runs. See the source code for the precise formula.
Arguments:
-i, --inputfiles Complete paths to the OLOGRAM output text files (default: None)
-os, --ori-shuffles How many shuffles were performed in the individual runs that will be merged? (default: 200)
-o, --output Destination path for the merged output file. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
Notes¶
This section contains more specific notes about the use and interpretation of OLOGRAM.
– The goal of the minibatches is to save RAM. You should increase the number of minibatches, instead of their size.
– If --more-keys is used, additional region sets will be tested based on the associated key values. For example, if --more-keys is set to ‘gene_biotype’ (a key generally found in ensembl GTF files), the regions related to ‘protein_coding’, ‘lncRNA’ or any other value of that key will be retrieved, merged and tested for enrichment.
– For statistical reasons, with multiple sets the expected overlaps for the longer combinations (A+B+C+D+… when they are all independent) can be very low. As a result, longer combinations tend to be more enriched: this should be kept in mind when comparing enrichment values between combinations of different order.
– This is especially true when the total genomic coverage of the sets is low. We recommend instead shuffling only across a biologically relevant subsection of the genome (with --bed-incl): for example, when studying Transcriptional Regulators, shuffling only on inferred Cis Regulatory Modules or promoters.
If the shuffling is restricted to a sub-genome, features outside of it are discarded. In essence, this mostly means switching to a smaller genome. Of course, since the shuffling is done only there, H0 becomes ‘… the features are independent and can only be located on the sub-genome’. This bears mentioning. In practice, this means shuffling only across shortened chromosomes.
Our Negative Binomial model helps alleviate this problem. Even so, if a combination is so rare that it is not encountered even once in the shuffles, it will have a p-value of NaN. Furthermore, if C is depleted with the query but always present with A and B, and A and B are enriched themselves, A+B+C will be enriched.
– BETA - When using --more-bed (and only then), you can give a list of BED files that should be kept fixed during the shuffles, using the --keep-intact-in-shuffling argument.
– RAM will be the biggest limiting factor. While 100 total shuffles should be enough to fit a Negative Binomial distribution in most cases, if needed try running more batches of fewer shuffles instead of the other way around. The alternative is running them independently and merging them afterwards with ologram_merge_runs.
– If you have many (30+) BED files in --more-bed, consider running a pairwise analysis first to divide them into groups of 10-20, and study the multiple overlaps within those groups. This is also more likely to be biologically significant, as for example Transcription Factor complexes usually have 2-8 members.
– We recommend running the ologram_modl_treeify plugin on the resulting tsv file if you use multiple overlaps.
– Our Negative Binomial model is only an approximation of the underlying true distribution, which is likely close to a Beta Binomial. For instance, the Neg. Binom. approximation fails when the sets contain too few regions (fewer than about 1000), and will likely slightly overestimate the p-values in other cases. However, precision is usually good even for very significant p-values, dropping only at very significant levels (<1E-5), so there is only a very small risk of false positives. Also, even if they are overestimated, the order of p-values is unchanged (as a Neg. Binom. is a special case of a Beta), meaning that if combination 1 has a higher Neg. Binom. p-value than combination 2, its true p-value is also likely higher than that of combination 2.
The Neg. Binom. is still the better option, as fitting the proper distribution (approximated as a Beta) is more difficult. As such, an ad-hoc p-value based on the Beta distribution is also given, but it will only be better than the Neg. Binom. for massive numbers of shuffles (thousands). We also added the empirical p-value as a new column (ie. the number of shuffles in which a value as extreme is observed), if you believe the model to be inadequate.
– Our model rests upon certain assumptions (ie. exchangeable variables, sufficient number of regions, etc.). The null hypothesis can be rejected if any assumption is rejected, or merely because the approximation holds only asymptotically. The fitting test is the key here: if, when performing the shuffles, it is found that the distribution of S under our shuffling model does not follow a Neg. Binom., this will be reported. Then, if the hypothesis is rejected (low p-val) but the fit was good, it is reasonable to assume the combination is enriched. Admittedly, the fitting test does not test the tails of the distribution, but it shows whether the general shape is close enough.
OLOGRAM file structure¶
Below is a detailed list of the source code files of OLOGRAM-MODL, with their roles. All paths are relative to the root of the pygtftk Git repository. The “Plugin” group designates plugins that can be called directly from the command line. A file extension of “pyx” designates a Cython file, “py” a Python file, and “cpp” a C++ file.
| Group | File path | Description |
|---|---|---|
| Plugin | pygtftk/plugins/ologram.py | Root file. Parses the arguments and calls the other functions. |
| Utility | docs/source/ologram.rst | Documentation source. |
| Root | pygtftk/stats/intersect/overlap_stats_shuffling.py | Main function. Called directly by the ologram.py plugin. All other function calls descend from this one. |
| Root | pygtftk/stats/intersect/overlap_stats_compute.py | Functions to compute overlap statistics on (shuffled) region sets. |
| Algorithm | pygtftk/stats/intersect/create_shuffles.pyx | Shuffle BED files and generate new “fake” BED files. |
| Algorithm | pygtftk/stats/intersect/overlap/overlap_regions.pyx | Compute the overlaps between two sets of genomic regions, supporting multiple overlaps. |
| Utility | pygtftk/stats/intersect/read_bed/read_bed_as_list.pyx | Turn a BED file into a list of intervals. |
| Utility | pygtftk/stats/intersect/read_bed/exclude.cpp | Exclude certain regions from a set to create concatenated sub-chromosomes. |
| Utility | pygtftk/stats/multiprocessing/multiproc.pyx | Helper functions and structures for multiprocessing. |
| Statistics | pygtftk/stats/negbin_fit.py | Utility functions relative to the negative binomial distribution, including verifying its goodness of fit. |
| MODL | pygtftk/stats/intersect/modl/dict_learning.py | Contains the MODL algorithm, an itemset mining algorithm described in the OLOGRAM-MODL paper (see the Citing section). |
| MODL | pygtftk/stats/intersect/modl/subroutines.py | Subroutines of the MODL algorithm. These are pure functions and can be used independently. |
| Utility | pygtftk/stats/intersect/modl/tree.py | A graph-based representation of combinations of elements. |
| Plugin | pygtftk/plugins/ologram_merge_runs.py | Merges a set of OLOGRAM runs into a single run and recalculates statistics based on it. |
| Plugin | pygtftk/plugins/ologram_merge_stats.py | Merges a set of OLOGRAM outputs calculated on different queries into a single output, preserving labels. Builds a heatmap from the results. |
| Plugin | pygtftk/plugins/ologram_modl_treeify.py | Turns a result of OLOGRAM-MODL multiple overlap (tsv file) into a tree for easier visualisation. |