Commands from section ‘annotation’
The examples in this section use the following example files:
$ gtftk get_example -q -d simple -f '*'
$ gtftk get_example -q -d mini_real -f '*'
$ gtftk get_example -q -d hg38_chr1 -f '*'
closest_genes
Description: Find the n closest genes for each gene.
Example:
$ gtftk closest_genes -i simple.gtf -f
genes closest_genes distances
G0001 G0007 18
G0002 G0010 4
G0003 G0004 4
G0004 G0003 4
G0005 G0006 12
G0006 G0005 12
G0007 G0001 18
G0008 G0002 42
G0009 G0006 21
G0010 G0002 4
Arguments:
$ gtftk closest_genes -h
Usage: gtftk closest_genes [-i GTF] [-o GTF/TXT] [-r {tss,tts,gene}] [-nb nb_neighbors] [-t {tss,tts,gene}] [-s] [-S] [-f] [-H] [-k] [-id {gene_id,gene_name}] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Find the n closest genes for each genes.
Notes:
* The reference region for each gene can be the TSS (the most 5'), the TTS (The most 3') or
the whole gene.
* The reference region for each closest gene can be the TSS, the whole gene or the TTS.
* The closest genes can be searched in a stranded or unstranded fashion.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Arguments:
-r, --from-region-type What is region to consider for each gene. (default: tss)
-nb, --nb-neighbors The size of the neighborhood. (default: 1)
-t, --to-region-type What is region to consider for each closest gene. (default: tss)
-s, --same-strandedness Require same strandedness (default: False)
-S, --diff-strandedness Require different strandedness (default: False)
-f, --text-format Return a text format. (default: False)
-H, --no-header Don't print the header line. (default: False)
-k, --collapse Unwrap. Don't use comma. Print closest genes line by line. (default: False)
-id, --identifier The key used as gene identifier. (default: gene_id)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
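The notion of a reference region can be illustrated with a minimal Python sketch that reports, for each gene, the single closest gene on the same chromosome by TSS-to-TSS distance. This is not gtftk's implementation: the `Gene` tuple, the `tss` and `closest_by_tss` helpers and the toy coordinates are assumptions made for illustration, and strand filtering (-s/-S), larger neighborhoods (-nb) and the other region types are left out.

```python
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive (GTF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def tss(gene: Gene) -> int:
    """Most 5' position of the gene: start on '+', end on '-'."""
    return gene.start if gene.strand == "+" else gene.end


def closest_by_tss(genes: List[Gene]) -> Dict[str, Tuple[str, int]]:
    """Map each gene_id to (closest gene_id, TSS-to-TSS distance)."""
    result = {}
    for g in genes:
        candidates = [
            (abs(tss(g) - tss(o)), o.gene_id)
            for o in genes
            if o.gene_id != g.gene_id and o.chrom == g.chrom
        ]
        if candidates:
            dist, other = min(candidates)
            result[g.gene_id] = (other, dist)
    return result


# Hypothetical toy annotation (not taken from simple.gtf).
toy = [
    Gene("A", "chr1", 100, 200, "+"),
    Gene("B", "chr1", 150, 260, "-"),
    Gene("C", "chr1", 500, 600, "+"),
]
closest = closest_by_tss(toy)
```

Here gene B is on the minus strand, so its TSS is its end coordinate (260) rather than its start.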
overlapping
Description: Find transcripts whose body/TSS/TTS region, extended in 5’ and 3’ (-u/-d), overlaps with any transcript from another gene. Strandedness is not considered by default. Use --invert-match to find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file are printed and a new key containing the list of overlapping transcripts is added to the transcript features/lines (the key is ‘overlapping_*’, with * one of body/TSS/TTS). The --annotate-gtf and --invert-match arguments are mutually exclusive.
Example: Find transcripts whose promoter overlaps transcripts from other genes.
$ gtftk overlapping -i simple.gtf -c simple.chromInfo -t promoter -u 10 -d 10 -a | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_promoter_u0.01k_d0.01k | head
transcript_id overlap_promoter_u0.01k_d0.01k
G0001T002 G0007T001,G0007T002
G0001T001 G0007T001,G0007T002
G0002T001 G0010T001
G0003T001 G0004T002,G0004T001
G0004T002 G0003T001
G0004T001 G0003T001
G0005T001 G0003T001
G0006T001 G0005T001
G0006T002 G0005T001
Arguments:
$ gtftk overlapping -h
Usage: gtftk overlapping [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-t {transcript,promoter,tts}] [-s] [-S] [-n] [-a] [-k key_name] [-b] [-@] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Find transcripts whose body/TSS/TTS region extended in 5' and 3' (-u/-d) overlaps with any
transcript from another gene. Strandness is not considered by default. Used --invert-match to
find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file will
be printed and a new key containing the list of overlapping transcripts will be added to the
transcript features/lines (key will be 'overlapping_*' with * one of body/TSS/TTS). The
--annotate-gtf and --invert-match arguments are mutually exclusive.
Notes:
* --chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding size of conventional chromosomes are used. ChrM is not used.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Chromosome information. A tabulated two-columns file with chromosomes as column 1 and sizes as column 2 (default: None)
-u, --upstream Extend the region in 5'. Used to define the region around the TSS/TTS. (default: 1500)
-d, --downstream Extend the region in 3'. Used to define the region around the TSS/TTS. (default: 1500)
-t, --feature-type The feature of interest. (default: transcript)
-s, --same-strandedness Require same strandedness (default: False)
-S, --diff-strandedness Require different strandedness (default: False)
-n, --invert-match Not/Invert match. (default: False)
-a, --annotate-gtf All lines of the original GTF will be printed. (default: False)
-k, --key-name The name of the key. (default: None)
-b, --bool When --annotate-gtf is used use 0 or 1 as key values (instead of overlapping transcripts id). (default: False)
-@, --annotate-all When --annotate-gtf annotate all transcripts (default value would be '0'). (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
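The effect of -u/-d and of the chromosome sizes can be sketched in a few lines of Python: the promoter is built around the TSS in a strand-aware way, clipped to the chromosome bounds, and then tested for interval overlap against transcripts of other genes. This is only an illustration under stated assumptions, not the code gtftk runs: the `Transcript` tuple, the `promoter` and `overlapping` helpers and the toy coordinates are hypothetical, and strandedness options are ignored.

```python
from typing import Dict, List, NamedTuple, Tuple


class Transcript(NamedTuple):
    tx_id: str
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def promoter(tx: Transcript, upstream: int, downstream: int,
             chrom_sizes: Dict[str, int]) -> Tuple[int, int]:
    """Region around the TSS, extended strand-wise and clipped to the chromosome."""
    if tx.strand == "+":
        lo, hi = tx.start - upstream, tx.start + downstream
    else:
        lo, hi = tx.end - downstream, tx.end + upstream
    return max(1, lo), min(chrom_sizes[tx.chrom], hi)


def overlapping(transcripts: List[Transcript], upstream: int, downstream: int,
                chrom_sizes: Dict[str, int]) -> Dict[str, List[str]]:
    """Map each transcript id to the transcripts of other genes hit by its promoter."""
    hits = {}
    for tx in transcripts:
        lo, hi = promoter(tx, upstream, downstream, chrom_sizes)
        hits[tx.tx_id] = [
            o.tx_id
            for o in transcripts
            if o.gene_id != tx.gene_id and o.chrom == tx.chrom
            and o.start <= hi and o.end >= lo  # closed-interval overlap test
        ]
    return hits


# Hypothetical chromosome size and transcripts (not taken from simple.gtf).
chrom_sizes = {"chr1": 300}
toy = [
    Transcript("T1", "G1", "chr1", 50, 120, "+"),
    Transcript("T2", "G2", "chr1", 30, 45, "-"),
    Transcript("T3", "G3", "chr1", 200, 290, "-"),
]
hits = overlapping(toy, upstream=10, downstream=10, chrom_sizes=chrom_sizes)
```

Clipping is the reason a chromosome size table is needed in this sketch: without it, an extended region could run past the end of the chromosome.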
divergent
Description: Find transcripts with divergent promoters. These transcripts are defined here as those whose promoter region (defined by -u/-d) overlaps with the TSS of another gene in reverse/antisense orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as described in “Divergent transcription is associated with promoters of transcriptional regulators” (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key (‘divergent’) whose value is set to ‘.’ if the gene has no antisense transcript in its promoter region. If the gene has an antisense transcript in its promoter region, the ‘divergent’ key is set to the identifier of the transcript whose TSS is closest to the considered promoter. The TSS-to-TSS distance is also provided as an additional key (dist_to_divergent).
Example: Flag divergent transcripts in the example dataset. Select them and produce a tabulated output.
$ gtftk divergent -i simple.gtf -c simple.chromInfo -u 10 -d 10 | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,divergent,dist_to_divergent | head -n 7
transcript_id divergent dist_to_divergent
G0003T001 G0004T002 4
G0004T002 G0003T001 4
G0004T001 G0003T001 4
Arguments:
$ gtftk divergent -h
Usage: gtftk divergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-n] [-S] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Find transcripts with divergent promoters. These transcripts will be defined here as those whose
promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisens
orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as
described in "Divergent transcription is associated with promoters of transcriptional
regulators" (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key
('divergent') whose value is set to '.' if the gene has no antisens transcript in its promoter
region. If the gene has an antisens transcript in its promoter region the 'divergent' key is
set to the identifier of the transcript whose tss is the closest relative to the considered
promoter. The tss to tss distance is also provided as an additional key (dist_to_divergent).
Notes:
* chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding size of conventional chromosomes are used. To get the size of the
chromosome in ensembl format (whithout chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
-u, --upstream Extend the promoter in 5' by a given value (int). Defines the region around the tss. (default: 1500)
-d, --downstream Extend the region in 3' by a given value (int). Defines the region around the tss. (default: 1500)
-n, --no-annotation Do not annotate the GTF. Just select the divergent transcripts. (default: False)
-S, --no-strandness Do not consider strandness (only look whether the promoter from a transcript overlaps with the promoter from another gene). (default: False)
-a, --key-name The name of the key. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
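The definition of a divergent transcript given in the description can be restated as a short Python sketch: for each transcript, look for transcripts of other genes on the opposite strand whose TSS falls inside the promoter region, and keep the one with the nearest TSS. The `Tx` tuple, the `tss` and `flag_divergent` helpers and the coordinates are hypothetical, and using `None` as the distance when nothing is found is an assumption, since the description only specifies the ‘.’ value of the ‘divergent’ key.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Tx(NamedTuple):
    tx_id: str
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def tss(tx: Tx) -> int:
    """Most 5' position: start on '+', end on '-'."""
    return tx.start if tx.strand == "+" else tx.end


def flag_divergent(transcripts: List[Tx], upstream: int,
                   downstream: int) -> Dict[str, Tuple[str, Optional[int]]]:
    """Map each transcript id to (divergent transcript id or '.', TSS distance)."""
    flags = {}
    for tx in transcripts:
        site = tss(tx)
        if tx.strand == "+":
            lo, hi = site - upstream, site + downstream
        else:
            lo, hi = site - downstream, site + upstream
        candidates = [
            (abs(site - tss(o)), o.tx_id)
            for o in transcripts
            if o.gene_id != tx.gene_id and o.chrom == tx.chrom
            and o.strand != tx.strand and lo <= tss(o) <= hi
        ]
        if candidates:
            dist, other = min(candidates)
            flags[tx.tx_id] = (other, dist)
        else:
            flags[tx.tx_id] = (".", None)  # distance placeholder is an assumption
    return flags


# Hypothetical transcripts (not taken from simple.gtf).
toy = [
    Tx("T1", "G1", "chr1", 100, 200, "+"),
    Tx("T2", "G2", "chr1", 40, 95, "-"),
    Tx("T3", "G3", "chr1", 300, 400, "+"),
]
flags = flag_divergent(toy, upstream=10, downstream=10)
```

In this toy data T1 and T2 start 5 bases apart on opposite strands and flag each other, while T3 has no antisense TSS in its promoter.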
convergent
Description: Find transcripts with convergent TTS. These transcripts are defined here as those whose TTS region (defined by -u/-d) overlaps with the TTS of another gene in reverse/antisense orientation. The output is a GTF with an additional key (‘convergent’) whose value is set to ‘.’ if the gene has no convergent transcript in its TTS region. If the gene has an antisense transcript in its TTS region, the ‘convergent’ key is set to the identifier of the transcript whose TTS is closest to the considered TTS. The TTS-to-TTS distance is also provided as an additional key (dist_to_convergent).
Example: Flag convergent transcripts in the example dataset. Select them and produce a tabulated output.
$ gtftk convergent -i simple.gtf -c simple.chromInfo -u 25 -d 25 | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,convergent,dist_to_convergent | head -n 4
transcript_id convergent dist_to_convergent
G0002T001 G0008T001 21
G0008T001 G0002T001 21
G0010T001 G0008T001 24
Arguments:
$ gtftk convergent -h
Usage: gtftk convergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Find transcripts with convergent tts. These transcripts will be defined here as those whose tts
region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisens
orientation. The output is a GTF with an additional key ('convergent') whose value is set to
'.' if the gene has no convergent transcript in its tts region. If the gene has an antisens
transcript in its tts region the 'convergent' key is set to the identifier of the transcript
whose tts is the closest relative to the considered tts. The tts to tts distance is also
provided as an additional key (dist_to_convergent).
Notes:
* chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding size of conventional chromosomes are used. To get the size of the
chromosome in ensembl format (whithout chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
-u, --upstream Extends the tts in 5' by a given value (int). Defines the region around the tts. (default: 1500)
-d, --downstream Extends the region in 3' by a given value (int). Defines the region around the tts. (default: 1500)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
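The convergent case differs from the divergent one only in the site it looks at: the TTS is the most 3' position of a transcript, so it is the end coordinate on the plus strand and the start coordinate on the minus strand. The sketch below applies that rule under the same kind of assumptions as any toy model: the `Tx` tuple, the `tts` and `flag_convergent` helpers and the coordinates are hypothetical, and the `None` distance for unflagged transcripts is an assumption.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Tx(NamedTuple):
    tx_id: str
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def tts(tx: Tx) -> int:
    """Most 3' position: end on '+', start on '-'."""
    return tx.end if tx.strand == "+" else tx.start


def flag_convergent(transcripts: List[Tx], upstream: int,
                    downstream: int) -> Dict[str, Tuple[str, Optional[int]]]:
    """Map each transcript id to (convergent transcript id or '.', TTS distance)."""
    flags = {}
    for tx in transcripts:
        site = tts(tx)
        if tx.strand == "+":
            lo, hi = site - upstream, site + downstream
        else:
            lo, hi = site - downstream, site + upstream
        candidates = [
            (abs(site - tts(o)), o.tx_id)
            for o in transcripts
            if o.gene_id != tx.gene_id and o.chrom == tx.chrom
            and o.strand != tx.strand and lo <= tts(o) <= hi
        ]
        if candidates:
            dist, other = min(candidates)
            flags[tx.tx_id] = (other, dist)
        else:
            flags[tx.tx_id] = (".", None)  # distance placeholder is an assumption
    return flags


# Hypothetical transcripts (not taken from simple.gtf).
toy = [
    Tx("T1", "G1", "chr1", 100, 200, "+"),
    Tx("T2", "G2", "chr1", 210, 300, "-"),
    Tx("T3", "G3", "chr1", 500, 600, "-"),
]
flags = flag_convergent(toy, upstream=25, downstream=25)
```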
exon_sizes
Description: Add a new key to transcript features containing a comma-separated list of exon sizes.
Example:
$ gtftk exon_sizes -i simple.gtf | gtftk select_by_key -t | gtftk tabulate -k transcript_id,exon_sizes
transcript_id exon_sizes
G0001T002 14
G0001T001 14
G0002T001 10
G0003T001 5,5
G0004T002 4,1,3
G0004T001 4,1,3
G0005T001 6,3
G0006T001 3,3,4
G0006T002 3,3
G0007T001 10
G0007T002 10
G0008T001 3,5
G0009T002 12
G0009T001 12
G0010T001 11
Arguments:
$ gtftk exon_sizes -h
Usage: gtftk exon_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Add a new key to transcript features containing a comma-separated list of exon sizes.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output GTF file. (default: <stdout>)
-a, --key-name The name of the key. (default: exon_sizes)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
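As a rough illustration of what the exon_sizes key holds, the following Python sketch reads exon lines from GTF text and joins the exon lengths of each transcript with commas. GTF coordinates are 1-based and inclusive, so an exon spans `end - start + 1` bases. The `exon_sizes` function, the regular expression and the toy records are assumptions made for this example, and listing exons in genomic order is also an assumption: the ordering gtftk uses is not stated here.

```python
import re
from collections import defaultdict
from typing import Dict

TX_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_sizes(gtf_text: str) -> Dict[str, str]:
    """Map each transcript_id to a comma-separated list of its exon sizes."""
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[2] != "exon":
            continue  # only exon features contribute
        match = TX_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    return {
        tx: ",".join(str(end - start + 1) for start, end in sorted(spans))
        for tx, spans in exons.items()
    }


# Hypothetical GTF records (not taken from simple.gtf).
TOY_GTF = "\n".join([
    'chr1\ttoy\ttranscript\t10\t24\t.\t+\t.\tgene_id "G1"; transcript_id "G1T1";',
    'chr1\ttoy\texon\t10\t14\t.\t+\t.\tgene_id "G1"; transcript_id "G1T1";',
    'chr1\ttoy\texon\t20\t24\t.\t+\t.\tgene_id "G1"; transcript_id "G1T1";',
    'chr1\ttoy\texon\t100\t109\t.\t-\t.\tgene_id "G2"; transcript_id "G2T1";',
])
sizes = exon_sizes(TOY_GTF)
```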
intron_sizes
Description: Add a new key to transcript features containing a comma-separated list of intron sizes.
Example:
$ gtftk intron_sizes -i simple.gtf | gtftk select_by_key -t | gtftk tabulate -k transcript_id,intron_sizes
transcript_id intron_sizes
G0001T002 0
G0001T001 0
G0002T001 0
G0003T001 2
G0004T002 2,2
G0004T001 2,2
G0005T001 6
G0006T001 2,2
G0006T002 2
G0007T001 0
G0007T002 0
G0008T001 5
G0009T002 0
G0009T001 0
G0010T001 0
Arguments:
$ gtftk intron_sizes -h
Usage: gtftk intron_sizes [-i GTF] [-o GTF] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Add a new key to transcript features containing a comma-separated list of intron-size.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-a, --key-name The name of the key. (default: intron_sizes)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
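Intron sizes follow from the gaps between consecutive exons. A minimal Python sketch, assuming exons are given as 1-based inclusive `(start, end)` pairs: the intron between two exons spans `next_start - previous_end - 1` bases, and a single-exon transcript is reported as `0`, as in the example output above. The `intron_sizes` function and the coordinates are hypothetical and do not reproduce gtftk's code.

```python
from typing import List, Tuple


def intron_sizes(exons: List[Tuple[int, int]]) -> str:
    """Comma-separated intron sizes for one transcript, '0' if it has no intron."""
    spans = sorted(exons)  # genomic order (an assumption of this sketch)
    gaps = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]
    return ",".join(str(gap) for gap in gaps) if gaps else "0"


# Hypothetical exon coordinates.
two_exons = intron_sizes([(10, 14), (17, 21)])
three_exons = intron_sizes([(10, 12), (1, 4), (7, 7)])
single_exon = intron_sizes([(5, 18)])
```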