Commands from section ‘annotation’

The examples in this section require the following example files:

$ gtftk get_example -q -d simple -f '*'
$ gtftk get_example -q -d mini_real -f '*'
$ gtftk get_example -q -d hg38_chr1 -f '*'

closest_genes

Description: Find the n closest genes for each gene.

Example:

$ gtftk closest_genes  -i simple.gtf -f
genes	closest_genes	distances
G0001	G0007	18
G0002	G0010	4
G0003	G0004	4
G0004	G0003	4
G0005	G0006	12
G0006	G0005	12
G0007	G0001	18
G0008	G0002	42
G0009	G0006	21
G0010	G0002	4
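
To make the default search (-r tss, -t tss, one neighbor) more concrete, here is a minimal sketch of a TSS-to-TSS nearest-neighbor lookup. It is not gtftk's implementation: the function name closest_gene_sketch, the five-column input table (gene id, chromosome, start, end, strand) and the coordinates are made up for illustration. It also assumes that the distance is the absolute difference between the two TSS positions and that the first gene found wins a tie.

```shell
# Hypothetical helper (not part of gtftk): nearest gene by TSS-to-TSS distance.
# Input columns (tab-separated): gene_id, chrom, start, end, strand.
closest_gene_sketch() {
  awk -F'\t' '
    {
      id[NR] = $1; chrom[NR] = $2
      # The TSS is the start on the + strand and the end on the - strand.
      tss[NR] = ($5 == "+") ? $3 : $4
    }
    END {
      for (i = 1; i <= NR; i++) {
        best = ""; best_dist = -1
        for (j = 1; j <= NR; j++) {
          # Skip the gene itself and genes on other chromosomes.
          if (i == j || chrom[i] != chrom[j]) continue
          dist = tss[i] - tss[j]; if (dist < 0) dist = -dist
          if (best_dist < 0 || dist < best_dist) { best = id[j]; best_dist = dist }
        }
        # Genes that are alone on their chromosome produce no line.
        if (best != "") print id[i] "\t" best "\t" best_dist
      }
    }
  '
}

# Made-up genes: A and C on the + strand, B on the - strand, D alone on chr2.
printf 'A\tchr1\t100\t200\t+\nB\tchr1\t250\t300\t-\nC\tchr1\t320\t400\t+\nD\tchr2\t10\t50\t+\n' |
  closest_gene_sketch
```

With this toy input, A is paired with B (distance 200), while B and C are paired with each other (distance 20). The quadratic loop keeps the sketch short; it is not meant for real annotation files.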

Arguments:

$ gtftk closest_genes -h
  Usage: gtftk closest_genes [-i GTF] [-o GTF/TXT] [-r {tss,tts,gene}] [-nb nb_neighbors] [-t {tss,tts,gene}] [-s] [-S] [-f] [-H] [-k] [-id {gene_id,gene_name}] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Find the n closest genes for each genes.

  Notes:
     *  The reference region for each gene can be the TSS (the most 5'), the TTS (The most 3') or
     the whole gene.
     *  The reference region for each closest gene can be the TSS, the whole gene or the TTS.
     *  The closest genes can be searched in a stranded or unstranded fashion.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)

Arguments:
 -r, --from-region-type       What is region to consider for each gene. (default: tss)
 -nb, --nb-neighbors          The size of the neighborhood. (default: 1)
 -t, --to-region-type         What is region to consider for each closest gene. (default: tss)
 -s, --same-strandedness      Require same strandedness (default: False)
 -S, --diff-strandedness      Require different strandedness (default: False)
 -f, --text-format            Return a text format. (default: False)
 -H, --no-header              Don't print the header line. (default: False)
 -k, --collapse               Unwrap. Don't use comma. Print closest genes line by line. (default: False)
 -id, --identifier            The key used as gene identifier. (default: gene_id)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

overlapping

Description: Find transcripts whose body/TSS/TTS region, extended in 5’ and 3’ (-u/-d), overlaps with any transcript from another gene. Strandedness is not considered by default. Use --invert-match to find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file are printed and a new key containing the list of overlapping transcripts is added to the transcript features/lines (the key will be ‘overlapping_*’, with * one of body/TSS/TTS). The --annotate-gtf and --invert-match arguments are mutually exclusive.

Example: Find transcripts whose promoter overlaps transcripts from other genes.

$ gtftk overlapping -i simple.gtf -c simple.chromInfo -t promoter -u 10 -d 10 -a    | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,overlap_promoter_u0.01k_d0.01k | head
transcript_id	overlap_promoter_u0.01k_d0.01k
G0001T002	G0007T001,G0007T002
G0001T001	G0007T001,G0007T002
G0002T001	G0010T001
G0003T001	G0004T002,G0004T001
G0004T002	G0003T001
G0004T001	G0003T001
G0005T001	G0003T001
G0006T001	G0005T001
G0006T002	G0005T001
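
The -u/-d extension is strand-aware: "upstream" means lower coordinates on the + strand and higher coordinates on the - strand. The sketch below illustrates one plausible way to derive a promoter window from a transcript and clip it to the chromosome size. The function name promoter_window_sketch and the exact boundary convention (TSS minus/plus the extension, clipped to 1..size) are assumptions made for illustration, not gtftk's documented behavior.

```shell
# Hypothetical helper (not part of gtftk).
# usage: promoter_window_sketch START END STRAND UPSTREAM DOWNSTREAM CHROM_SIZE
promoter_window_sketch() {
  awk -v s="$1" -v e="$2" -v strand="$3" -v u="$4" -v d="$5" -v size="$6" 'BEGIN {
    if (strand == "+") { tss = s; lo = tss - u; hi = tss + d }  # upstream = lower coordinates
    else               { tss = e; lo = tss - d; hi = tss + u }  # upstream = higher coordinates
    # Clip the window so that it stays inside the chromosome.
    if (lo < 1) lo = 1
    if (hi > size) hi = size
    print lo "\t" hi
  }'
}

# Made-up transcripts on a 300 bp chromosome, with -u 10 -d 10.
promoter_window_sketch 65 76 + 10 10 300   # window around a + strand TSS at 65
promoter_window_sketch 50 61 - 10 10 300   # window around a - strand TSS at 61
promoter_window_sketch 5 20 + 10 10 300    # window clipped at the chromosome start
```

Clipping is one reason a command of this kind would need chromosome sizes, which is consistent with -c being a required argument here.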

Arguments:

$ gtftk overlapping -h
  Usage: gtftk overlapping [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-t {transcript,promoter,tts}] [-s] [-S] [-n] [-a] [-k key_name] [-b] [-@] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Find transcripts whose body/TSS/TTS region extended in 5' and 3' (-u/-d) overlaps with any
     transcript from another gene. Strandness is not considered by default. Used --invert-match to
     find those that do not overlap. If --annotate-gtf is used, all lines of the input GTF file will
     be printed and a new key containing the list of overlapping transcripts will be added to the
     transcript features/lines (key will be 'overlapping_*' with * one of body/TSS/TTS). The
     --annotate-gtf and --invert-match arguments are mutually exclusive.

  Notes:
     *  --chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
     case the  corresponding size of conventional chromosomes are used. ChrM is not used.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -c, --chrom-info             Chromosome information. A tabulated two-columns file with chromosomes as column 1 and sizes as column 2 (default: None)
 -u, --upstream               Extend the region in 5'. Used to define the region around the TSS/TTS. (default: 1500)
 -d, --downstream             Extend the region in 3'. Used to define the region around the TSS/TTS. (default: 1500)
 -t, --feature-type           The feature of interest. (default: transcript)
 -s, --same-strandedness      Require same strandedness (default: False)
 -S, --diff-strandedness      Require different strandedness (default: False)
 -n, --invert-match           Not/Invert match. (default: False)
 -a, --annotate-gtf           All lines of the original GTF will be printed. (default: False)
 -k, --key-name               The name of the key. (default: None)
 -b, --bool                   When --annotate-gtf is used use 0 or 1 as key values (instead of overlapping transcripts id). (default: False)
 -@, --annotate-all           When --annotate-gtf annotate all transcripts (default value would be '0'). (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

divergent

Description: Find transcripts with divergent promoters. These transcripts are defined here as those whose promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisense orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as described in “Divergent transcription is associated with promoters of transcriptional regulators” (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key (‘divergent’) whose value is set to ‘.’ if the gene has no antisense transcript in its promoter region. If the gene has an antisense transcript in its promoter region, the ‘divergent’ key is set to the identifier of the transcript whose tss is closest to the considered promoter. The tss-to-tss distance is also provided as an additional key (dist_to_divergent).

Example: Flag divergent transcripts in the example dataset. Select them and produce a tabulated output.

$ gtftk divergent -i simple.gtf -c simple.chromInfo -u 10 -d 10| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,divergent,dist_to_divergent | head  -n 7
transcript_id	divergent	dist_to_divergent
G0003T001	G0004T002	4
G0004T002	G0003T001	4
G0004T001	G0003T001	4
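
The definition above can be sketched in a few lines of awk: for each transcript, build the promoter window, then look for the closest tss of a transcript from another gene on the opposite strand that falls inside it. This is only an illustration under stated assumptions. The function name divergent_sketch, the five-column input (transcript id, gene id, start, end, strand) and the coordinates are hypothetical, all transcripts are assumed to lie on the same chromosome, and the window boundaries are treated as inclusive.

```shell
# Hypothetical helper (not part of gtftk).
# usage: divergent_sketch UPSTREAM DOWNSTREAM < table
# Table columns (tab-separated): transcript_id, gene_id, start, end, strand.
divergent_sketch() {
  awk -F'\t' -v u="$1" -v d="$2" '
    {
      tx[NR] = $1; gene[NR] = $2; strand[NR] = $5
      tss[NR] = ($5 == "+") ? $3 : $4
    }
    END {
      for (i = 1; i <= NR; i++) {
        # Promoter window of transcript i, oriented by its strand.
        if (strand[i] == "+") { lo = tss[i] - u; hi = tss[i] + d }
        else                  { lo = tss[i] - d; hi = tss[i] + u }
        best = "."; best_dist = "."
        for (j = 1; j <= NR; j++) {
          # Keep only antisense transcripts from another gene.
          if (gene[j] == gene[i] || strand[j] == strand[i]) continue
          # Their tss must fall inside the promoter window.
          if (tss[j] < lo || tss[j] > hi) continue
          dist = tss[i] - tss[j]; if (dist < 0) dist = -dist
          if (best == "." || dist < best_dist) { best = tx[j]; best_dist = dist }
        }
        print tx[i] "\t" best "\t" best_dist
      }
    }
  '
}

# Made-up transcripts: T1 (- strand) and T2 (+ strand) are head to head, T3 is isolated.
printf 'T1\tGA\t50\t61\t-\nT2\tGB\t65\t76\t+\nT3\tGC\t200\t250\t+\n' |
  divergent_sketch 10 10
```

On this toy input T1 and T2 flag each other with a distance of 4, and T3 gets ‘.’ in both columns, which mirrors the convention described for the ‘divergent’ key.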

Arguments:

$ gtftk divergent -h
  Usage: gtftk divergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-n] [-S] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Find transcripts with divergent promoters. These transcripts will be defined here as those whose
     promoter region (defined by -u/-d) overlaps with the tss of another gene in reverse/antisens
     orientation. This may be useful to select coding genes in head-to-head orientation or LUAT as
     described in "Divergent transcription is associated with promoters of transcriptional
     regulators" (Lepoivre C, BMC Genomics, 2013). The output is a GTF with an additional key
     ('divergent') whose value is set to '.' if the gene has no antisens transcript in its promoter
     region. If the gene has an antisens transcript in its promoter region the 'divergent' key is
     set to the identifier of the transcript whose tss is the closest relative to the considered
     promoter. The tss to tss distance is also provided as an additional key (dist_to_divergent).

  Notes:
     *  chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
     case the  corresponding size of conventional chromosomes are used. To get the size of  the
     chromosome in ensembl format (whithout chr prefix), use 'mm8_ens', 'mm9_ens',  'mm10_ens',
     'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -c, --chrom-info             Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
 -u, --upstream               Extend the promoter in 5' by a given value (int). Defines the region around the tss. (default: 1500)
 -d, --downstream             Extend the region in 3' by a given value (int). Defines the region around the tss. (default: 1500)
 -n, --no-annotation          Do not annotate the GTF. Just select the divergent transcripts. (default: False)
 -S, --no-strandness          Do not consider strandness (only look whether the promoter from a transcript overlaps with the promoter from another gene). (default: False)
 -a, --key-name               The name of the key. (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

convergent

Description: Find transcripts with convergent tts. These transcripts are defined here as those whose tts region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisense orientation. The output is a GTF with an additional key (‘convergent’) whose value is set to ‘.’ if the gene has no convergent transcript in its tts region. If the gene has an antisense transcript in its tts region, the ‘convergent’ key is set to the identifier of the transcript whose tts is closest to the considered tts. The tts-to-tts distance is also provided as an additional key (dist_to_convergent).

Example: Flag convergent transcripts in the example dataset. Select them and produce a tabulated output.

$ gtftk convergent -i simple.gtf -c simple.chromInfo -u 25 -d 25| gtftk select_by_key -k feature -v transcript | gtftk tabulate -k transcript_id,convergent,dist_to_convergent| head -n 4
transcript_id	convergent	dist_to_convergent
G0002T001	G0008T001	21
G0008T001	G0002T001	21
G0010T001	G0008T001	24

Arguments:

$ gtftk convergent -h
  Usage: gtftk convergent [-i GTF] [-o GTF] -c CHROMINFO [-u UPSTREAM] [-d DOWNSTREAM] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Find transcripts with convergent tts. These transcripts will be defined here as those whose tts
     region (defined by -u/-d) overlaps with the tts of another gene in reverse/antisens
     orientation. The output is a GTF with an additional key ('convergent') whose value is set to
     '.' if the gene has no convergent transcript in its tts region. If the gene has an antisens
     transcript in its tts region the 'convergent' key is set to the identifier of the transcript
     whose tts is the closest relative to the considered tts. The tts to tts distance is also
     provided as an additional key (dist_to_convergent).

  Notes:
     *  chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
     case the  corresponding size of conventional chromosomes are used. To get the size of  the
     chromosome in ensembl format (whithout chr prefix), use 'mm8_ens', 'mm9_ens',  'mm10_ens',
     'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -c, --chrom-info             Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
 -u, --upstream               Extends the tts in 5' by a given value (int). Defines the region around the tts. (default: 1500)
 -d, --downstream             Extends the region in 3' by a given value (int). Defines the region around the tts. (default: 1500)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

exon_sizes

Description: Add a new key to transcript features containing a comma-separated list of exon sizes.

Example:

$ gtftk exon_sizes -i simple.gtf | gtftk select_by_key -t | gtftk tabulate -k transcript_id,exon_sizes
transcript_id	exon_sizes
G0001T002	14
G0001T001	14
G0002T001	10
G0003T001	5,5
G0004T002	4,1,3
G0004T001	4,1,3
G0005T001	6,3
G0006T001	3,3,4
G0006T002	3,3
G0007T001	10
G0007T002	10
G0008T001	3,5
G0009T002	12
G0009T001	12
G0010T001	11
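
As a rough illustration of what such a list contains, the sketch below derives exon sizes from GTF exon lines. GTF coordinates are 1-based and inclusive, so an exon spanning start..end has size end - start + 1. The function name exon_sizes_sketch and the input lines are made up, and the sizes are listed in the order the exons appear in the input, which is an assumption and may differ from the order gtftk uses.

```shell
# Hypothetical helper (not part of gtftk): comma-separated exon sizes per transcript.
exon_sizes_sketch() {
  awk -F'\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
      # Extract the value between the quotes of transcript_id "...".
      id = substr($9, RSTART + 15, RLENGTH - 16)
      size = $5 - $4 + 1            # 1-based inclusive coordinates
      if (!(id in sizes)) { order[++n] = id; sizes[id] = size }
      else sizes[id] = sizes[id] "," size
    }
    END { for (i = 1; i <= n; i++) print order[i] "\t" sizes[order[i]] }
  '
}

# Made-up exon lines for two hypothetical transcripts.
{
  printf 'chr1\tsketch\texon\t65\t68\t.\t+\t.\tgene_id "GX"; transcript_id "GXT1";\n'
  printf 'chr1\tsketch\texon\t71\t71\t.\t+\t.\tgene_id "GX"; transcript_id "GXT1";\n'
  printf 'chr1\tsketch\texon\t74\t76\t.\t+\t.\tgene_id "GX"; transcript_id "GXT1";\n'
  printf 'chr1\tsketch\texon\t50\t54\t.\t-\t.\tgene_id "GY"; transcript_id "GYT1";\n'
  printf 'chr1\tsketch\texon\t57\t61\t.\t-\t.\tgene_id "GY"; transcript_id "GYT1";\n'
} | exon_sizes_sketch
```

For this input the sketch reports 4,1,3 for GXT1 and 5,5 for GYT1.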

Arguments:

$ gtftk exon_sizes -h
  Usage: gtftk exon_sizes [-i GTF] [-o TXT] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Add a new key to transcript features containing a comma-separated list of exon sizes.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile             Output GTF file. (default: <stdout>)
 -a, --key-name               The name of the key. (default: exon_sizes)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

intron_sizes

Description: Add a new key to transcript features containing a comma-separated list of intron sizes.

Example:

$ gtftk intron_sizes -i simple.gtf | gtftk select_by_key -t | gtftk tabulate -k transcript_id,intron_sizes
transcript_id	intron_sizes
G0001T002	0
G0001T001	0
G0002T001	0
G0003T001	2
G0004T002	2,2
G0004T001	2,2
G0005T001	6
G0006T001	2,2
G0006T002	2
G0007T001	0
G0007T002	0
G0008T001	5
G0009T002	0
G0009T001	0
G0010T001	0

Arguments:

$ gtftk intron_sizes -h
  Usage: gtftk intron_sizes [-i GTF] [-o GTF] [-a key_name] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Add a new key to transcript features containing a comma-separated list of intron-size.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -a, --key-name               The name of the key. (default: intron_sizes)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)