Commands from section ‘coordinates’
In this section we will require the following datasets:
$ gtftk get_example -q -d simple -f '*'
midpoints
Description: Get the genomic midpoint of each feature: genes, transcripts, exons or introns. Output is currently available in BED format only.
Example: Get the midpoints of all transcripts and exons.
$ gtftk midpoints -i simple.gtf -t transcript,exon -n transcript_id,feature | head -n 5
chr1 7 9 G0009T002|transcript . -
chr1 7 9 G0009T001|exon . -
chr1 7 9 G0009T001|transcript . -
chr1 7 9 G0009T002|exon . -
chr1 27 29 G0006T001|transcript . -
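The coordinate arithmetic behind a midpoint can be sketched in a few lines of Python. This is an illustration only, not gtftk's implementation: the function name `midpoint_to_bed` is hypothetical, and the convention of reporting the two central bases of an even-length feature is an assumption. Input coordinates are 1-based and inclusive (GTF style); the result is 0-based and half-open (BED style).

```python
def midpoint_to_bed(start, end):
    """Return the (bed_start, bed_end) midpoint of a GTF interval.

    `start` and `end` are 1-based and inclusive, as in a GTF file.
    Assumed convention: an odd-length feature has one central base,
    an even-length feature has two.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    length = end - start + 1
    if length % 2 == 1:
        mid_first = mid_last = start + length // 2
    else:
        mid_last = start + length // 2
        mid_first = mid_last - 1
    # BED is 0-based, half-open: subtract 1 from the first base only.
    return mid_first - 1, mid_last


print(midpoint_to_bed(10, 14))  # 5 bases, central base is 12
print(midpoint_to_bed(10, 13))  # 4 bases, central bases are 11 and 12
```

Under this assumed convention, an even-length feature produces a 2-bp BED interval and an odd-length feature a 1-bp interval.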
Arguments:
$ gtftk midpoints -h
Usage: gtftk midpoints [-i GTF/BED] [-o BED] [-t ft_type] [-n NAME] [-s SEP] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Get the midpoint coordinates for the requested feature. Output is bed format.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-t, --ft-type The target feature (as found in the 3rd column of the GTF). (default: transcript)
-n, --names The key(s) that should be used as name. (default: transcript_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
get_5p_3p_coords
Description: Get the 5p or 3p coordinates for each feature (e.g. TSS or TTS for a transcript). Output is in BED format.
Example: Get the 5p ends of transcripts and exons.
$ gtftk get_5p_3p_coords -i simple.gtf -t transcript,exon -n transcript_id,gene_id,feature | head -n 5
chr1 124 125 G0001T002|G0001|transcript . +
chr1 124 125 G0001T002|G0001|exon . +
chr1 124 125 G0001T001|G0001|transcript . +
chr1 124 125 G0001T001|G0001|exon . +
chr1 179 180 G0002T001|G0002|transcript . +
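The strand logic can be sketched as follows. This is a minimal Python illustration, not gtftk's code; the function name `five_or_three_prime_bed` and its `three_prime` parameter (mirroring the idea of -v) are hypothetical, and the --transpose option is not modelled. It assumes 1-based, inclusive GTF coordinates as input and a 1-bp BED interval as output.

```python
def five_or_three_prime_bed(start, end, strand, three_prime=False):
    """Return the 1-bp BED interval of the 5' (default) or 3' end.

    `start` and `end` are 1-based, inclusive GTF coordinates.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On '+', the 5' end is the start; on '-', it is the end.
    # Asking for the 3' end swaps the choice.
    use_start = (strand == "+") != three_prime
    pos = start if use_start else end
    # BED is 0-based, half-open, so a single base is (pos - 1, pos).
    return pos - 1, pos


# Gene G0001 spans 125-138 on '+' in the example dataset.
print(five_or_three_prime_bed(125, 138, "+"))                    # (124, 125)
print(five_or_three_prime_bed(125, 138, "+", three_prime=True))  # (137, 138)
print(five_or_three_prime_bed(125, 138, "-"))                    # (137, 138)
```

The first call gives 124-125, which is consistent with the 5' ends reported above for the transcripts of G0001.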
Arguments:
$ gtftk get_5p_3p_coords -h
Usage: gtftk get_5p_3p_coords [-i GTF] [-o BED] [-t ft_type] [-v] [-p transpose] [-n NAME] [-m more_names] [-s SEP] [-e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Get the 5p or 3p coordinate for each feature (e.g TSS or TTS for a transcript).
Notes:
* Output is in BED format.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-t, --ft-type The target feature (as found in the 3rd column of the GTF). (default: transcript)
-v, --invert Get 3' coordinate. (default: False)
-p, --transpose Transpose coordinate in 5' (use negative value) or in 3' (use positive values). (default: 0)
-n, --names The key(s) that should be used as name. (default: gene_id,transcript_id)
-m, --more-names A comma-separated list of information to be added to the 'name' column of the bed file. (default: None)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
-e, --explicit Write explicitly the name of the keys in the header. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
intergenic
Description: Extract intergenic regions. This command requires a chromInfo file to compute the bed file boundaries. The command will print the coordinates of genomic regions without transcript features.
Example: Simply get intergenic regions.
$ gtftk intergenic -i simple.gtf -c simple.chromInfo
chr1 0 2 region_1 0 .
chr1 14 21 region_2 0 .
chr1 47 49 region_3 0 .
chr1 61 64 region_4 0 .
chr1 76 106 region_5 0 .
chr1 116 124 region_6 0 .
chr1 138 175 region_7 0 .
chr1 189 209 region_8 0 .
chr1 222 300 region_9 0 .
chr2 0 600 region_10 0 .
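Conceptually, the intergenic regions are the complement of the transcript-covered intervals within each chromosome. The Python sketch below illustrates that idea under stated assumptions; it is not gtftk's implementation. The function name `intergenic_regions` is hypothetical, the covered intervals are reconstructed by complementing the output above, and the chromosome sizes (300 and 600) are inferred from the last region of each chromosome.

```python
def intergenic_regions(covered, chrom_sizes):
    """Yield BED6 rows for regions not covered by any interval.

    covered: dict mapping chromosome -> list of (start, end) in BED
             coordinates (0-based, half-open).
    chrom_sizes: dict mapping chromosome -> size, as read from a
                 two-column chromInfo file.
    """
    region_nb = 0
    for chrom in sorted(chrom_sizes):
        cursor = 0  # first position not yet known to be covered
        for start, end in sorted(covered.get(chrom, [])):
            if start > cursor:
                region_nb += 1
                yield (chrom, cursor, start, f"region_{region_nb}", 0, ".")
            cursor = max(cursor, end)
        if cursor < chrom_sizes[chrom]:
            region_nb += 1
            yield (chrom, cursor, chrom_sizes[chrom],
                   f"region_{region_nb}", 0, ".")


# Covered intervals reconstructed from the output above (assumption).
covered = {"chr1": [(2, 14), (21, 47), (49, 61), (64, 76),
                    (106, 116), (124, 138), (175, 189), (209, 222)]}
chrom_sizes = {"chr1": 300, "chr2": 600}  # inferred sizes (assumption)

for row in intergenic_regions(covered, chrom_sizes):
    print(*row, sep="\t")
```

A chromosome listed in the chromInfo file but carrying no feature, such as chr2 here, is reported as a single region spanning its whole length.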
Arguments:
$ gtftk intergenic -h
Usage: gtftk intergenic [-i GTF] [-o BED] -c CHROMINFO [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Extract intergenic regions. This command requires a chromInfo file to compute the bed file
boundaries. The command will print the coordinates of genomic regions without any transcript
features.
Notes:
* chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding sizes of conventional chromosomes are used. To get the sizes of the
chromosomes in ensembl format (without chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
intronic
Description: Returns a BED file containing the intronic regions. If --by-transcript is not set (default), returns merged genic regions with no exonic overlap (“strict” mode). Otherwise, the intronic regions corresponding to each transcript are returned (these may contain exonic overlap and redundancy).
Example: Simply get intronic regions.
$ gtftk intronic -i simple.gtf | head -n 5
chr1 25 27
chr1 30 32
chr1 35 41
chr1 54 56
chr1 68 70
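The difference between the "strict" mode and the per-transcript mode can be illustrated with a small Python sketch. It is a simplified model, not gtftk's implementation: the function name `uncovered` and the two transcripts below are hypothetical, and all intervals are assumed to be in BED coordinates (0-based, half-open) with exons lying inside the given span.

```python
def uncovered(span, exons):
    """Return the parts of `span` that no exon covers.

    span: (start, end); exons: list of (start, end).
    """
    span_start, span_end = span
    gaps = []
    cursor = span_start  # first position not yet covered by an exon
    for start, end in sorted(exons):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < span_end:
        gaps.append((cursor, span_end))
    return gaps


# Two hypothetical transcripts of the same gene (span 100-150).
t1 = [(100, 110), (130, 150)]
t2 = [(100, 110), (120, 125), (130, 150)]

# "Strict" mode: pool the exons of all transcripts first.
print(uncovered((100, 150), t1 + t2))  # [(110, 120), (125, 130)]

# Per-transcript mode: each transcript is processed on its own.
print(uncovered((100, 150), t1))       # [(110, 130)]
print(uncovered((100, 150), t2))       # [(110, 120), (125, 130)]
```

In this toy case the intron of the first transcript (110-130) overlaps an exon of the second one (120-125), which is the kind of exonic overlap the per-transcript mode may return.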
Arguments:
$ gtftk intronic -h
Usage: gtftk intronic [-i GTF] [-o BED] [-b] [-n NAME] [-s SEP] [-w] [-F] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Returns a bed file containing the intronic regions. If by_transcript is false (default), returns
merged genic regions with no exonic overlap ("strict" mode). Otherwise, the intronic regions
corresponding to each transcript are returned (may contain exonic overlap and redundancy).
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file (BED). (default: <stdout>)
-b, --by-transcript The intronic regions are returned for each transcript. (default: False)
-n, --names The key(s) that should be used as name (if -b is used). (default: gene_id,transcript_id)
-s, --separator The separator to be used for separating name elements (if -b is used). (default: |)
-w, --intron-nb-in-name By default intron number is written in 'score' column. Force it to be written in 'name' column. (default: False)
-F, --no-feature-name Don't add the feature name ('intron') in the name column. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
splicing_site
Description: Compute the locations of donor and acceptor splice sites. For each site, the command returns a single position, which corresponds to the most 5’ and/or the most 3’ base of the intronic region. If the GTF file does not contain exon numbering, you can compute it using the add_exon_nb command. The score column of the BED file contains the number of the closest exon relative to the splice site.
Example:
$ gtftk add_exon_nb -i simple.gtf -k exon_nbr | gtftk splicing_site -k exon_nbr | head
chr1 54 55 acceptor|G0003T001E001|G0003T001|G0003 2 -
chr1 55 56 donor|G0003T001E002|G0003T001|G0003 1 -
chr1 68 69 donor|G0004T002E001|G0004T002|G0004 1 +
chr1 71 72 donor|G0004T002E002|G0004T002|G0004 2 +
chr1 69 70 acceptor|G0004T002E002|G0004T002|G0004 2 +
chr1 72 73 acceptor|G0004T002E003|G0004T002|G0004 3 +
chr1 68 69 donor|G0004T001E001|G0004T001|G0004 1 +
chr1 71 72 donor|G0004T001E002|G0004T001|G0004 2 +
chr1 69 70 acceptor|G0004T001E002|G0004T001|G0004 2 +
chr1 72 73 acceptor|G0004T001E003|G0004T001|G0004 3 +
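For downstream processing, each output line can be split back into its components. The Python sketch below is one possible way to do it and is not part of gtftk; the names `SpliceSite` and `parse_splice_site` are hypothetical. It assumes the default -n (exon_id,transcript_id,gene_id) and -s (|) settings, with the site type prepended to the name as in the output above.

```python
from collections import namedtuple

SpliceSite = namedtuple(
    "SpliceSite",
    "chrom start end site exon_id transcript_id gene_id exon_nb strand")


def parse_splice_site(line, sep="|"):
    """Parse one BED6 line produced with the default name settings."""
    chrom, start, end, name, score, strand = line.split()
    # Name layout seen in the example: site type, then the -n keys.
    site, exon_id, transcript_id, gene_id = name.split(sep)
    return SpliceSite(chrom, int(start), int(end), site, exon_id,
                      transcript_id, gene_id, int(score), strand)


record = parse_splice_site(
    "chr1 68 69 donor|G0004T002E001|G0004T002|G0004 1 +")
print(record.site, record.transcript_id, record.exon_nb)
```

If other keys are passed to -n or another separator to -s, the unpacking of the name field has to be adapted accordingly.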
Arguments:
$ gtftk splicing_site -h
Usage: gtftk splicing_site [-i GTF] [-o BED] [-k exon_numbering_key] [-n NAME] [-s SEP] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Compute the locations of donor and acceptor splice sites.
Notes:
* This will return a single position, which corresponds to the most 5' and/or the most 3'
intronic region. If the gtf file does not contain exon numbering you can compute it using the
add_exon_nb command. The score column of the bed file contains the number of the closest exon
relative to the splice site.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --exon-numbering-key The name of the key containing the exon numbering (exon_number in ensembl) (default: exon_number)
-n, --names The key(s) that should be used as name. (default: exon_id,transcript_id,gene_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
shift
Description: Shift coordinates in 3’ or 5’ direction.
Example:
$ gtftk get_example | head -n 1
chr1 gtftk gene 125 138 . + . gene_id "G0001";
$ gtftk shift -i simple.gtf -s -10 -c simple.chromInfo | head -n 1
chr1 gtftk gene 115 128 . + . gene_id "G0001";
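The arithmetic of the example can be sketched in Python. This is a simplified model and not gtftk's implementation: the function name `shift_interval` is hypothetical, only the non-stranded case is covered, and the handling of chromosome boundaries (with and without the idea behind -a) is an assumption. The sign convention is taken from the example above, where -s -10 moves 125-138 to 115-128.

```python
def shift_interval(start, end, shift_value, chrom_size, allow_outside=False):
    """Shift a 1-based, inclusive interval along the forward strand.

    Returns the new (start, end), or None if the feature disappears.
    """
    new_start = start + shift_value
    new_end = end + shift_value
    if allow_outside:
        # Assumed behaviour: trim to the chromosome, drop if nothing is left.
        if new_end < 1 or new_start > chrom_size:
            return None
        return max(new_start, 1), min(new_end, chrom_size)
    # Assumed behaviour: each coordinate is held inside [1, chrom_size],
    # so features pile up at the chromosome ends for large shifts.
    clamped_start = min(max(new_start, 1), chrom_size)
    clamped_end = min(max(new_end, 1), chrom_size)
    return clamped_start, clamped_end


print(shift_interval(125, 138, -10, 300))                       # (115, 128)
print(shift_interval(125, 138, -200, 300))                      # (1, 1)
print(shift_interval(125, 138, -200, 300, allow_outside=True))  # None
```

The chromosome size of 300 used here is the value implied for chr1 by the intergenic example of this section.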
Arguments:
$ gtftk shift -h
Usage: gtftk shift [-i GTF] [-o GTF] -s shift_value [-d] [-a] -c CHROMINFO [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Transpose coordinates in 3' or 5' direction.
Notes:
* chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
case the corresponding sizes of conventional chromosomes are used. To get the sizes of the
chromosomes in ensembl format (without chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
* By default shift is not strand specific, meaning that if --shift-value is set to 10, all
coordinates will be moved 10 bases in 5' direction relative to the forward/watson/plus/top
strand.
* Use a negative value to shift in 3' direction, a positive value to shift in 5' direction.
* If --stranded is true, features are transposed in 5' direction relative to their associated
strand.
* By default, features are not allowed to go outside the genome coordinates. In the current
implementation, in case this would happen (using a very large --shift-value), features would
accumulate at the ends of chromosomes irrespective of gene or transcript structures, giving
rise, ultimately, to several exons from the same transcript having the same starts or ends.
* One can force features to go outside the genome and ultimately disappear with a large
--shift-value by using -a.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --shift-value Shift coordinate by s nucleotides. (default: 0)
-d, --stranded By default shift is not strand specific. (default: False)
-a, --allow-outside Accept the partial or total disappearance of a feature upon shifting. (default: False)
-c, --chrom-info Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)