Commands from section ‘coordinates’

In this section we will require the following datasets:

$ gtftk get_example -q -d simple -f '*'

midpoints

Description: Get the genomic midpoint of each feature: genes, transcripts, exons or introns. Output is currently in bed format only.

Example: Get the midpoints of all transcripts and exons.

$ gtftk midpoints -i simple.gtf -t transcript,exon -n transcript_id,feature | head -n 5
chr1	7	9	G0009T002|transcript	.	-
chr1	7	9	G0009T001|exon	.	-
chr1	7	9	G0009T001|transcript	.	-
chr1	7	9	G0009T002|exon	.	-
chr1	27	29	G0006T001|transcript	.	-
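
The two-base intervals above suggest how a midpoint might be derived from GTF coordinates. The following is a minimal sketch, not the gtftk implementation: the function name `midpoint_bed` and the input coordinates are hypothetical, and the even/odd convention (two central bases for even-length features, one for odd-length ones) is an assumption chosen to be consistent with the interval widths shown.

```python
def midpoint_bed(start, end):
    """Return the 0-based, half-open BED interval covering the midpoint
    of a feature given in 1-based, inclusive GTF coordinates.

    Assumed convention: a single central base for odd-length features,
    the two central bases for even-length features.
    """
    length = end - start + 1
    if length % 2:
        # Odd length: exactly one central base (1-based position `mid`).
        mid = start + length // 2
        return (mid - 1, mid)
    # Even length: the two bases flanking the exact centre.
    left = start + length // 2 - 1  # 1-based position of the left central base
    return (left - 1, left + 1)


# Hypothetical feature spanning positions 3..14 (12 bases, even length).
print(midpoint_bed(3, 14))   # (7, 9)
# Hypothetical feature spanning positions 10..14 (5 bases, odd length).
print(midpoint_bed(10, 14))  # (11, 12)
```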

Arguments:

$ gtftk midpoints -h
  Usage: gtftk midpoints [-i GTF/BED] [-o BED] [-t ft_type] [-n NAME] [-s SEP] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Get the midpoint coordinates for the requested feature. Output is bed format.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file (BED). (default: <stdout>)
 -t, --ft-type                The target feature (as found in the 3rd column of the GTF). (default: transcript)
 -n, --names                  The key(s) that should be used as name. (default: transcript_id)
 -s, --separator              The separator to be used for separating name elements (see -n). (default: |)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

get_5p_3p_coords

Description: Get the 5p or 3p coordinates for each feature (e.g. TSS or TTS for a transcript). Output is bed format.

Example: Get the 5p ends of transcripts and exons.

$ gtftk get_5p_3p_coords -i simple.gtf -t transcript,exon -n transcript_id,gene_id,feature | head -n 5
chr1	124	125	G0001T002|G0001|transcript	.	+
chr1	124	125	G0001T002|G0001|exon	.	+
chr1	124	125	G0001T001|G0001|transcript	.	+
chr1	124	125	G0001T001|G0001|exon	.	+
chr1	179	180	G0002T001|G0002|transcript	.	+
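
The choice of the 5p or 3p end depends on the strand. The sketch below illustrates that logic only; it is not the gtftk code, and the function name `end_coord_bed` is hypothetical. It takes 1-based inclusive GTF coordinates and returns a one-base BED interval. The first call uses the coordinates of gene G0001 (125 to 138, plus strand) from the example dataset.

```python
def end_coord_bed(start, end, strand, three_prime=False):
    """Return the 5' end (or 3' end if three_prime is True) of a feature
    as a one-base, 0-based half-open BED interval.

    start/end are 1-based inclusive GTF coordinates.
    """
    # On the plus strand the 5' end is the start; on the minus strand it
    # is the end. Asking for the 3' end flips that choice.
    use_start = (strand == "+") != three_prime
    pos = start if use_start else end
    return (pos - 1, pos)


print(end_coord_bed(125, 138, "+"))                    # (124, 125)
print(end_coord_bed(125, 138, "+", three_prime=True))  # (137, 138)
print(end_coord_bed(125, 138, "-"))                    # (137, 138)
```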

Arguments:

$ gtftk get_5p_3p_coords -h
  Usage: gtftk get_5p_3p_coords [-i GTF] [-o BED] [-t ft_type] [-v] [-p transpose] [-n NAME] [-m more_names] [-s SEP] [-e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Get the 5p or 3p coordinate for each feature (e.g TSS or TTS for a transcript).

  Notes:
     *  Output is in BED format.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file (BED). (default: <stdout>)
 -t, --ft-type                The target feature (as found in the 3rd column of the GTF). (default: transcript)
 -v, --invert                 Get 3' coordinate. (default: False)
 -p, --transpose              Transpose coordinate in 5' (use negative value) or in 3' (use positive values). (default: 0)
 -n, --names                  The key(s) that should be used as name. (default: gene_id,transcript_id)
 -m, --more-names             A comma-separated list of information to be added to the 'name' column of the bed file. (default: None)
 -s, --separator              The separator to be used for separating name elements (see -n). (default: |)
 -e, --explicit               Write explicitly the name of the keys in the header. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

intergenic

Description: Extract intergenic regions. This command requires a chromInfo file to compute the bed file boundaries. The command will print the coordinates of genomic regions without transcript features.

Example: Simply get intergenic regions.

$ gtftk intergenic -i simple.gtf -c simple.chromInfo
chr1	0	2	region_1	0	.
chr1	14	21	region_2	0	.
chr1	47	49	region_3	0	.
chr1	61	64	region_4	0	.
chr1	76	106	region_5	0	.
chr1	116	124	region_6	0	.
chr1	138	175	region_7	0	.
chr1	189	209	region_8	0	.
chr1	222	300	region_9	0	.
chr2	0	600	region_10	0	.
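
Conceptually, intergenic regions are the complement of the covered regions within each chromosome. The sketch below shows one way to compute such a complement. It is an illustration under assumptions, not the gtftk implementation: the function name `complement` is hypothetical, and the covered intervals and the chromosome sizes (300 and 600) are inferred from the gaps and boundaries of the output above rather than read from simple.gtf or simple.chromInfo.

```python
def complement(covered, chrom_size):
    """Return the regions of [0, chrom_size) not covered by any interval.

    `covered` is a list of (start, end) 0-based half-open intervals on a
    single chromosome; intervals may overlap and need not be sorted.
    """
    gaps = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_size:
        gaps.append((cursor, chrom_size))
    return gaps


# Covered intervals inferred from the example output (assumption).
chr1_covered = [(2, 14), (21, 47), (49, 61), (64, 76),
                (106, 116), (124, 138), (175, 189), (209, 222)]
for region in complement(chr1_covered, 300):
    print("chr1", *region)
# A chromosome without any feature is returned as a single region.
print("chr2", *complement([], 600)[0])
```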

Arguments:

$ gtftk intergenic -h
  Usage: gtftk intergenic [-i GTF] [-o BED] -c CHROMINFO [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Extract intergenic regions. This command requires a chromInfo file to compute the bed file
     boundaries. The command will print the coordinates of genomic regions without any transcript
     features.

  Notes:
     *  chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
     case the corresponding sizes of conventional chromosomes are used. To get the sizes of the
     chromosomes in Ensembl format (without chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
     'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file (BED). (default: <stdout>)
 -c, --chrom-info             Tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

intronic

Description: Returns a bed file containing the intronic regions. If by_transcript is false (default), returns merged genic regions with no exonic overlap (“strict” mode). Otherwise, the intronic regions corresponding to each transcript are returned (may contain exonic overlap and redundancy).

Example: Simply get intronic regions.

$ gtftk intronic -i simple.gtf | head -n 5
chr1	25	27
chr1	30	32
chr1	35	41
chr1	54	56
chr1	68	70
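
The difference between the default ("strict") mode and the by-transcript mode can be illustrated on a toy gene. This is a sketch only, not the gtftk implementation: the helper `gaps`, the transcript names and all coordinates are hypothetical, and overlapping genes are ignored for simplicity.

```python
def gaps(exons):
    """Return the gaps between (start, end) intervals, 0-based half-open.

    Overlapping or duplicated intervals are merged on the fly.
    """
    out = []
    cursor = None  # end of the region covered so far
    for start, end in sorted(exons):
        if cursor is not None and start > cursor:
            out.append((cursor, start))
        cursor = end if cursor is None else max(cursor, end)
    return out


# Hypothetical gene with two transcripts; T2 has an extra internal exon.
transcripts = {
    "T1": [(10, 20), (40, 50)],
    "T2": [(10, 20), (25, 30), (40, 50)],
}

# By transcript: introns of each transcript taken separately.
by_transcript = {tx: gaps(exons) for tx, exons in transcripts.items()}
# Strict: regions of the gene covered by no exon of any transcript.
strict = gaps([e for exons in transcripts.values() for e in exons])

print(by_transcript)  # {'T1': [(20, 40)], 'T2': [(20, 25), (30, 40)]}
print(strict)         # [(20, 25), (30, 40)]
```

In this toy case the intron of T1 overlaps an exon of T2, which is the kind of exonic overlap and redundancy that the by-transcript mode may report.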

Arguments:

$ gtftk intronic -h
  Usage: gtftk intronic [-i GTF] [-o BED] [-b] [-n NAME] [-s SEP] [-w] [-F] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Returns a bed file containing the intronic regions. If by_transcript is false (default), returns
     merged genic regions with no exonic overlap ("strict" mode). Otherwise, the intronic regions
     corresponding to each transcript are returned (may contain exonic overlap and redundancy).

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file (BED). (default: <stdout>)
 -b, --by-transcript          The intronic regions are returned for each transcript. (default: False)
 -n, --names                  The key(s) that should be used as name (if -b is used). (default: gene_id,transcript_id)
 -s, --separator              The separator to be used for separating name elements (if -b is used). (default: |)
 -w, --intron-nb-in-name      By default intron number is written in 'score' column. Force it to be written in 'name' column. (default: False)
 -F, --no-feature-name        Don't add the feature name ('intron') in the name column. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

splicing_site

Description: Compute the locations of donor and acceptor splice sites. This command will return a single position, which corresponds to the most 5’ and/or the most 3’ intronic region. If the gtf file does not contain exon numbering you can compute it using the add_exon_nb command. The score column of the bed file contains the number of the closest exon relative to the splice site.

Example: Number the exons with add_exon_nb, then compute the splice sites.

$ gtftk add_exon_nb -i simple.gtf -k exon_nbr | gtftk splicing_site -k exon_nbr | head
chr1	54	55	acceptor|G0003T001E001|G0003T001|G0003	2	-
chr1	55	56	donor|G0003T001E002|G0003T001|G0003	1	-
chr1	68	69	donor|G0004T002E001|G0004T002|G0004	1	+
chr1	71	72	donor|G0004T002E002|G0004T002|G0004	2	+
chr1	69	70	acceptor|G0004T002E002|G0004T002|G0004	2	+
chr1	72	73	acceptor|G0004T002E003|G0004T002|G0004	3	+
chr1	68	69	donor|G0004T001E001|G0004T001|G0004	1	+
chr1	71	72	donor|G0004T001E002|G0004T001|G0004	2	+
chr1	69	70	acceptor|G0004T001E002|G0004T001|G0004	2	+
chr1	72	73	acceptor|G0004T001E003|G0004T001|G0004	3	+
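
The positions above are consistent with taking the first and last base of each intron in transcript orientation. The sketch below illustrates that reading; it is not the gtftk code, and the function name `splice_sites` is hypothetical. The two intron intervals used (68 to 70 on the plus strand, 54 to 56 on the minus strand) are taken from the intronic example earlier in this section.

```python
def splice_sites(intron_start, intron_end, strand):
    """Return donor and acceptor positions of an intron as one-base
    BED intervals.

    The intron is given as a 0-based half-open interval. The donor is
    the most 5' intronic base and the acceptor the most 3' intronic
    base, both relative to the strand of the transcript.
    """
    first = (intron_start, intron_start + 1)
    last = (intron_end - 1, intron_end)
    if strand == "+":
        return {"donor": first, "acceptor": last}
    # On the minus strand the 5' end has the highest coordinate.
    return {"donor": last, "acceptor": first}


print(splice_sites(68, 70, "+"))  # {'donor': (68, 69), 'acceptor': (69, 70)}
print(splice_sites(54, 56, "-"))  # {'donor': (55, 56), 'acceptor': (54, 55)}
```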

Arguments:

$ gtftk splicing_site -h
  Usage: gtftk splicing_site [-i GTF] [-o BED] [-k exon_numbering_key] [-n NAME] [-s SEP] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Compute the locations of donor and acceptor splice sites.

  Notes:
     *  This will return a single position, which corresponds to the most 5' and/or the most 3'
     intronic region. If the gtf file does not contain exon numbering you can compute it using the
     add_exon_nb command. The score column of the bed file contains the number of the closest exon
     relative to the splice site.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -k, --exon-numbering-key     The name of the key containing the exon numbering (exon_number in ensembl) (default: exon_number)
 -n, --names                  The key(s) that should be used as name. (default: exon_id,transcript_id,gene_id)
 -s, --separator              The separator to be used for separating name elements (see -n). (default: |)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

shift

Description: Shift coordinates in 3’ or 5’ direction.

Example: Shift all features by -10 and compare the first line before and after.

$ gtftk get_example | head -n 1
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
$ gtftk shift -i simple.gtf -s -10 -c simple.chromInfo | head -n 1
chr1	gtftk	gene	115	128	.	+	.	gene_id "G0001";
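
In the example above, a shift value of -10 moves both the start and the end 10 bases lower. The sketch below reproduces that arithmetic for the non-stranded case and shows one possible reading of the boundary handling. It is an assumption-laden illustration, not the gtftk implementation: the function name `shift_feature`, the clamping rule and the chromosome size of 300 are all hypothetical.

```python
def shift_feature(start, end, value, chrom_size, allow_outside=False):
    """Shift a feature (1-based inclusive coordinates) by `value` bases.

    Assumed behaviour:
    - by default, coordinates are clamped to [1, chrom_size], so features
      pushed too far pile up at the chromosome end;
    - with allow_outside, a feature is trimmed at the boundary and is
      dropped (None) once it lies entirely outside the chromosome.
    """
    new_start, new_end = start + value, end + value
    if allow_outside:
        if new_end < 1 or new_start > chrom_size:
            return None  # the feature has completely disappeared
        return (max(new_start, 1), min(new_end, chrom_size))
    # Default: keep every coordinate inside the chromosome.
    return (min(max(new_start, 1), chrom_size),
            min(max(new_end, 1), chrom_size))


print(shift_feature(125, 138, -10, 300))                  # (115, 128)
print(shift_feature(5, 8, -10, 300))                      # (1, 1)
print(shift_feature(5, 8, -10, 300, allow_outside=True))  # None
```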

Arguments:

$ gtftk shift -h
  Usage: gtftk shift [-i GTF] [-o GTF] -s shift_value [-d] [-a] -c CHROMINFO [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Transpose coordinates in 3' or 5' direction.

  Notes:
     *  chrom-info may also accept 'mm8', 'mm9', 'mm10', 'hg19', 'hg38', 'rn3' or 'rn4'. In this
     case the corresponding sizes of conventional chromosomes are used. To get the sizes of the
     chromosomes in Ensembl format (without chr prefix), use 'mm8_ens', 'mm9_ens', 'mm10_ens',
     'hg19_ens', 'hg38_ens', 'rn3_ens' or 'rn4_ens'. ChrM is not used.
     *  By default shift is not strand specific. Meaning that if --shift-value is set to 10, all
     coordinates will be moved 10 bases in 5' direction relative to the forward/watson/plus/top
     strand.
     *  Use a negative value to shift in 3' direction, a positive value to shift in 5' direction.
     *  If --stranded is true, features are transposed in 5' direction relative to their associated
     strand.
     *  By default, features are not allowed to go outside the genome coordinates. In the current
     implementation, in case this would happen (using a very large --shift-value), features would
     accumulate at the ends of chromosomes irrespective of gene or transcript structures giving
     rise, ultimately, to several exons from the same transcript having the same starts or ends.
     *  One can force features to go outside the genome and ultimately disappear with a large
     --shift-value by using -a.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -s, --shift-value            Shift coordinate by s nucleotides. (default: 0)
 -d, --stranded               By default shift is not strand specific. If set, features are shifted relative to their associated strand. (default: False)
 -a, --allow-outside          Accept the partial or total disappearance of a feature upon shifting. (default: False)
 -c, --chrom-info             Tabulated two-columns file. Chromosomes as column 1 and sizes as column 2 (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)