Commands from section ‘conversion’
In this section we will require the following datasets:
$ gtftk get_example -q -d simple -f '*'
convert
Description: This command converts a GTF file to another format. Only the following BED variants are currently supported:
bed: classical bed6 format.
bed6: classical bed6 format.
bed3: bed3 format.
Example: Get the gene features and convert them to bed6.
$ gtftk select_by_key -i simple.gtf -k feature -v gene | gtftk convert -n gene_id -f bed6 | head -n 3
chr1 124 138 G0001 . +
chr1 179 189 G0002 . +
chr1 49 61 G0003 . -
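The coordinate shift in this output (the gene G0001 starts at 125 in the GTF but at 124 in the BED) comes from GTF being 1-based and BED being 0-based for start positions. The following is a minimal Python sketch of that conversion for a single line. It is not the gtftk implementation: the helper names `parse_attributes` and `gtf_line_to_bed6` are hypothetical, and the use of '?' for an undefined key is an assumption based on the bed_to_gtf example output in this section.

```python
import re


def parse_attributes(field):
    """Parse a GTF attribute column such as 'gene_id "G0001";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', field))


def gtf_line_to_bed6(line, names=("gene_id",), sep="|"):
    """Convert one tab-separated GTF line to a BED6 line (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand = fields[0], fields[3], fields[4], fields[6]
    attributes = parse_attributes(fields[8])
    # Assumption: keys that are not defined on this line are written as '?'.
    name = sep.join(attributes.get(key, "?") for key in names)
    # GTF starts are 1-based, BED starts are 0-based; the end is unchanged.
    return "\t".join([chrom, str(int(start) - 1), end, name, ".", strand])


gene = 'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
print(gtf_line_to_bed6(gene))
```

The `names` and `sep` parameters mirror the role of -n and -s: several keys can be joined into one name column.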
Arguments:
$ gtftk convert -h
Usage: gtftk convert [-i GTF] [-o BED/BED3/BED6] [-n NAME] [-s SEP] [-m more_names] [-f {bed,bed3,bed6}] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Convert a GTF to various format (still limited).
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-n, --names The key(s) that should be used as name. (default: gene_id,transcript_id)
-s, --separator The separator to be used for separating name elements (see -n). (default: |)
-m, --more-names Add this information to the 'name' column of the BED file. (default: )
-f, --format Currently one of bed3, bed6 (default: bed6)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
tabulate
Description: Extract key/values from the GTF and convert them to tabulated format. Requested coordinates are reported in 1-based format.
Example: List each transcript together with its gene.
$ gtftk select_by_key -i simple.gtf -k feature -v transcript | gtftk tabulate -k gene_id,transcript_id -s "|"
gene_id|transcript_id
G0001|G0001T002
G0001|G0001T001
G0002|G0002T001
G0003|G0003T001
G0004|G0004T002
G0004|G0004T001
G0005|G0005T001
G0006|G0006T001
G0006|G0006T002
G0007|G0007T001
G0007|G0007T002
G0008|G0008T001
G0009|G0009T002
G0009|G0009T001
G0010|G0010T001
Warning
By default, tabulate discards any line for which one of the selected keys is not defined. Use -x (--accept-undef) to print them.
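The filtering rule described in this warning can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the gtftk code: the function name `tabulate` and its parameters are hypothetical, only keys from the attribute column are handled (not basic columns such as seqid or start), and undefined keys are printed as '?' as suggested by the help text for -x.

```python
import re


def tabulate(gtf_lines, keys, sep="\t", accept_undef=False):
    """Return a header row plus one row per GTF line for the requested keys."""
    rows = [sep.join(keys)]
    for line in gtf_lines:
        # The ninth column holds the key/value attributes.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', line.split("\t")[8]))
        if not accept_undef and any(key not in attrs for key in keys):
            continue  # default behaviour: drop lines with an undefined key
        rows.append(sep.join(attrs.get(key, "?") for key in keys))
    return rows


lines = [
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";',
    'chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T002";',
]
for row in tabulate(lines, ["gene_id", "transcript_id"], sep="|"):
    print(row)
```

With the default settings the gene line is dropped because it has no transcript_id; passing `accept_undef=True` keeps it.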
Arguments:
$ gtftk tabulate -h
Usage: gtftk tabulate [-i GTF] [-o TXT] [-s SEPARATOR] [-k KEY,KEY...] [-u] [-H] [-n] [-x] [-b] [-t | -g | -a | -e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Convert a GTF to tabulated format.
Notes:
* Warning: by default tabulate will discard any line for which one of the selected key is not
defined. Use -x (--accept-undef) to print them.
* To refer to default keys use: seqid,source,feature,start,end,frame,gene_id...
* Note that 'all' or '*' are special keys that can be used to convert the whole GTF into a
tabulated file. Thanks @fafa13.
optional arguments:
-t, --select-transcript-ids A shortcuts for "-k transcript_id". (default: False)
-g, --select-gene_ids A shortcuts for "-k gene_id". (default: False)
-a, --select-gene-names A shortcuts for "-k gene_name". (default: False)
-e, --select-exon-ids A shortcuts for "-k exon_ids". (default: False)
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-s, --separator The output field separator. (default: )
-k, --key A comma-separated list of key names. (default: *)
-u, --unique Print a non redondant list of lines. (default: False)
-H, --no-header Don't print the header line. (default: False)
-n, --no-unset Don't print lines containing '.' (unset values) (default: False)
-x, --accept-undef Print line for which the key is undefined (i.e, '?', does not exists). (default: False)
-b, --no-basic In case key is set to 'all' or '*', don't write basic attributes. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
bed_to_gtf
Description: Convert a BED file to a GTF-like format.
Example:
$ gtftk convert -i simple.gtf | gtftk bed_to_gtf -t transcript | head -n 5
chr1 Unknown transcript 125 138 . + . gene_id "G0001|?"; transcript_id "G0001|?";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 130 . + . gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1 Unknown transcript 125 138 . + . gene_id "G0001|G0001T001"; transcript_id "G0001|G0001T001";
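The first output line above can be reproduced with a small Python sketch of the reverse conversion. This is an illustration built from the example output only, not the gtftk implementation: the function name `bed6_to_gtf_line` is hypothetical, and carrying the BED score over unchanged is an assumption.

```python
def bed6_to_gtf_line(line, ft_type="transcript", source="Unknown"):
    """Turn one tab-separated BED6 line into a GTF-like line (illustrative only)."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    # The BED name is reused for both identifiers, as in the example output.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    # BED starts are 0-based, GTF starts are 1-based; the end is unchanged.
    # Assumption: the score is copied as-is and the frame is left unset ('.').
    return "\t".join(
        [chrom, source, ft_type, str(int(start) + 1), end, score, strand, ".", attributes]
    )


print(bed6_to_gtf_line("chr1\t124\t138\tG0001|?\t.\t+"))
```

The name 'G0001|?' arises because convert joins gene_id and transcript_id by default, and a gene feature has no transcript_id.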
Arguments:
$ gtftk bed_to_gtf -h
Usage: gtftk bed_to_gtf [-i BED] [-o GTF] [-t ft_type] [-s source] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Convert a bed file to a gtf. This will make the poor bed feel as if it was a big/fat gtf (but with
lots of empty fields...sniff). May be helpful sometimes...
Arguments:
-i, --inputfile Path to the poor BED file to would like to behave as if it was a GTF. (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-t, --ft-type The type of features you are trying to mimic... (default: transcript)
-s, --source The source of annotation. (default: Unknown)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
convert_ensembl
Description: Convert the GTF file to Ensembl format, essentially by adding ‘transcript’ and ‘gene’ features.
Example: Delete the gene and transcript features, then regenerate them.
$ gtftk select_by_key -i simple.gtf -k feature -v gene,transcript -n | gtftk convert_ensembl | gtftk select_by_key -k gene_id -v G0001
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk CDS 125 130 . + . gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1 gtftk CDS 130 132 . + . gene_id "G0001"; transcript_id "G0001T001"; ccds_id "CDS_G0001T001";
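One plausible way to regenerate the missing features is to take, for each transcript_id or gene_id, the smallest start and largest end among its exons. The Python sketch below shows that idea on the two G0001 exons from the output above. It is an assumption about the approach, not a description of how gtftk computes these features, and the helper name `spans_by_key` is hypothetical.

```python
def spans_by_key(features, key):
    """Return {id: (chrom, min_start, max_end, strand)} for features sharing `key`."""
    spans = {}
    for feat in features:
        ident = feat[key]
        if ident in spans:
            chrom, start, end, strand = spans[ident]
            # Extend the span so that it covers every feature seen so far.
            spans[ident] = (chrom, min(start, feat["start"]), max(end, feat["end"]), strand)
        else:
            spans[ident] = (feat["chrom"], feat["start"], feat["end"], feat["strand"])
    return spans


exons = [
    {"chrom": "chr1", "start": 125, "end": 138, "strand": "+",
     "gene_id": "G0001", "transcript_id": "G0001T002"},
    {"chrom": "chr1", "start": 125, "end": 138, "strand": "+",
     "gene_id": "G0001", "transcript_id": "G0001T001"},
]
transcripts = spans_by_key(exons, "transcript_id")  # one span per transcript
genes = spans_by_key(exons, "gene_id")              # one span per gene
print(transcripts)
print(genes)
```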
Arguments:
$ gtftk convert_ensembl -h