Commands from section ‘conversion’

In this section we will require the following datasets:

$ gtftk get_example -q -d simple -f '*'

convert

Description: This command converts a GTF file to other formats. Only a few output formats are currently supported:

  • bed: classical bed6 format.

  • bed6: classical bed6 format.

  • bed3: bed3 format.

Example: Get the gene features and convert them to bed6.

$ gtftk select_by_key -i simple.gtf -k feature -v gene | gtftk convert -n gene_id -f bed6 | head -n 3
chr1	124	138	G0001	.	+
chr1	179	189	G0002	.	+
chr1	49	61	G0003	.	-
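
The coordinate shift visible in this output (GTF start 125 becomes BED start 124) can be sketched in a few lines of Python. This is only an illustration of the transformation, not gtftk's implementation: the function names are hypothetical, the attribute parser is deliberately naive, and printing '?' for a key that is absent from the line is an assumption based on the outputs shown elsewhere in this section.

```python
def parse_attributes(field):
    """Parse a GTF attribute field such as 'gene_id "G0001";' into a dict.

    Naive on purpose: values containing ';' or spaces are not handled.
    """
    attrs = {}
    for chunk in field.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def gtf_line_to_bed6(line, names=("gene_id",), sep="|"):
    """Convert one GTF line to a bed6 line (hypothetical helper)."""
    seqid, _source, _feature, start, end, score, strand, _frame, attr = (
        line.rstrip("\n").split("\t")
    )
    attrs = parse_attributes(attr)
    # Assumption: a key that is absent from the line is written as '?'.
    name = sep.join(attrs.get(key, "?") for key in names)
    # GTF starts are 1-based, BED starts are 0-based: subtract 1.
    # The end coordinate is unchanged because BED ends are exclusive.
    return "\t".join([seqid, str(int(start) - 1), end, name, score, strand])


gene = 'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
print(gtf_line_to_bed6(gene))
```

Passing names=("gene_id", "transcript_id") mimics the default value of -n: the name elements are joined with the separator given by -s.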

Arguments:

$ gtftk convert -h
  Usage: gtftk convert [-i GTF] [-o BED/BED3/BED6] [-n NAME] [-s SEP] [-m more_names] [-f {bed,bed3,bed6}] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Convert a GTF to various format (still limited).

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN. (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -n, --names                  The key(s) that should be used as name. (default: gene_id,transcript_id)
 -s, --separator              The separator to be used for separating name elements (see -n). (default: |)
 -m, --more-names             Add this information to the 'name' column of the BED file. (default: )
 -f, --format                 Currently one of bed3, bed6 (default: bed6)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

tabulate

Description: Extract key/value pairs from the GTF and write them in tabulated format. Coordinates, when requested, are 1-based.

Example: Simply get the list of transcripts and genes.

$ gtftk select_by_key -i simple.gtf -k feature -v transcript | gtftk tabulate -k gene_id,transcript_id -s "|"
gene_id|transcript_id
G0001|G0001T002
G0001|G0001T001
G0002|G0002T001
G0003|G0003T001
G0004|G0004T002
G0004|G0004T001
G0005|G0005T001
G0006|G0006T001
G0006|G0006T002
G0007|G0007T001
G0007|G0007T002
G0008|G0008T001
G0009|G0009T002
G0009|G0009T001
G0010|G0010T001

Warning

By default tabulate will discard any line for which one of the selected keys is not defined. Use -x (--accept-undef) to print them.
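
The effect of this default can be illustrated with a minimal Python sketch. It is not gtftk's code: the function name, the list-of-dicts input and the accept_undef parameter are hypothetical stand-ins, and the '?' placeholder for an undefined key follows the help text of -x.

```python
def tabulate(records, keys, sep="\t", accept_undef=False):
    """Return a header line plus one line per record (hypothetical helper).

    Records missing any requested key are skipped unless accept_undef
    is True, in which case the missing value is written as '?'.
    """
    lines = [sep.join(keys)]
    for rec in records:
        undefined = any(key not in rec for key in keys)
        if undefined and not accept_undef:
            continue  # default behaviour: drop the line
        lines.append(sep.join(rec.get(key, "?") for key in keys))
    return lines


records = [
    # A gene feature carries no transcript_id.
    {"feature": "gene", "gene_id": "G0001"},
    {"feature": "transcript", "gene_id": "G0001", "transcript_id": "G0001T002"},
]

print(tabulate(records, ["gene_id", "transcript_id"], sep="|"))
print(tabulate(records, ["gene_id", "transcript_id"], sep="|", accept_undef=True))
```

In the first call the gene record is dropped because it has no transcript_id; in the second it is kept and printed as G0001|?.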

Arguments:

$ gtftk tabulate -h
  Usage: gtftk tabulate [-i GTF] [-o TXT] [-s SEPARATOR] [-k KEY,KEY...] [-u] [-H] [-n] [-x] [-b] [-t | -g | -a | -e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Convert a GTF to tabulated format.

  Notes:
     *  Warning: by default tabulate will discard any line for which one of the selected key is not
     defined. Use -x (--accept-undef) to print them.
     *  To refer to default keys use: seqid,source,feature,start,end,frame,gene_id...
     *  Note that 'all' or '*' are special keys that can be used to convert the whole GTF into a
     tabulated file. Thanks @fafa13.

optional arguments:
 -t, --select-transcript-ids  A shortcuts for "-k transcript_id". (default: False)
 -g, --select-gene_ids        A shortcuts for "-k gene_id". (default: False)
 -a, --select-gene-names      A shortcuts for "-k gene_name". (default: False)
 -e, --select-exon-ids        A shortcuts for "-k exon_ids". (default: False)

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -s, --separator              The output field separator. (default: )
 -k, --key                    A comma-separated list of key names. (default: *)
 -u, --unique                 Print a non redondant list of lines. (default: False)
 -H, --no-header              Don't print the header line. (default: False)
 -n, --no-unset               Don't print lines containing '.' (unset values) (default: False)
 -x, --accept-undef           Print line for which the key is undefined (i.e, '?', does not exists). (default: False)
 -b, --no-basic               In case key is set to 'all' or '*', don't write basic attributes. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

bed_to_gtf

Description: Convert a bed file to gtf-like format.

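Example: Convert simple.gtf to bed6 (the default format), then turn the result back into a GTF-like file of transcript features.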

$ gtftk convert -i simple.gtf | gtftk bed_to_gtf -t transcript | head -n 5
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|?"; transcript_id "G0001|?";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	130	.	+	.	gene_id "G0001|G0001T002"; transcript_id "G0001|G0001T002";
chr1	Unknown	transcript	125	138	.	+	.	gene_id "G0001|G0001T001"; transcript_id "G0001|G0001T001";
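
The lines above suggest how each BED record is mapped: the start is shifted back to 1-based, the source defaults to Unknown, and the BED name is reused for both gene_id and transcript_id. A minimal Python sketch of that mapping follows. The function name is hypothetical and the logic is inferred from the example output only, so treat it as an approximation rather than the tool's actual behaviour.

```python
def bed6_line_to_gtf(line, ft_type="transcript", source="Unknown"):
    """Turn one bed6 line into a GTF-like line (hypothetical helper)."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    # Assumption drawn from the example output: the BED name fills
    # both gene_id and transcript_id.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    return "\t".join(
        [
            chrom,
            source,
            ft_type,
            str(int(start) + 1),  # BED start is 0-based, GTF start is 1-based
            end,
            score,
            strand,
            ".",  # frame is unknown for a BED record
            attributes,
        ]
    )


bed = "chr1\t124\t138\tG0001|?\t.\t+"
print(bed6_line_to_gtf(bed))
```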

Arguments:

$ gtftk bed_to_gtf -h
  Usage: gtftk bed_to_gtf [-i BED] [-o GTF] [-t ft_type] [-s source] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Convert a bed file to a gtf. This will make the poor bed feel as if it was a big/fat gtf (but with
     lots of empty fields...sniff). May be helpful sometimes...

Arguments:
 -i, --inputfile              Path to the poor BED file to would like to behave as if it was a GTF. (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -t, --ft-type                The type of features you are trying to mimic... (default: transcript)
 -s, --source                 The source of annotation. (default: Unknown)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

convert_ensembl

Description: Convert the GTF file to Ensembl format, essentially by adding ‘transcript’ and ‘gene’ features.

Example: Delete the gene and transcript features, then regenerate them.

$ gtftk select_by_key -i simple.gtf -k feature -v gene,transcript -n | gtftk convert_ensembl | gtftk select_by_key -k gene_id -v G0001
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	CDS	125	130	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; ccds_id "CDS_G0001T002";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1	gtftk	CDS	130	132	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; ccds_id "CDS_G0001T001";
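
One plausible way to rebuild the missing features is to give each gene and each transcript the smallest start and largest end among its child features. The Python sketch below illustrates that idea on the G0001 records shown above. It is an assumption about the approach, not gtftk's implementation: the function name and the dict-based records are hypothetical, and chromosome and strand are left out for brevity.

```python
def regenerate_parents(features):
    """Build gene and transcript records spanning their children.

    Hypothetical helper: each input dict needs 'gene_id',
    'transcript_id', 'start' and 'end' (1-based, inclusive).
    """
    genes, transcripts = {}, {}
    for feat in features:
        targets = (
            (genes, feat["gene_id"], {"feature": "gene", "gene_id": feat["gene_id"]}),
            (
                transcripts,
                feat["transcript_id"],
                {
                    "feature": "transcript",
                    "gene_id": feat["gene_id"],
                    "transcript_id": feat["transcript_id"],
                },
            ),
        )
        for table, key, template in targets:
            parent = table.setdefault(
                key, dict(template, start=feat["start"], end=feat["end"])
            )
            # Widen the parent so that it covers this child.
            parent["start"] = min(parent["start"], feat["start"])
            parent["end"] = max(parent["end"], feat["end"])
    return list(genes.values()) + list(transcripts.values())


children = [
    {"feature": "exon", "gene_id": "G0001", "transcript_id": "G0001T002", "start": 125, "end": 138},
    {"feature": "CDS", "gene_id": "G0001", "transcript_id": "G0001T002", "start": 125, "end": 130},
    {"feature": "exon", "gene_id": "G0001", "transcript_id": "G0001T001", "start": 125, "end": 138},
    {"feature": "CDS", "gene_id": "G0001", "transcript_id": "G0001T001", "start": 130, "end": 132},
]

for parent in regenerate_parents(children):
    print(parent["feature"], parent["start"], parent["end"])
```

On these four records the sketch yields one gene and two transcripts, all spanning 125 to 138, which agrees with the regenerated lines in the example.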

Arguments:

$ gtftk convert_ensembl -h