Warning about supported GTF file formats

Warning

Most commands of the gtftk suite are designed to handle files in Ensembl GTF format and therefore require transcript and gene features (lines) in the GTF. Every line must contain a transcript_id and a gene_id value, except gene features, which should contain only the gene_id (see the get_example command for an example). Transcript and gene lines are used whenever transcript or gene coordinates are needed. This choice defines a reference GTF file format for (py)gtftk, since the Ensembl format is probably the most widely used.
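The requirement above can be checked before running any command. The following is a minimal sketch in plain Python, not part of gtftk: the function name check_line and the regular expression are hypothetical, and "only the gene_id" is read here as "a gene line carries no transcript_id".

```python
import re

# Matches one `key "value";` pair in the 9th GTF column (illustrative pattern).
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def check_line(line):
    """Return a list of problems found on one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated columns, got {len(fields)}"]
    feature = fields[2]
    keys = {key for key, _ in ATTR_RE.findall(fields[8])}
    problems = []
    if "gene_id" not in keys:
        problems.append("missing gene_id")
    if feature == "gene":
        # Assumption: a gene line should not carry a transcript_id.
        if "transcript_id" in keys:
            problems.append("gene line should not carry transcript_id")
    elif "transcript_id" not in keys:
        problems.append("missing transcript_id")
    return problems
```

An empty list means the line satisfies the two rules stated above; it says nothing about whether the file as a whole contains the required gene and transcript lines.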

You can use the convert_ensembl subcommand to convert a non-Ensembl (or old Ensembl) GTF file to the current Ensembl format.

In the example below, we first select only the exon features, then use convert_ensembl to regenerate the gene and transcript features.

$ gtftk get_example | gtftk select_by_key -k feature  -v exon | head -n 10
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1	gtftk	exon	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1	gtftk	exon	50	54	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1	gtftk	exon	57	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
$ gtftk get_example | gtftk select_by_key -k feature  -v exon | gtftk  convert_ensembl | head -n 10
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	exon	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
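The output above suggests that each regenerated transcript or gene line spans from the smallest start to the largest end of its exons (gene G0003 covers 50 to 61, from exons 50-54 and 57-61). The sketch below illustrates that idea in plain Python. It is not the implementation used by convert_ensembl; the function name span_by_id and the tuple layout are hypothetical.

```python
def span_by_id(exons):
    """Compute transcript and gene spans from exon records.

    exons: iterable of (chrom, start, end, strand, gene_id, transcript_id).
    Returns two dicts mapping an id to (chrom, start, end, strand).
    """
    transcripts, genes = {}, {}
    for chrom, start, end, strand, gene_id, transcript_id in exons:
        for table, key in ((transcripts, transcript_id), (genes, gene_id)):
            if key in table:
                # Widen the existing span to include this exon.
                _, known_start, known_end, _ = table[key]
                table[key] = (chrom, min(known_start, start),
                              max(known_end, end), strand)
            else:
                table[key] = (chrom, start, end, strand)
    return transcripts, genes


# Exons of gene G0003 taken from the example above.
exons = [
    ("chr1", 50, 54, "-", "G0003", "G0003T001"),
    ("chr1", 57, 61, "-", "G0003", "G0003T001"),
]
transcripts, genes = span_by_id(exons)
print(genes["G0003"])            # ('chr1', 50, 61, '-')
print(transcripts["G0003T001"])  # ('chr1', 50, 61, '-')
```

The sketch assumes all exons of a gene lie on a single chromosome; the help text below describes how convert_ensembl treats a gene_id associated with several chromosomes.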

Arguments:

$ gtftk convert_ensembl -h
  Usage: gtftk convert_ensembl [-i GTF] [-o GTF] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Convert the GTF file to ensembl format. It will essentially add a 'transcript' feature and 'gene'
     feature when required. This command can be viewed as a 'groomer' command for those starting
     with a non ensembl GTF.

  Notes:
     *  The gtftk program is designed to handle files in ensembl GTF format. This means that the GTF
     file provided to gtftk must contain transcript and gene feature/lines. They will be used to get
     access to transcript and gene coordinates whenever needed. This solution was chosen to define a
     reference GTF file format for gtftk (since Ensembl format is probably the most widely used).
     *  Almost all commands of gtftk use transcript_id or gene_id as keys to perform operation on
     genomic coordinates. One of the most common issue when working with  gene coordinates is the
     lack  of non ambiguous gene or transcript names For instance, a refSeq sequence ID used as
     transcript_id can be associated to  several chromosomal locations as a sequence may be
     duplicated. These identifiers are ambiguous and thus should be avoid. Use UCSC or ensembl IDs
     instead.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -n, --no-check-gene-chr      By default the command raise an error if several chromosomes are associated with the same gene_id. Disable this behaviour (but you should better think about what it means...). (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

Note

Any comment line (i.e. starting with #) or empty line in the GTF file will be ignored (discarded) by gtftk.
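If you read a GTF file with your own script, the same filtering is easy to reproduce. This is a minimal sketch in plain Python, not gtftk code; the function name data_lines is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def data_lines(lines):
    """Yield GTF records, skipping comment lines and empty lines."""
    for line in lines:
        # Assumption: a line holding only whitespace counts as empty.
        if line.startswith("#") or not line.strip():
            continue
        yield line.rstrip("\n")
```

For instance, list(data_lines(open("my.gtf"))) would return only the feature lines of a hypothetical file my.gtf.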

Naming conventions

Note

We will use the terms attribute or key for any descriptor found in the 9th column (e.g. transcript_id) and the term value for its associated string (e.g. “NM_334567”). The first eight columns of the GTF file (chrom/seqid, source, type, start, end, score, strand, frame) will be referred to as basic attributes. In the example below, gene_id is the attribute and ‘G0001’ is the associated value.

$ gtftk get_example| gtftk select_by_key -k feature -v gene| head -1
chr1	gtftk	gene	125	138	.	+	.	gene_id "G0001";
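To make the vocabulary concrete, the sketch below splits one GTF line into its basic attributes and its key/value pairs. It is plain Python rather than the (py)gtftk API; the function name parse_gtf_line, the column labels in BASIC and the regular expression are chosen for illustration only.

```python
import re

# Labels for the first eight columns (illustrative names).
BASIC = ("seqid", "source", "feature", "start", "end", "score", "strand", "frame")

# Matches one `key "value";` pair in the 9th column.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Return (basic attributes, 9th-column key/value pairs) as two dicts."""
    fields = line.rstrip("\n").split("\t")
    basic = dict(zip(BASIC, fields[:8]))
    attributes = dict(ATTR_RE.findall(fields[8]))
    return basic, attributes


basic, attributes = parse_gtf_line(
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
)
print(basic["feature"], basic["start"], basic["end"])  # gene 125 138
print(attributes)                                      # {'gene_id': 'G0001'}
```

All values are kept as strings; convert start and end to integers if you need to compute on coordinates.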