Warning about supported GTF file formats
Warning
Most commands of the gtftk suite are designed to handle files in Ensembl GTF format and thus require transcript and gene features (lines) in the GTF. Every line must contain a transcript_id and a gene_id value, except gene features, which should contain only the gene_id (see the get_example command for an example). Transcript and gene lines are used whenever transcript or gene coordinates are needed. This convention was chosen to define a reference GTF file format for (py)gtftk, since the Ensembl format is probably the most widely used.
You can use the convert_ensembl subcommand to convert a non-Ensembl (or old Ensembl) GTF file to the current Ensembl format.
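The rule stated in the warning can be checked before running any gtftk command. The following is a minimal sketch, not part of gtftk: the function name `find_format_problems` is hypothetical, and the code assumes a tab-separated GTF whose 9th column uses the usual `key "value";` syntax.

```python
import re

# Matches one `key "value";` pair in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def find_format_problems(lines):
    """Return (line_number, message) tuples for lines that break the rule.

    Hypothetical helper, not a gtftk function.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # comments and blank lines carry no feature
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        feature = fields[2]
        keys = {key for key, _ in ATTR_RE.findall(fields[8])}
        if "gene_id" not in keys:
            problems.append((number, "missing gene_id"))
        if feature == "gene" and "transcript_id" in keys:
            problems.append((number, "gene line should not carry transcript_id"))
        if feature != "gene" and "transcript_id" not in keys:
            problems.append((number, "missing transcript_id"))
    return problems


example = [
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";',
    'chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T001";',
    # Deliberately broken: an exon line without transcript_id.
    'chr1\tgtftk\texon\t125\t138\t.\t+\t.\tgene_id "G0001";',
]
print(find_format_problems(example))  # [(3, 'missing transcript_id')]
```

A file that produces no problems here is not guaranteed to be accepted by every gtftk command; the check only covers the transcript_id and gene_id requirement described above.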
Below is an example in which we first select only exon features, then use convert_ensembl to regenerate the gene and transcript features.
$ gtftk get_example | gtftk select_by_key -k feature -v exon | head -n 10
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1 gtftk exon 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1 gtftk exon 50 54 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1 gtftk exon 57 61 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
$ gtftk get_example | gtftk select_by_key -k feature -v exon | gtftk convert_ensembl | head -n 10
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk exon 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1 gtftk gene 50 61 . - . gene_id "G0003";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
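The coordinates of the regenerated lines in the output above are consistent with taking, for each transcript_id and gene_id, the smallest exon start and the largest exon end. The sketch below illustrates that idea only; it is an assumption about the principle, not the actual convert_ensembl implementation, and `spans_from_exons` is a hypothetical name. It expects tab-separated exon lines that carry both IDs.

```python
import re

# Matches one `key "value";` pair in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def spans_from_exons(lines):
    """Return (transcript_spans, gene_spans), each mapping an ID to (start, end).

    Hypothetical helper: merges exon coordinates per transcript_id and gene_id.
    """
    transcripts, genes = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon lines contribute coordinates here
        start, end = int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        for table, key in ((transcripts, "transcript_id"), (genes, "gene_id")):
            ident = attrs[key]  # raises KeyError if the ID is missing
            low, high = table.get(ident, (start, end))
            table[ident] = (min(low, start), max(high, end))
    return transcripts, genes


exons = [
    'chr1\tgtftk\texon\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";',
    'chr1\tgtftk\texon\t50\t54\t.\t-\t.\tgene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";',
    'chr1\tgtftk\texon\t57\t61\t.\t-\t.\tgene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";',
]
transcript_spans, gene_spans = spans_from_exons(exons)
print(transcript_spans["G0003T001"])  # (50, 61)
print(gene_spans["G0003"])            # (50, 61)
```

The two exons of G0003T001 (50-54 and 57-61) yield the span 50-61, which matches the gene and transcript lines for G0003 in the convert_ensembl output.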
Arguments:
$ gtftk convert_ensembl -h
Usage: gtftk convert_ensembl [-i GTF] [-o GTF] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Convert the GTF file to ensembl format. It will essentially add a 'transcript' feature and 'gene'
feature when required. This command can be viewed as a 'groomer' command for those starting
with a non ensembl GTF.
Notes:
* The gtftk program is designed to handle files in ensembl GTF format. This means that the GTF
file provided to gtftk must contain transcript and gene feature/lines. They will be used to get
access to transcript and gene coordinates whenever needed. This solution was chosen to define a
reference GTF file format for gtftk (since Ensembl format is probably the most widely used).
* Almost all commands of gtftk use transcript_id or gene_id as keys to perform operation on
genomic coordinates. One of the most common issue when working with gene coordinates is the
lack of non ambiguous gene or transcript names For instance, a refSeq sequence ID used as
transcript_id can be associated to several chromosomal locations as a sequence may be
duplicated. These identifiers are ambiguous and thus should be avoid. Use UCSC or ensembl IDs
instead.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-n, --no-check-gene-chr By default the command raise an error if several chromosomes are associated with the same gene_id. Disable this behaviour (but you should better think about what it means...). (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
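The situation guarded by -n/--no-check-gene-chr (one gene_id associated with several chromosomes) can be looked for before conversion. This is a minimal sketch under stated assumptions: `genes_on_several_chromosomes` is a hypothetical helper, the input is tab-separated, and the two input lines are made-up data used only to trigger the case. It does not reproduce how gtftk performs the check.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]*)";')


def genes_on_several_chromosomes(lines):
    """Map each gene_id seen on more than one chromosome to its sorted chromosomes.

    Hypothetical helper, not a gtftk function.
    """
    chromosomes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip comments, blank lines and malformed lines
        match = GENE_ID_RE.search(fields[8])
        if match:
            chromosomes[match.group(1)].add(fields[0])
    return {
        gene: sorted(names)
        for gene, names in chromosomes.items()
        if len(names) > 1
    }


# Made-up input: the same gene_id appears on chr1 and chr2.
ambiguous = [
    'chr1\tgtftk\texon\t125\t138\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T001";',
    'chr2\tgtftk\texon\t300\t320\t.\t+\t.\tgene_id "G0001"; transcript_id "G0001T001";',
]
print(genes_on_several_chromosomes(ambiguous))  # {'G0001': ['chr1', 'chr2']}
```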
Note
Any comment line (i.e. starting with #) or empty line in the GTF file will be ignored (discarded) by gtftk.
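If you pre-process a GTF file yourself, the same filtering is easy to mirror. The sketch below is only an illustration of the rule in the note, with a hypothetical function name; it treats a line as empty when nothing remains after removing the line ending, which is an assumption about how strictly "empty" is meant.

```python
def feature_lines(lines):
    """Yield lines that are neither comments (starting with #) nor empty.

    Hypothetical helper, not a gtftk function.
    """
    for line in lines:
        text = line.rstrip("\r\n")
        if text == "" or text.startswith("#"):
            continue
        yield text


raw = [
    "# a header comment\n",
    "\n",
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";\n',
]
kept = list(feature_lines(raw))
print(len(kept))  # 1
```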
Naming conventions
Note
We will use the terms attribute or key for any descriptor found in the 9th column (e.g. transcript_id) and the term value for its associated string (e.g. “NM_334567”). The first eight columns of the GTF file (chrom/seqid, source, type, start, end, score, strand, frame) will be referred to as basic attributes. In the example below, gene_id is the attribute and ‘G0001’ is the associated value.
$ gtftk get_example | gtftk select_by_key -k feature -v gene | head -n 1
chr1 gtftk gene 125 138 . + . gene_id "G0001";
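To make the vocabulary concrete, the sketch below splits one GTF line into its basic attributes (first eight columns) and its attributes (key/value pairs of the 9th column). The function name and the dictionary keys are choices made for this illustration, not gtftk identifiers; the third column is labelled `feature` here because the commands above select on it with `-k feature`. The line is assumed to be tab-separated.

```python
import re

# Labels chosen for this sketch for the first eight columns.
BASIC_ATTRIBUTES = (
    "seqid", "source", "feature", "start", "end", "score", "strand", "frame",
)
# Matches one `key "value";` pair in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Return (basic_attributes, attributes) for one tab-separated GTF line.

    Hypothetical helper, not a gtftk function.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    basic = dict(zip(BASIC_ATTRIBUTES, fields[:8]))
    attributes = dict(ATTR_RE.findall(fields[8]))
    return basic, attributes


basic, attributes = parse_gtf_line(
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
)
print(basic["feature"])  # gene
print(attributes)        # {'gene_id': 'G0001'}
```

All values are kept as strings, so start and end need an explicit `int()` conversion before any arithmetic.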