Warning about supported GTF file formats
===============================================

.. warning:: Most commands of the gtftk suite are designed to handle files in **Ensembl** GTF format and thus require **transcript and gene features/lines** in the GTF. All lines must contain a transcript_id and a gene_id value, except **gene features**, which should contain only the gene_id (**see the get_example command for an example**). Transcript and gene lines are used, when required, to access transcript and gene coordinates. This convention was chosen to define a reference GTF file format for (py)gtftk, since the Ensembl format is probably the most widely used.

You can use the **convert_ensembl** subcommand to convert a GTF file that is not in Ensembl format (or that uses an older Ensembl format) to the current Ensembl format.
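
To get a feel for the convention described in the warning above, the following Python sketch checks a single GTF line against it. This is only an illustration and not part of gtftk: the function names ``parse_attributes`` and ``follows_convention`` are hypothetical, the rule for gene lines is read here as "a gene_id but no transcript_id", and only quoted attribute values are recognised.

```python
import re

# Hypothetical helpers for illustration; they are not part of the gtftk API.
# Matches pairs such as: gene_id "G0001"  (quoted values only, an assumption).
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(column9):
    """Return the key/value pairs found in the 9th GTF column as a dict."""
    return dict(ATTR_RE.findall(column9))


def follows_convention(line):
    """Return True if a GTF line carries the identifiers expected for its feature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # A GTF record must have exactly nine tab-separated columns.
        return False
    feature = fields[2]
    attrs = parse_attributes(fields[8])
    if feature == "gene":
        # Assumed reading of the rule: gene lines have gene_id, no transcript_id.
        return "gene_id" in attrs and "transcript_id" not in attrs
    # Every other feature needs both identifiers.
    return "gene_id" in attrs and "transcript_id" in attrs


gene_line = 'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
exon_line = ('chr1\tgtftk\texon\t125\t138\t.\t+\t.\t'
             'gene_id "G0001"; transcript_id "G0001T001";')
print(follows_convention(gene_line), follows_convention(exon_line))
```

A file in which some lines fail such a check is a likely candidate for **convert_ensembl**.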


In the example below, we first select only the exon features, then use **convert_ensembl** to regenerate the gene and transcript features.

.. command-output:: gtftk get_example | gtftk select_by_key -k feature  -v exon | head -n 10
	:shell:


.. command-output:: gtftk get_example | gtftk select_by_key -k feature  -v exon | gtftk  convert_ensembl | head -n 10
	:shell:

**Arguments:**

.. command-output:: gtftk convert_ensembl -h
	:shell:


.. note:: Any comment line (*i.e.* starting with #) or empty line in the GTF file will be ignored (discarded) by gtftk.
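
The filtering described in the note above can be sketched in a few lines of Python. This is a minimal illustration of the behaviour, not gtftk's actual implementation; the generator name ``iter_records`` is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def iter_records(lines):
    """Yield GTF records, skipping comment lines and empty lines."""
    for line in lines:
        stripped = line.strip()
        # Assumption: lines containing only whitespace count as empty.
        if not stripped or stripped.startswith("#"):
            continue
        yield line.rstrip("\n")


example = [
    "#!genome-build example\n",
    "\n",
    'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";\n',
]
records = list(iter_records(example))
print(len(records))
```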


Naming conventions
----------------------

.. note:: We will use the terms **attribute** or **key** for any descriptor found in the 9th column (*e.g.* transcript_id) and the term **value** for its associated string (*e.g.* "NM_334567"). The first eight columns of the GTF file (chrom/seqid, source, type, start, end, score, strand, frame) will be referred to as **basic attributes**. In the example below, gene_id is the attribute and 'G0001' is the associated value.

.. command-output:: gtftk get_example| gtftk select_by_key -k feature -v gene| head -1
	:shell:
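
To make the terminology concrete, here is a small Python sketch that splits one GTF line into its basic attributes (the first eight columns) and the key/value pairs of the 9th column. It is illustrative only and not taken from gtftk: the name ``parse_gtf_line`` and the column labels in ``BASIC_ATTRIBUTES`` are hypothetical, and only quoted values are recognised.

```python
import re

# Labels chosen for illustration; they are not names mandated by gtftk.
BASIC_ATTRIBUTES = ("seqid", "source", "feature", "start",
                    "end", "score", "strand", "frame")

# Matches key/value pairs such as: gene_id "G0001"  (quoted values only).
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Split a GTF line into (basic attributes, 9th-column attributes)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF record must have nine tab-separated columns")
    basic = dict(zip(BASIC_ATTRIBUTES, fields[:8]))
    attributes = dict(ATTR_RE.findall(fields[8]))
    return basic, attributes


line = 'chr1\tgtftk\tgene\t125\t138\t.\t+\t.\tgene_id "G0001";'
basic, attributes = parse_gtf_line(line)
print(basic["feature"])        # basic attribute (3rd column)
print(attributes["gene_id"])   # value associated with the gene_id key
```

Here ``gene_id`` is the attribute (key) and ``G0001`` is its value, while ``feature``, ``start`` and the other first eight columns are basic attributes.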