Commands from section 'selection'
---------------------------------

In this section we will require the following datasets:


.. command-output:: gtftk get_example -q -d mini_real -f '*'
	:shell:

.. command-output:: gtftk get_example -q -d tiny_real -f '*'
	:shell:

.. command-output:: gtftk get_example -q -d simple -f '*'
	:shell:

select_by_key
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Extract lines from the GTF file based on a key and a list of values.
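
Conceptually, the selection is a membership test on one column or attribute. The following Python sketch is not gtftk's implementation: ``parse_gtf_line`` and ``select_lines`` are hypothetical helpers and the three GTF lines are toy data, shown only to make the key/value logic explicit.

.. code-block:: python

    import re

    # The eight fixed GTF columns that precede the attribute column.
    BASIC_KEYS = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame"]
    # seqid and chrom are accepted as synonyms of seqname.
    SYNONYMS = {"seqid": "seqname", "chrom": "seqname"}
    ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


    def parse_gtf_line(line):
        """Map basic column names and attribute keys to their string values."""
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(BASIC_KEYS, fields[:8]))
        record.update(ATTRIBUTE.findall(fields[8]))
        return record


    def select_lines(lines, key, values):
        """Keep the lines whose value for `key` is in the comma-separated `values`."""
        key = SYNONYMS.get(key, key)
        wanted = set(values.split(","))
        return [line for line in lines if parse_gtf_line(line).get(key) in wanted]


    # Toy data (hypothetical identifiers, not the content of simple.gtf).
    toy_gtf = [
        'chr1\ttoy\tgene\t10\t50\t.\t+\t.\tgene_id "gA";',
        'chr1\ttoy\ttranscript\t10\t50\t.\t+\t.\tgene_id "gA"; transcript_id "gA.1";',
        'chr2\ttoy\tgene\t5\t90\t.\t-\t.\tgene_id "gB";',
    ]
    by_gene = select_lines(toy_gtf, "gene_id", "gB,gC")
    by_chrom = select_lines(toy_gtf, "chrom", "chr1")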


**Example:** Select the lines whose gene_id is G0002, G0003 or G0004.

.. command-output:: gtftk select_by_key -i simple.gtf -k gene_id -v G0002,G0003,G0004
	:shell:

**Example:** Select using basic attributes (chrom, source, feature...). Note that seqid, seqname and chrom are synonymous.

.. command-output:: gtftk select_by_key -i simple.gtf -k feature -v transcript,exon | gtftk select_by_key -k seqname -v chr1
	:shell:


**Arguments:**

.. command-output:: gtftk select_by_key -h
	:shell:

------------------------------------------------------------------------------------------------------------------

select_by_regexp
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select lines by testing the values of a particular key against a regular expression.
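
The effect of the trailing ``$`` anchor can be checked with Python's ``re`` module. This is a minimal sketch that assumes search semantics (a match anywhere in the value); ``select_records`` is a hypothetical helper and the records are toy data.

.. code-block:: python

    import re


    def select_records(records, key, pattern):
        """Keep the records whose value for `key` matches `pattern`."""
        regexp = re.compile(pattern)
        return [r for r in records if key in r and regexp.search(r[key])]


    # Toy records: each dict stands for the attributes of one GTF line.
    records = [
        {"gene_id": "G0009"},
        {"gene_id": "G0090"},
        {"gene_id": "G0019"},
        {"feature": "exon"},  # no gene_id key: never selected
    ]
    hits = [r["gene_id"] for r in select_records(records, "gene_id", "G.*9$")]

Only the identifiers ending in ``9`` are retained, and a record lacking the key is ignored rather than raising an error.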

**Example:** Select lines whose gene_id matches the regular expression 'G.*9$'.

.. command-output:: gtftk select_by_regexp -i simple.gtf -k gene_id -r 'G.*9$'
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_regexp -h
	:shell:

------------------------------------------------------------------------------------------------------------------

select_by_intron_size
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Delete genes containing an intron whose size is below s. If -m is selected, any gene whose sum of intronic region length is above s is deleted. Monoexonic genes are kept.
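
The size test can be sketched in Python as follows. This is an illustration under stated assumptions, not gtftk's code: exons are 1-based inclusive ``(start, end)`` pairs, the test is applied to a single transcript, and the ``merged`` flag only approximates the behaviour described for -m by summing the intron sizes of that transcript. The function names are hypothetical.

.. code-block:: python

    def intron_sizes(exons):
        """Sizes of the gaps between consecutive exons of one transcript."""
        exons = sorted(exons)
        return [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]


    def keep_transcript(exons, size, merged=False):
        """Return True if the transcript passes the intron size test."""
        sizes = intron_sizes(exons)
        if not sizes:
            return True  # monoexonic: nothing to test, always kept
        if merged:
            return sum(sizes) <= size  # total intronic length must not exceed size
        return min(sizes) >= size  # no intron may be shorter than size


    exons = [(1, 10), (21, 30), (100, 120)]  # toy transcript with two introns
    sizes = intron_sizes(exons)
    kept_strict = keep_transcript(exons, 50)
    kept_loose = keep_transcript(exons, 10)
    kept_mono = keep_transcript([(5, 50)], 200)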

**Example:** List some genes having transcripts containing an intron whose size is below 200 nucleotides (*-s 200*).


.. command-output:: gtftk select_by_intron_size -s 200 -vd -i tiny_real.gtf.gz | gtftk intron_sizes | gtftk tabulate -k gene_name,transcript_id,intron_sizes -Hun
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_intron_size -h
	:shell:

------------------------------------------------------------------------------------------------------------------

select_by_max_exon_nb
~~~~~~~~~~~~~~~~~~~~~~

**Description:** For each gene select the transcript with the highest number of exons.


**Example:** For each gene, keep the transcript with the highest number of exons.

.. command-output:: gtftk select_by_max_exon_nb -i simple.gtf | gtftk select_by_key -t
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_max_exon_nb -h
	:shell:


------------------------------------------------------------------------------------------------------------------

select_by_loc
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select transcripts/genes overlapping a given location. A transcript is defined here as the genomic region from TSS to TTS, including introns. This command returns the transcript and all its associated elements (exons, UTRs...) even if only a fraction (e.g. an intron) of the transcript overlaps the location. If ``--ft-type`` is set to 'gene', the gene and all its associated elements are returned.
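
Because the whole TSS-to-TTS span is tested, a transcript is selected even when the location falls entirely inside one of its introns. The sketch below illustrates that overlap test with closed, 1-based coordinates; ``parse_region`` and ``overlaps`` are hypothetical helpers and the coordinates are toy values.

.. code-block:: python

    def parse_region(region):
        """Split a 'chrom:start-end' string into (chrom, start, end)."""
        chrom, span = region.split(":")
        start, end = span.split("-")
        return chrom, int(start), int(end)


    def overlaps(transcript, region):
        """True if the transcript span shares at least one base with the region."""
        chrom, start, end = parse_region(region)
        return (
            transcript["chrom"] == chrom
            and transcript["start"] <= end
            and transcript["end"] >= start
        )


    # Toy transcript whose exons are 1-5 and 90-100: the region lies in its intron.
    spanning = {"chrom": "chr1", "start": 1, "end": 100}
    downstream = {"chrom": "chr1", "start": 16, "end": 30}
    hit = overlaps(spanning, "chr1:10-15")
    miss = overlaps(downstream, "chr1:10-15")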

**Example:** Select transcripts at a given location.

.. command-output:: gtftk select_by_key -k feature -v transcript -i simple.gtf | gtftk  select_by_loc -l chr1:10-15
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_loc -h
	:shell:

------------------------------------------------------------------------------------------------------------------

select_by_nb_exon
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select transcripts based on the number of exons.

**Example:**

.. command-output:: gtftk select_by_nb_exon -m 2 -i simple.gtf | gtftk nb_exons | gtftk select_by_key -t
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_nb_exon -h
	:shell:


------------------------------------------------------------------------------------------------------------------


select_by_numeric_value
~~~~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select lines from a GTF file based on a boolean test on numeric values.
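
One way to picture such a test is to evaluate the expression once per line, after converting the referenced values to numbers and skipping lines where a value is undefined. The sketch below is an assumption about the general idea, not a description of how gtftk evaluates the expression; ``select_records`` and the ``na_values`` parameter are hypothetical names.

.. code-block:: python

    def select_records(records, test, na_values=(".", "?")):
        """Keep the records for which the boolean expression `test` is true."""
        code = compile(test, "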

**Example:**

.. command-output:: gtftk join_attr -i simple.gtf -j simple.join_mat -k gene_id -m | gtftk select_by_numeric_value -t 'start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7' -n ".,?"
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_numeric_value -h
	:shell:


------------------------------------------------------------------------------------------------------------------

random_list
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select a random list of genes or transcripts.

**Example:** Select randomly 3 transcripts.

.. command-output:: gtftk random_list -n 3 -i simple.gtf | gtftk count
	:shell:


**Arguments:**

.. command-output:: gtftk random_list -h
	:shell:

------------------------------------------------------------------------------------------------------------------

random_tx
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Randomly select up to m transcripts for each gene.

**Example:** Randomly select 1 transcript per gene (*-m 1*).

.. command-output:: gtftk random_tx -m 1 -i simple.gtf | gtftk select_by_key -k feature -v gene,transcript| gtftk tabulate -k gene_id,transcript_id
	:shell:

**Arguments:**

.. command-output:: gtftk random_tx -h
	:shell:

------------------------------------------------------------------------------------------------------------------

rm_dup_tss
~~~~~~~~~~~~~~~~~~~~~~

**Description:** If several transcripts of a gene share the same TSS, select only one.
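
The TSS depends on the strand: it is the start coordinate on the + strand and the end coordinate on the - strand. The sketch below deduplicates on that position. It assumes that the first transcript encountered is the one kept, which may differ from the rule gtftk applies; the helper names and the records are hypothetical.

.. code-block:: python

    def tss(transcript):
        """Strand-aware transcription start site."""
        if transcript["strand"] == "+":
            return transcript["start"]
        return transcript["end"]


    def drop_duplicate_tss(transcripts):
        """Keep one transcript per (gene, chromosome, TSS) combination."""
        seen = set()
        kept = []
        for tx in transcripts:
            key = (tx["gene_id"], tx["chrom"], tss(tx))
            if key not in seen:
                seen.add(key)
                kept.append(tx)
        return kept


    # Toy transcripts: gA.1/gA.2 share TSS 10, gB.1/gB.2 share TSS 90 (- strand).
    transcripts = [
        {"gene_id": "gA", "tx_id": "gA.1", "chrom": "chr1", "strand": "+", "start": 10, "end": 50},
        {"gene_id": "gA", "tx_id": "gA.2", "chrom": "chr1", "strand": "+", "start": 10, "end": 80},
        {"gene_id": "gA", "tx_id": "gA.3", "chrom": "chr1", "strand": "+", "start": 20, "end": 80},
        {"gene_id": "gB", "tx_id": "gB.1", "chrom": "chr2", "strand": "-", "start": 5, "end": 90},
        {"gene_id": "gB", "tx_id": "gB.2", "chrom": "chr2", "strand": "-", "start": 30, "end": 90},
    ]
    kept_ids = [tx["tx_id"] for tx in drop_duplicate_tss(transcripts)]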

**Example:** Use rm_dup_tss to select transcripts that will be used for mk_matrix -k 5 (see later).

.. command-output:: gtftk rm_dup_tss -i simple.gtf | gtftk select_by_key -k feature -v transcript
	:shell:


**Arguments:**

.. command-output:: gtftk rm_dup_tss -h
	:shell:


------------------------------------------------------------------------------------------------------------------

select_by_go
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select genes from a GTF file using a Gene Ontology ID (e.g. GO:0050789).

**Example:** Select genes with transcription factor activity from the GTF. They could be used subsequently to test their epigenetic features (see later).

.. command-output:: # gtftk select_by_go -s hsapiens -i mini_real.gtf.gz | gtftk select_by_key -k feature -v gene | gtftk tabulate -k gene_id,gene_name -Hun | head -6
	:shell:

**Arguments:**

.. command-output:: gtftk select_by_go -h
	:shell:


------------------------------------------------------------------------------------------------------------------

select_by_tx_size
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select transcripts based on their size (i.e. the size of the mature/spliced transcript).
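
The mature size is the sum of the exon lengths, introns excluded. A minimal sketch with 1-based inclusive exon coordinates follows; ``mature_size``, ``select_by_size`` and the ``min_size``/``max_size`` parameters are hypothetical names, and the transcripts are toy data.

.. code-block:: python

    def mature_size(exons):
        """Length of the spliced transcript: sum of the exon lengths."""
        return sum(end - start + 1 for start, end in exons)


    def select_by_size(tx_exons, min_size=0, max_size=float("inf")):
        """Map each retained transcript to its mature size."""
        sizes = {tx: mature_size(exons) for tx, exons in tx_exons.items()}
        return {tx: s for tx, s in sizes.items() if min_size <= s <= max_size}


    tx_exons = {
        "t1": [(1, 5), (11, 15)],    # 5 + 5 = 10
        "t2": [(1, 10), (21, 30)],   # 10 + 10 = 20
        "t3": [(1, 14)],             # 14
    }
    selected = select_by_size(tx_exons, min_size=14)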

**Example:**

.. command-output:: gtftk feature_size -t mature_rna -i simple.gtf |  gtftk select_by_tx_size -m 14 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
	:shell:


**Arguments:**

.. command-output:: gtftk select_by_tx_size -h
	:shell:

------------------------------------------------------------------------------------------------------------------

select_most_5p_tx
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Select the most 5' transcript of each gene.

**Example:**

.. command-output:: gtftk select_most_5p_tx -i simple.gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
	:shell:

**Arguments:**

.. command-output:: gtftk select_most_5p_tx -h
	:shell:

------------------------------------------------------------------------------------------------------------------

short_long
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Get the shortest or longest transcript of each gene.

**Example:**

.. command-output:: gtftk short_long -i simple.gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
	:shell:

**Arguments:**

.. command-output:: gtftk short_long -h
	:shell: