Commands from section ‘selection’

In this section we will require the following datasets:

$ gtftk get_example -q -d mini_real -f '*'
$ gtftk get_example -q -d tiny_real -f '*'
$ gtftk get_example -q -d simple -f '*'

select_by_key

Description: Extract lines from a GTF file based on a key and its values.

Example: Select lines by gene_id.

$ gtftk select_by_key -i simple.gtf -k gene_id -v G0002,G0003,G0004
chr1	gtftk	gene	180	189	.	+	.	gene_id "G0002";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	exon	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1	gtftk	CDS	180	182	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; ccds_id "CDS_G0002T001";
chr1	gtftk	gene	50	61	.	-	.	gene_id "G0003";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	exon	50	54	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1	gtftk	exon	57	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1	gtftk	CDS	50	52	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; ccds_id "CDS_G0003T001";
chr1	gtftk	gene	65	76	.	+	.	gene_id "G0004";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1	gtftk	CDS	66	68	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1	gtftk	CDS	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1	gtftk	CDS	74	75	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003";
chr1	gtftk	CDS	65	67	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; ccds_id "CDS_G0004T001";

Example: Select using basic attributes (chrom, source, feature…). Note that seqid, seqname and chrom are synonymous.

$ gtftk select_by_key -i simple.gtf -k feature -v transcript,exon | gtftk select_by_key -k seqname -v chr1
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	exon	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	exon	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	exon	50	54	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1	gtftk	exon	57	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001";
chr1	gtftk	exon	65	68	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1	gtftk	exon	71	71	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
chr1	gtftk	exon	74	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E001";
chr1	gtftk	exon	42	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E002";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001";
chr1	gtftk	exon	22	25	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E001";
chr1	gtftk	exon	28	30	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E002";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E003";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002";
chr1	gtftk	exon	28	30	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E001";
chr1	gtftk	exon	33	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E002";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001";
chr1	gtftk	exon	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001"; exon_id "G0007T001E001";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002";
chr1	gtftk	exon	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T002"; exon_id "G0007T002E001";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001";
chr1	gtftk	exon	210	214	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E001";
chr1	gtftk	exon	220	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E002";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001";
chr1	gtftk	exon	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001"; exon_id "G0010T001E001";
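
The two examples above can be pictured with a minimal Python sketch. It is not gtftk's implementation: the helpers `parse_attributes` and `select_by_key` are hypothetical, and the sketch only assumes the usual nine-column, tab-separated GTF layout.

```python
# Hypothetical sketch of key/value selection on GTF lines (not gtftk's code).
import re

# Names accepted for the eight fixed GTF columns; seqid, seqname and chrom
# are treated as synonyms, as noted above.
BASIC_COLUMNS = {"seqid": 0, "seqname": 0, "chrom": 0, "source": 1,
                 "feature": 2, "start": 3, "end": 4, "score": 5,
                 "strand": 6, "frame": 7}


def parse_attributes(field):
    """Turn 'gene_id "G0002"; transcript_id "G0002T001";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)";', field))


def select_by_key(lines, key, values, invert=False):
    """Yield the lines whose `key` is (or, if invert, is not) in `values`."""
    wanted = set(values)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if key in BASIC_COLUMNS:
            found = cols[BASIC_COLUMNS[key]]
        else:
            found = parse_attributes(cols[8]).get(key)
        if (found in wanted) != invert:
            yield line


gtf = [
    'chr1\tgtftk\tgene\t180\t189\t.\t+\t.\tgene_id "G0002";',
    'chr1\tgtftk\ttranscript\t180\t189\t.\t+\t.\tgene_id "G0002"; transcript_id "G0002T001";',
    'chr1\tgtftk\tgene\t3\t14\t.\t-\t.\tgene_id "G0009";',
]
by_gene = list(select_by_key(gtf, "gene_id", ["G0002", "G0003"]))
by_feature = list(select_by_key(gtf, "feature", ["gene"]))
print(len(by_gene), len(by_feature))  # 2 2
```

Chaining two calls, as the pipe does in the second example, amounts to feeding the lines yielded by one selection into the next.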

Arguments:

$ gtftk select_by_key -h
  Usage: gtftk select_by_key [-i GTF] [-o GTF] [-k KEY] [-v VALUE] [-f FILE] [-c COL] [-n] [-b] [-m NAME] [-s SEP] [-l] [-t] [-g] [-e] [-d] [-a] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select lines from a GTF file based on attributes and associated values.

optional arguments:
 -v, --value                  A comma-separated list of values. (default: None)
 -f, --file-with-values       A file containing values as a single column. (default: None)
 -t, --select-transcripts     A shortcuts for "-k feature -v transcript". (default: False)
 -g, --select-genes           A shortcuts for "-k feature -v gene". (default: False)
 -e, --select-exons           A shortcuts for "-k feature -v exon". (default: False)
 -d, --select-cds             A shortcuts for "-k feature -v CDS". (default: False)
 -a, --select-start-codon     A shortcuts for "-k feature -v start_codon". (default: False)

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -k, --key                    The key name. (default: None)
 -c, --col                    The column number (one-based) that contains the values in the file. File is tab-delimited. (default: 1)
 -n, --invert-match           Not/invert match. Select lines whose selected key is not associated with the selected values. (default: False)
 -b, --bed-format             Ask for bed format output. (default: False)
 -m, --names                  If Bed output. The key(s) that should be used as name. (default: gene_id,transcript_id)
 -s, --separator              If Bed output. The separator to be used for separating name elements (see -n). (default: |)
 -l, --log                    Print some statistics about selected features. To be used in conjunction with -V 1/2. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_regexp

Description: Select lines by testing the values of a particular key against a regular expression.

Example: Select lines whose gene_id matches the regular expression ‘G.*9$’.

$ gtftk select_by_regexp -i simple.gtf -k gene_id -r 'G.*9$'
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001";
chr1	gtftk	CDS	5	10	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; ccds_id "CDS_G0009T002";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001";
chr1	gtftk	CDS	3	8	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; ccds_id "CDS_G0009T001";
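
As an illustration only, the matching step can be sketched with Python's `re` module. Whether the tool searches the pattern anywhere in the value or anchors it at the start is an assumption here (`re.search` is used); for the two patterns shown both behaviours give the same result.

```python
# Hypothetical sketch of regexp-based selection (not gtftk's code).
import re

# gene_id values seen in the outputs above: G0001 .. G0010.
gene_ids = ["G%04d" % i for i in range(1, 11)]

# Assumption: the pattern is searched anywhere in the value.
pattern = re.compile(r"G.*9$")
selected = [g for g in gene_ids if pattern.search(g)]
print(selected)  # ['G0009']

# The default pattern listed in the help targets chr1..chr22, chrX and chrY.
default = re.compile(r"^chr[0-9XY]+$")
chroms = ["chr1", "chrX", "chrM", "chr1_random"]
conventional = [c for c in chroms if default.search(c)]
print(conventional)  # ['chr1', 'chrX']
```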

Arguments:

$ gtftk select_by_regexp -h
  Usage: gtftk select_by_regexp [-i GTF] [-o GTF] [-k KEY] [-r regexp] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select lines from a GTF file based on a regexp.

  Notes:
     *  The default is to try to select feature from conventional human chromosome (chr1..chr22,
     chrX and chrY) with --key set to chrom and --regexp set to "^chr[0-9XY]+$".

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -k, --key                    The key name (default: chrom)
 -r, --regexp                 The regular expression. (default: ^chr[0-9XY]+$)
 -n, --invert-match           Not/invert match. Selected lines whose requested key do not match the regexp. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_intron_size

Description: Delete genes containing an intron whose size is below s. If -m is selected, any gene whose sum of intronic region length is above s is deleted. Monoexonic genes are kept.

Example: Select genes having transcripts that contain an intron shorter than 200 nucleotides.

$ gtftk select_by_intron_size -s 200 -vd -i tiny_real.gtf.gz | gtftk intron_sizes | gtftk tabulate -k gene_name,transcript_id,intron_sizes -Hun
MCAM	ENST00000526992	159,98
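
The intron sizes reported above are the gaps between consecutive exons of a transcript. A minimal sketch of that arithmetic, with hypothetical helper names and assuming 1-based inclusive GTF coordinates:

```python
# Hypothetical sketch: intron sizes of a transcript and a size test.

def intron_sizes(exons):
    """exons: list of (start, end), 1-based inclusive as in GTF."""
    exons = sorted(exons)
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def has_short_intron(exons, size):
    """True if at least one intron is shorter than `size`."""
    return any(s < size for s in intron_sizes(exons))


# Exons of G0004T002 in simple.gtf: 65-68, 71-71, 74-76.
sizes = intron_sizes([(65, 68), (71, 71), (74, 76)])
print(sizes)                               # [2, 2]
print(has_short_intron([(65, 68)], 200))   # False: a single exon has no intron
```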

Arguments:

$ gtftk select_by_intron_size -h

select_by_max_exon_nb

Description: For each gene select the transcript with the highest number of exons.

Example: For each gene, keep the transcript with the highest number of exons, then display the transcript lines.

$ gtftk select_by_max_exon_nb -i simple.gtf | gtftk select_by_key -t
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T002";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001";
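
The per-gene choice can be sketched as follows. This is an illustration with a hypothetical function, not the tool's code; it counts exon lines per transcript and keeps the first transcript encountered in case of a tie, as stated in the help.

```python
# Hypothetical sketch: keep, for each gene, the transcript with most exons.

def select_by_max_exon_nb(exon_records):
    """exon_records: iterable of (gene_id, transcript_id), one per exon line."""
    counts = {}  # (gene, transcript) -> exon count, in order of first appearance
    for gene, tx in exon_records:
        counts[(gene, tx)] = counts.get((gene, tx), 0) + 1
    best = {}    # gene -> (transcript, exon count)
    for (gene, tx), n in counts.items():
        # Strict '>' keeps the first transcript encountered when counts tie.
        if gene not in best or n > best[gene][1]:
            best[gene] = (tx, n)
    return {gene: tx for gene, (tx, _) in best.items()}


# Exon lines of G0006 and G0001, in the order they appear in simple.gtf.
records = ([("G0006", "G0006T001")] * 3 + [("G0006", "G0006T002")] * 2
           + [("G0001", "G0001T002"), ("G0001", "G0001T001")])
chosen = select_by_max_exon_nb(records)
print(chosen)  # {'G0006': 'G0006T001', 'G0001': 'G0001T002'}
```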

Arguments:

$ gtftk select_by_max_exon_nb -h
  Usage: gtftk select_by_max_exon_nb [-i GTF] [-o GTF] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     For each gene select the transcript with the highest number of exons. If ties, select the first
     encountered.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_loc

Description: Select transcripts/genes overlapping the given locations. A transcript is defined here as the genomic region from TSS to TTS, including introns. This command returns the transcript and all its associated elements (exons, UTRs…) even if only a fraction of the transcript (e.g. an intron) overlaps the location. If --ft-type is set to ‘gene’, the gene and all its associated elements are returned.

Example: Select transcripts at a given location.

$ gtftk select_by_key -k feature -v transcript -i simple.gtf | gtftk  select_by_loc -l chr1:10-15
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
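
The overlap test behind such a query can be sketched as below. The help states that locations are 0-based; treating them as half-open (BED-like) intervals is an assumption of this sketch, and `overlaps` is a hypothetical helper.

```python
# Hypothetical sketch of the overlap test behind a location query.

def overlaps(tx_start, tx_end, loc_start, loc_end):
    """tx_*: 1-based inclusive GTF coordinates.
    loc_*: assumed 0-based, half-open (BED-like) coordinates."""
    tx_start0 = tx_start - 1  # GTF start converted to 0-based; end is unchanged
    return tx_start0 < loc_end and loc_start < tx_end


# Query chr1:10-15 against two transcripts of simple.gtf.
print(overlaps(3, 14, 10, 15))   # G0009 transcripts (3-14): True
print(overlaps(22, 35, 10, 15))  # G0006T001 (22-35): False
```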

Arguments:

$ gtftk select_by_loc -h
  Usage: gtftk select_by_loc [-i GTF] [-o GTF] (-l LOC | -f BEDFILE) [-t {transcript,gene}] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select transcripts/gene overlapping a given locations.

  Notes:
     *  A transcript is defined here as the genomic region from TSS to TTS including introns.
     *  This function will return the transcript and all its associated elements (exons, utr...)
     even if only a fraction (e.g intron) of the transcript is overlapping the feature.
     *  If -/-ft-type is set to 'gene' returns the gene and all its associated elements.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -l, --location               List of chromosomal locations (chr:start-end[,chr:start-end]). 0-based (default: None)
 -f, --location-file          Bed file with chromosomal location. (default: None)
 -t, --ft-type                The feature of interest. (default: transcript)
 -n, --invert-match           Not/invert match. Select transcript not overlapping. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_nb_exon

Description: Select transcripts based on the number of exons.

Example: Select transcripts having at least 2 exons (-m 2).

$ gtftk select_by_nb_exon -m 2 -i simple.gtf | gtftk nb_exons| gtftk select_by_key -t
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001"; nb_exons "3";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001"; nb_exons "2";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001"; nb_exons "3";
chr1	gtftk	transcript	28	35	.	-	.	gene_id "G0006"; transcript_id "G0006T002"; nb_exons "2";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001"; nb_exons "2";
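
A minimal sketch of the filter, assuming exon counts are already known per transcript. The output above shows that the minimum is inclusive; treating the maximum as inclusive too is an assumption, and the function is hypothetical.

```python
# Hypothetical sketch: filter transcripts on their exon count.

def select_by_nb_exon(nb_exons, min_nb=0, max_nb=None):
    """nb_exons: dict transcript_id -> number of exons."""
    return [tx for tx, n in nb_exons.items()
            if n >= min_nb and (max_nb is None or n <= max_nb)]


nb_exons = {"G0002T001": 1, "G0003T001": 2, "G0004T002": 3}
print(select_by_nb_exon(nb_exons, min_nb=2))            # ['G0003T001', 'G0004T002']
print(select_by_nb_exon(nb_exons, min_nb=2, max_nb=2))  # ['G0003T001']
```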

Arguments:

$ gtftk select_by_nb_exon -h
  Usage: gtftk select_by_nb_exon [-i GTF] [-o GTF] [-m min_exon_number] [-M max_exon_number] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select transcripts based on the number of exons.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -m, --min-exon-number        Minimum number of exons. (default: 0)
 -M, --max-exon-number        Maximum number of exons. (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_numeric_value

Description: Select lines from a GTF file based on a boolean test on numeric values.

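Example: Join the numeric attributes S1 and S2 to the GTF, then select lines passing a test on start, end, S1 and S2.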

$ gtftk join_attr -i simple.gtf  -j simple.join_mat -k gene_id -m|  gtftk select_by_numeric_value -t 'start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7' -n ".,?"
chr1	gtftk	gene	3	14	.	-	.	gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; S1 "0.5555"; S2 "0.7";
chr1	gtftk	exon	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; S1 "0.5555"; S2 "0.7";

Arguments:

$ gtftk select_by_numeric_value -h
  Usage: gtftk select_by_numeric_value [-i GTF] [-o GTF] -t test [-n na_omit] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select lines from a GTF file based on a boolean test on numeric values.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -t, --test                   The test to be applied. (default: None)
 -n, --na-omit                If one of the evaluated values is enclosed in this list (csv), line is skipped. (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

random_list

Description: Select a random list of genes or transcripts.

Example: Randomly select 3 transcripts.

$ gtftk random_list -n 3 -i simple.gtf | gtftk count
transcript	3
exon	6
CDS	5

Arguments:

$ gtftk random_list -h
  Usage: gtftk random_list [-i GTF] [-o GTF] [-n NUMBER] [-t {gene,transcript}] [-s SEED] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select a random list of genes or transcripts. Note that if transcripts are requested the 'gene'
     feature is not returned.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -n, --number                 The number of transcripts or gene to select. (default: 1)
 -t, --ft-type                The type of feature. (default: transcript)
 -s, --seed-value             Seed value for the random number generator. (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

random_tx

Description: Randomly select up to m transcripts for each gene.

Example: Randomly select 1 transcript per gene (-m 1).

$ gtftk random_tx -m 1 -i simple.gtf | gtftk select_by_key -k feature -v gene,transcript| gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T001
G0002	G0002T001
G0003	G0003T001
G0004	G0004T001
G0005	G0005T001
G0006	G0006T001
G0007	G0007T002
G0008	G0008T001
G0009	G0009T002
G0010	G0010T001
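
A minimal sketch of a per-gene random draw, using Python's `random` module. The function and its parameters are hypothetical; the point is only that a fixed seed makes the selection repeatable, which is what a seed option is typically used for.

```python
# Hypothetical sketch: pick up to m transcripts per gene, reproducibly.
import random


def random_tx(tx_by_gene, max_tx=1, seed=None):
    """tx_by_gene: dict gene_id -> list of transcript_ids."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return {gene: rng.sample(txs, min(max_tx, len(txs)))
            for gene, txs in tx_by_gene.items()}


tx_by_gene = {"G0001": ["G0001T001", "G0001T002"], "G0002": ["G0002T001"]}
first = random_tx(tx_by_gene, max_tx=1, seed=123)
again = random_tx(tx_by_gene, max_tx=1, seed=123)
print(first == again)  # True
```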

Arguments:

$ gtftk random_tx -h
  Usage: gtftk random_tx [-i GTF] [-o GTF] [-m MAX] [-s SEED] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select randomly up to m transcript for each gene.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -m, --max-transcript         The maximum number of transcripts to select for each gene. (default: 1)
 -s, --seed-value             Seed value for the random number generator. (default: None)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

rm_dup_tss

Description: If several transcripts of a gene share the same TSS, select only one.

Example: Use rm_dup_tss to select transcripts that will be used for mk_matrix -k 5 (see later).

$ gtftk rm_dup_tss -i simple.gtf | gtftk select_by_key -k feature -v transcript
chr1	gtftk	transcript	125	138	.	+	.	gene_id "G0001"; transcript_id "G0001T001";
chr1	gtftk	transcript	180	189	.	+	.	gene_id "G0002"; transcript_id "G0002T001";
chr1	gtftk	transcript	50	61	.	-	.	gene_id "G0003"; transcript_id "G0003T001";
chr1	gtftk	transcript	65	76	.	+	.	gene_id "G0004"; transcript_id "G0004T001";
chr1	gtftk	transcript	33	47	.	-	.	gene_id "G0005"; transcript_id "G0005T001";
chr1	gtftk	transcript	22	35	.	-	.	gene_id "G0006"; transcript_id "G0006T001";
chr1	gtftk	transcript	107	116	.	+	.	gene_id "G0007"; transcript_id "G0007T001";
chr1	gtftk	transcript	210	222	.	-	.	gene_id "G0008"; transcript_id "G0008T001";
chr1	gtftk	transcript	3	14	.	-	.	gene_id "G0009"; transcript_id "G0009T001";
chr1	gtftk	transcript	176	186	.	+	.	gene_id "G0010"; transcript_id "G0010T001";
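
The deduplication can be sketched as follows, with hypothetical helpers. The TSS is taken as the start coordinate on the '+' strand and the end coordinate on the '-' strand, and the alphanumerically first transcript_id represents each TSS, as described in the help notes.

```python
# Hypothetical sketch: one transcript per (gene, TSS).

def tss(start, end, strand):
    """TSS is the start on '+' and the end on '-' (GTF coordinates)."""
    return start if strand == "+" else end


def rm_dup_tss(transcripts):
    """transcripts: list of (gene_id, transcript_id, start, end, strand)."""
    kept = {}
    # Sorting by transcript_id makes the alphanumerically first one win.
    for gene, tx, start, end, strand in sorted(transcripts, key=lambda t: t[1]):
        kept.setdefault((gene, tss(start, end, strand)), tx)
    return sorted(kept.values())


transcripts = [
    ("G0006", "G0006T001", 22, 35, "-"),
    ("G0006", "G0006T002", 28, 35, "-"),  # same TSS (35) as G0006T001
    ("G0001", "G0001T002", 125, 138, "+"),
    ("G0001", "G0001T001", 125, 138, "+"),
]
unique = rm_dup_tss(transcripts)
print(unique)  # ['G0001T001', 'G0006T001']
```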

Arguments:

$ gtftk rm_dup_tss -h
  Usage: gtftk rm_dup_tss [-i GTF] [-o GTF] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     If several transcripts of a gene share the same TSS, select one transcript per TSS.

  Notes:
     *  The alphanumeric order of transcript_id is used to select the representative of a TSS.

Argument:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_go

Description: Select genes from a GTF file using a Gene Ontology ID (e.g. GO:0050789).

Example: Select genes with transcription factor activity from the GTF. They could be used subsequently to test their epigenetic features (see later).

$ # gtftk select_by_go -s hsapiens -i mini_real.gtf.gz | gtftk select_by_key -k feature -v gene | gtftk tabulate -k gene_id,gene_name -Hun | head -6

Arguments:

$ gtftk select_by_go -h
  Usage: gtftk select_by_go [-i GTF] [-o GTF] [-g go_id] (-l | -s species) [-n] [-p1 http_proxy] [-p2 https_proxy] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select lines/genes from a GTF file using a Gene Ontology ID (e.g GO:0097194).

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -g, --go-id                  The GO ID (with or without "GO:" prefix). (default: GO:0003700)
 -l, --list-datasets          Do not select lines. Only get a list of available datasets/species. (default: False)
 -s, --species                The dataset/species. (default: None)
 -n, --invert-match           Not/invert match. (default: False)
 -p1, --http-proxy            Use this http proxy (not tested/experimental). (default: )
 -p2, --https-proxy           Use this https proxy (not tested/experimental). (default: )

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_by_tx_size

Description: Select transcripts based on their size (i.e. the size of the mature/spliced transcript).

Example: Select transcripts whose mature RNA is at least 14 nucleotides long (-m 14).

$ gtftk feature_size -t mature_rna -i simple.gtf |  gtftk select_by_tx_size -m 14 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id	transcript_id	feat_size
G0001	G0001T002	14
G0001	G0001T001	14
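
The mature transcript size is the sum of the exon lengths. A minimal sketch with a hypothetical helper, assuming 1-based inclusive GTF coordinates:

```python
# Hypothetical sketch: mature (spliced) size as the sum of exon lengths.

def mature_size(exons):
    """exons: list of (start, end), 1-based inclusive as in GTF."""
    return sum(end - start + 1 for start, end in exons)


# Exon coordinates taken from simple.gtf.
exons_by_tx = {
    "G0001T001": [(125, 138)],
    "G0003T001": [(50, 54), (57, 61)],
}
sizes = {tx: mature_size(exons) for tx, exons in exons_by_tx.items()}
print(sizes)                                              # {'G0001T001': 14, 'G0003T001': 10}
print([tx for tx, size in sizes.items() if size >= 14])   # ['G0001T001']
```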

Arguments:

$ gtftk select_by_tx_size -h
  Usage: gtftk select_by_tx_size [-i GTF] [-o GTF] [-m min_size] [-M max_size] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select transcript based on their size (i.e size of mature/spliced transcript).

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -m, --min-size               Minimum size. (default: 0)
 -M, --max-size               Maximum size. (default: 1000000000)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

select_most_5p_tx

Description: Select the most 5’ transcript of each gene.

Example: List the gene_id and transcript_id of the most 5' transcript selected for each gene.

$ gtftk select_most_5p_tx -i simple.gtf | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T002
G0002	G0002T001
G0003	G0003T001
G0004	G0004T002
G0005	G0005T001
G0006	G0006T001
G0007	G0007T001
G0008	G0008T001
G0009	G0009T002
G0010	G0010T001
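
The selection rule can be illustrated with a short Python sketch. It is not the gtftk implementation. It assumes that the TSS of a transcript is its start coordinate on the + strand and its end coordinate on the - strand, so the most 5' transcript has the smallest start on + and the largest end on -. The records are hypothetical, and keeping the first transcript encountered on a tie is also an assumption (the command only guarantees that a single transcript is selected).

```python
# Hypothetical transcript records: (gene_id, transcript_id, strand, start, end).
TRANSCRIPTS = [
    ("geneA", "geneA_tx1", "+", 120, 300),
    ("geneA", "geneA_tx2", "+", 100, 300),
    ("geneB", "geneB_tx1", "-", 500, 900),
    ("geneB", "geneB_tx2", "-", 500, 950),
    ("geneC", "geneC_tx1", "+", 10, 50),
    ("geneC", "geneC_tx2", "+", 10, 80),
]


def most_5p_transcripts(transcripts):
    """Return {gene_id: transcript_id} for the most 5' transcript of each gene."""
    best = {}
    for gene_id, tx_id, strand, start, end in transcripts:
        # The TSS is the start on the + strand and the end on the - strand.
        tss = start if strand == "+" else end
        # Lower rank means more 5': small coordinates on +, large coordinates on -.
        rank = tss if strand == "+" else -tss
        # Strict comparison: on a tie, the first transcript seen is kept (assumption).
        if gene_id not in best or rank < best[gene_id][0]:
            best[gene_id] = (rank, tx_id)
    return {gene_id: tx_id for gene_id, (_, tx_id) in best.items()}


result = most_5p_transcripts(TRANSCRIPTS)
print(result)
```

Here geneA_tx2 wins on the + strand (TSS at 100), geneB_tx2 wins on the - strand (TSS at 950), and geneC illustrates the tie case, where only one of the two transcripts sharing a TSS is returned.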

Arguments:

$ gtftk select_most_5p_tx -h
  Usage: gtftk select_most_5p_tx [-i GTF] [-o GTF] [-g] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select the most 5' transcript of each gene.

  Notes:
     *  If several transcripts share the same most 5' TSS, only one transcript is selected.

optional arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -g, --keep-gene-lines        Add gene lines to the output (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)

short_long

Description: Get the shortest or longest transcript of each gene.

Example: Select the shortest transcript of each gene (the default; use -l to select the longest instead).

$ gtftk short_long -i simple.gtf | gtftk select_by_key -k feature -v transcript | gtftk tabulate -k gene_id,transcript_id
gene_id	transcript_id
G0001	G0001T002
G0002	G0002T001
G0003	G0003T001
G0004	G0004T002
G0005	G0005T001
G0006	G0006T002
G0007	G0007T001
G0008	G0008T001
G0009	G0009T002
G0010	G0010T001
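
As a rough illustration of the per-gene choice, the following Python sketch picks the shortest transcript of each gene, or the longest when a flag mirroring -l is set. It is not the gtftk implementation. The records and sizes are hypothetical, sizes are assumed to be mature (intron-free) transcript sizes, and keeping the first transcript encountered when sizes are equal is an assumption.

```python
# Hypothetical records: (gene_id, transcript_id, mature_size).
TX_SIZES = [
    ("geneA", "geneA_tx1", 14),
    ("geneA", "geneA_tx2", 9),
    ("geneB", "geneB_tx1", 20),
    ("geneC", "geneC_tx1", 7),
    ("geneC", "geneC_tx2", 7),
    ("geneC", "geneC_tx3", 12),
]


def pick_by_size(tx_sizes, longest=False):
    """Return {gene_id: transcript_id}, shortest by default, longest if requested."""
    chosen = {}
    for gene_id, tx_id, size in tx_sizes:
        if gene_id not in chosen:
            chosen[gene_id] = (tx_id, size)
            continue
        current_size = chosen[gene_id][1]
        # Strict comparisons: equal sizes keep the first transcript seen (assumption).
        is_better = size > current_size if longest else size < current_size
        if is_better:
            chosen[gene_id] = (tx_id, size)
    return {gene_id: tx_id for gene_id, (tx_id, _) in chosen.items()}


shortest = pick_by_size(TX_SIZES)
longest = pick_by_size(TX_SIZES, longest=True)
print(shortest)
print(longest)
```

A gene with a single transcript (geneB here) yields the same transcript in both modes.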

Arguments:

$ gtftk short_long -h
  Usage: gtftk short_long [-i GTF] [-o GTF] [-l] [-g] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Select the shortest mature transcript (i.e. without introns) for each gene, or the longest if the -l
     argument is used.

  Notes:
     *

Argument:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output file. (default: <stdout>)
 -l, --longs                  Take the longest transcript of each gene (default: False)
 -g, --keep-gene-lines        Add gene lines to the output (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)