Commands from section ‘selection’
In this section we will require the following datasets:
$ gtftk get_example -q -d mini_real -f '*'
$ gtftk get_example -q -d tiny_real -f '*'
$ gtftk get_example -q -d simple -f '*'
select_by_key
Description: Extract lines from the GTF based on a key and its associated values.
Example: Select the lines associated with a set of gene_id values.
$ gtftk select_by_key -i simple.gtf -k gene_id -v G0002,G0003,G0004
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk exon 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1 gtftk CDS 180 182 . + . gene_id "G0002"; transcript_id "G0002T001"; ccds_id "CDS_G0002T001";
chr1 gtftk gene 50 61 . - . gene_id "G0003";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk exon 50 54 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1 gtftk exon 57 61 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1 gtftk CDS 50 52 . - . gene_id "G0003"; transcript_id "G0003T001"; ccds_id "CDS_G0003T001";
chr1 gtftk gene 65 76 . + . gene_id "G0004";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1 gtftk CDS 66 68 . + . gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1 gtftk CDS 71 71 . + . gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1 gtftk CDS 74 75 . + . gene_id "G0004"; transcript_id "G0004T002"; ccds_id "CDS_G0004T002";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003";
chr1 gtftk CDS 65 67 . + . gene_id "G0004"; transcript_id "G0004T001"; ccds_id "CDS_G0004T001";
Example: Select using basic attributes (chrom, source, feature…). Note that seqid, seqname and chrom are synonymous.
$ gtftk select_by_key -i simple.gtf -k feature -v transcript,exon | gtftk select_by_key -k seqname -v chr1
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; exon_id "G0001T002E001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk exon 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; exon_id "G0001T001E001";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk exon 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; exon_id "G0002T001E001";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk exon 50 54 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E001";
chr1 gtftk exon 57 61 . - . gene_id "G0003"; transcript_id "G0003T001"; exon_id "G0003T001E002";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E002";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T002"; exon_id "G0004T002E003";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001";
chr1 gtftk exon 65 68 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E001";
chr1 gtftk exon 71 71 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E002";
chr1 gtftk exon 74 76 . + . gene_id "G0004"; transcript_id "G0004T001"; exon_id "G0004T001E003";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001";
chr1 gtftk exon 33 35 . - . gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E001";
chr1 gtftk exon 42 47 . - . gene_id "G0005"; transcript_id "G0005T001"; exon_id "G0005T001E002";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001";
chr1 gtftk exon 22 25 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E001";
chr1 gtftk exon 28 30 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E002";
chr1 gtftk exon 33 35 . - . gene_id "G0006"; transcript_id "G0006T001"; exon_id "G0006T001E003";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002";
chr1 gtftk exon 28 30 . - . gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E001";
chr1 gtftk exon 33 35 . - . gene_id "G0006"; transcript_id "G0006T002"; exon_id "G0006T002E002";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001";
chr1 gtftk exon 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; exon_id "G0007T001E001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T002";
chr1 gtftk exon 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; exon_id "G0007T002E001";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001";
chr1 gtftk exon 210 214 . - . gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E001";
chr1 gtftk exon 220 222 . - . gene_id "G0008"; transcript_id "G0008T001"; exon_id "G0008T001E002";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001";
chr1 gtftk exon 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; exon_id "G0010T001E001";
Arguments:
$ gtftk select_by_key -h
Usage: gtftk select_by_key [-i GTF] [-o GTF] [-k KEY] [-v VALUE] [-f FILE] [-c COL] [-n] [-b] [-m NAME] [-s SEP] [-l] [-t] [-g] [-e] [-d] [-a] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select lines from a GTF file based on attributes and associated values.
optional arguments:
-v, --value A comma-separated list of values. (default: None)
-f, --file-with-values A file containing values as a single column. (default: None)
-t, --select-transcripts A shortcuts for "-k feature -v transcript". (default: False)
-g, --select-genes A shortcuts for "-k feature -v gene". (default: False)
-e, --select-exons A shortcuts for "-k feature -v exon". (default: False)
-d, --select-cds A shortcuts for "-k feature -v CDS". (default: False)
-a, --select-start-codon A shortcuts for "-k feature -v start_codon". (default: False)
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The key name. (default: None)
-c, --col The column number (one-based) that contains the values in the file. File is tab-delimited. (default: 1)
-n, --invert-match Not/invert match. Select lines whose selected key is not associated with the selected values. (default: False)
-b, --bed-format Ask for bed format output. (default: False)
-m, --names If Bed output. The key(s) that should be used as name. (default: gene_id,transcript_id)
-s, --separator If Bed output. The separator to be used for separating name elements (see -n). (default: |)
-l, --log Print some statistics about selected features. To be used in conjunction with -V 1/2. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_regexp
Description: Select lines by testing the values of a particular key against a regular expression.
Example: Select lines whose gene_id matches the regular expression ‘G.*9$’.
$ gtftk select_by_regexp -i simple.gtf -k gene_id -r 'G.*9$'
chr1 gtftk gene 3 14 . - . gene_id "G0009";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001";
chr1 gtftk CDS 5 10 . - . gene_id "G0009"; transcript_id "G0009T002"; ccds_id "CDS_G0009T002";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001";
chr1 gtftk CDS 3 8 . - . gene_id "G0009"; transcript_id "G0009T001"; ccds_id "CDS_G0009T001";
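To see why only G0009 is returned, the pattern can be tried on the ten gene identifiers of simple.gtf with Python's `re` module. This is a sketch that assumes unanchored `re.search` semantics, which may differ from what gtftk does internally. The function name `select_by_regexp` is reused here only for illustration, and the chromosome names in the second call are made up to exercise the default pattern shown in the help.

```python
import re


def select_by_regexp(values, regexp, invert=False):
    """Keep values that match (or, with invert=True, do not match) `regexp`."""
    pattern = re.compile(regexp)
    return [v for v in values if bool(pattern.search(v)) != invert]


gene_ids = ["G%04d" % i for i in range(1, 11)]  # G0001 .. G0010
print(select_by_regexp(gene_ids, r"G.*9$"))  # ['G0009']

# Default pattern from the help: conventional chromosomes only.
chroms = ["chr1", "chr22", "chrX", "chrY", "chrM", "chr1_random"]
print(select_by_regexp(chroms, r"^chr[0-9XY]+$"))  # ['chr1', 'chr22', 'chrX', 'chrY']
```

Note that the default pattern is permissive: it would also accept names such as chr99 or chrXY, since the character class may repeat.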
Arguments:
$ gtftk select_by_regexp -h
Usage: gtftk select_by_regexp [-i GTF] [-o GTF] [-k KEY] [-r regexp] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select lines from a GTF file based on a regexp.
Notes:
* The default is to try to select feature from conventional human chromosome (chr1..chr22,
chrX and chrY) with --key set to chrom and --regexp set to "^chr[0-9XY]+$".
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The key name (default: chrom)
-r, --regexp The regular expression. (default: ^chr[0-9XY]+$)
-n, --invert-match Not/invert match. Selected lines whose requested key do not match the regexp. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_intron_size
Description: Delete genes containing an intron whose size is below s. If -m is selected, any gene whose summed intron length is above s is deleted. Monoexonic genes are kept.
Example: Some genes having transcripts containing an intron whose size is below 200 nucleotides.
$ gtftk select_by_intron_size -s 200 -vd -i tiny_real.gtf.gz | gtftk intron_sizes | gtftk tabulate -k gene_name,transcript_id,intron_sizes -Hun
MCAM ENST00000526992 159,98
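The intron sizes reported above are the gaps between consecutive exons of a transcript. A minimal sketch of that computation follows, assuming 1-based inclusive exon coordinates as in GTF. The function names are hypothetical, and the exon coordinates are those of G0002, G0003 and G0004T002 in simple.gtf.

```python
def intron_sizes(exons):
    """Sizes of the gaps between consecutive exons given as (start, end) pairs."""
    exons = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(exons, exons[1:])]


def has_short_intron(exons, size):
    """True if at least one intron is shorter than `size`."""
    return any(s < size for s in intron_sizes(exons))


g0002 = [(180, 189)]                      # monoexonic: no intron
g0003 = [(50, 54), (57, 61)]              # one intron (positions 55-56)
g0004 = [(65, 68), (71, 71), (74, 76)]    # two introns

print(intron_sizes(g0002), intron_sizes(g0003), intron_sizes(g0004))
# [] [2] [2, 2]
```

A monoexonic transcript yields an empty list, so it can never be flagged as having a short intron, which is consistent with monoexonic genes being kept.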
Arguments:
$ gtftk select_by_intron_size -h
select_by_max_exon_nb
Description: For each gene, select the transcript with the highest number of exons.
Example: For each gene, keep the transcript with the most exons, then print only the transcript lines.
$ gtftk select_by_max_exon_nb -i simple.gtf | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001";
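The tie-breaking rule stated in the help ("If ties, select the first encountered") explains why G0004T002 rather than G0004T001 appears above. A small sketch of that rule, with a hypothetical function name and exon counts taken from simple.gtf:

```python
def select_max_exon_nb(exon_counts):
    """exon_counts: (gene_id, transcript_id, nb_exons) tuples in file order."""
    best = {}
    for gene, tx, nb_exons in exon_counts:
        # The strict '>' keeps the first transcript encountered when counts tie.
        if gene not in best or nb_exons > best[gene][1]:
            best[gene] = (tx, nb_exons)
    return {gene: tx for gene, (tx, _) in best.items()}


exon_counts = [
    ("G0004", "G0004T002", 3),
    ("G0004", "G0004T001", 3),   # tie with G0004T002, which came first
    ("G0006", "G0006T001", 3),
    ("G0006", "G0006T002", 2),
]
print(select_max_exon_nb(exon_counts))
# {'G0004': 'G0004T002', 'G0006': 'G0006T001'}
```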
Arguments:
$ gtftk select_by_max_exon_nb -h
Usage: gtftk select_by_max_exon_nb [-i GTF] [-o GTF] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
For each gene select the transcript with the highest number of exons. If ties, select the first
encountered.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_loc
Description: Select transcripts/genes overlapping the given locations. A transcript is defined here as the genomic region from TSS to TTS, including introns. This command returns the transcript and all its associated elements (exons, UTRs…) even if only a fraction (e.g. an intron) of the transcript overlaps the location. If -t/--ft-type is set to ‘gene’, it returns the gene and all its associated elements.
Example: Select transcripts at a given location.
$ gtftk select_by_key -k feature -v transcript -i simple.gtf | gtftk select_by_loc -l chr1:10-15
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
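The overlap test behind this example can be sketched as follows. GTF coordinates are 1-based and inclusive, whereas the help describes -l locations as 0-based; the sketch additionally assumes that locations are half-open, as in BED, which the help does not state. The function names are hypothetical.

```python
def parse_location(loc):
    """Split 'chr:start-end' into (chrom, start, end)."""
    chrom, span = loc.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def overlaps(chrom, gtf_start, gtf_end, loc):
    """True if a GTF feature (1-based, inclusive) overlaps `loc`.

    Assumption: `loc` is 0-based and half-open, as in BED.
    """
    loc_chrom, loc_start, loc_end = parse_location(loc)
    # The feature spans [gtf_start - 1, gtf_end) in 0-based half-open terms.
    return chrom == loc_chrom and gtf_start - 1 < loc_end and loc_start < gtf_end


print(overlaps("chr1", 3, 14, "chr1:10-15"))   # True  (G0009 transcripts)
print(overlaps("chr1", 22, 35, "chr1:10-15"))  # False (G0006T001)
```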
Arguments:
$ gtftk select_by_loc -h
Usage: gtftk select_by_loc [-i GTF] [-o GTF] (-l LOC | -f BEDFILE) [-t {transcript,gene}] [-n] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select transcripts/gene overlapping a given locations.
Notes:
* A transcript is defined here as the genomic region from TSS to TTS including introns.
* This function will return the transcript and all its associated elements (exons, utr...)
even if only a fraction (e.g intron) of the transcript is overlapping the feature.
* If -/-ft-type is set to 'gene' returns the gene and all its associated elements.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-l, --location List of chromosomal locations (chr:start-end[,chr:start-end]). 0-based (default: None)
-f, --location-file Bed file with chromosomal location. (default: None)
-t, --ft-type The feature of interest. (default: transcript)
-n, --invert-match Not/invert match. Select transcript not overlapping. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_nb_exon
Description: Select transcripts based on the number of exons.
Example: Select transcripts with at least two exons (-m 2).
$ gtftk select_by_nb_exon -m 2 -i simple.gtf | gtftk nb_exons| gtftk select_by_key -t
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; nb_exons "2";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; nb_exons "3";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; nb_exons "3";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; nb_exons "2";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; nb_exons "3";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; nb_exons "2";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; nb_exons "2";
Arguments:
$ gtftk select_by_nb_exon -h
Usage: gtftk select_by_nb_exon [-i GTF] [-o GTF] [-m min_exon_number] [-M max_exon_number] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select transcripts based on the number of exons.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --min-exon-number Minimum number of exons. (default: 0)
-M, --max-exon-number Maximum number of exons. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
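select_by_numeric_value
Description: Select lines from a GTF file based on a boolean test on numeric values.
Example: Join numeric values (S1, S2) to the GTF, then select lines with a boolean test on start, end, S1 and S2.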
$ gtftk join_attr -i simple.gtf -j simple.join_mat -k gene_id -m| gtftk select_by_numeric_value -t 'start < 10 and end > 10 and S1 == 0.5555 and S2 == 0.7' -n ".,?"
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; S1 "0.5555"; S2 "0.7";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; exon_id "G0009T002E001"; S1 "0.5555"; S2 "0.7";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; S1 "0.5555"; S2 "0.7";
chr1 gtftk exon 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; exon_id "G0009T001E001"; S1 "0.5555"; S2 "0.7";
Arguments:
$ gtftk select_by_numeric_value -h
Usage: gtftk select_by_numeric_value [-i GTF] [-o GTF] -t test [-n na_omit] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select lines from a GTF file based on a boolean test on numeric values.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-t, --test The test to be applied. (default: None)
-n, --na-omit If one of the evaluated values is enclosed in this list (csv), line is skipped. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
random_list
Description: Select a random list of genes or transcripts.
Example: Randomly select 3 transcripts.
$ gtftk random_list -n 3 -i simple.gtf | gtftk count
transcript 3
exon 6
CDS 5
Arguments:
$ gtftk random_list -h
Usage: gtftk random_list [-i GTF] [-o GTF] [-n NUMBER] [-t {gene,transcript}] [-s SEED] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select a random list of genes or transcripts. Note that if transcripts are requested the 'gene'
feature is not returned.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-n, --number The number of transcripts or gene to select. (default: 1)
-t, --ft-type The type of feature. (default: transcript)
-s, --seed-value Seed value for the random number generator. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
random_tx
Description: Randomly select up to m transcripts for each gene.
Example: Randomly select 1 transcript per gene (-m 1).
$ gtftk random_tx -m 1 -i simple.gtf | gtftk select_by_key -k feature -v gene,transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T001
G0002 G0002T001
G0003 G0003T001
G0004 G0004T001
G0005 G0005T001
G0006 G0006T001
G0007 G0007T002
G0008 G0008T001
G0009 G0009T002
G0010 G0010T001
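Because the selection is random, the table above will differ between runs unless a seed is given (-s). The idea can be sketched with Python's `random` module. This is an assumption about the general approach, not gtftk's actual sampling code, and a given seed will not reproduce gtftk's own picks.

```python
import random


def random_tx(tx_by_gene, max_tx=1, seed=None):
    """Pick up to `max_tx` transcripts at random for each gene."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return {
        gene: rng.sample(txs, min(max_tx, len(txs)))
        for gene, txs in tx_by_gene.items()
    }


tx_by_gene = {
    "G0001": ["G0001T002", "G0001T001"],
    "G0002": ["G0002T001"],
    "G0006": ["G0006T001", "G0006T002"],
}
first = random_tx(tx_by_gene, max_tx=1, seed=42)
second = random_tx(tx_by_gene, max_tx=1, seed=42)
print(first == second)  # True: same seed, same selection
```

Genes with fewer than `max_tx` transcripts simply return all of them, which matches the "up to m" wording of the description.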
Arguments:
$ gtftk random_tx -h
Usage: gtftk random_tx [-i GTF] [-o GTF] [-m MAX] [-s SEED] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select randomly up to m transcript for each gene.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --max-transcript The maximum number of transcripts to select for each gene. (default: 1)
-s, --seed-value Seed value for the random number generator. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
rm_dup_tss
Description: If several transcripts of a gene share the same TSS, select only one.
Example: Use rm_dup_tss to select transcripts that will be used for mk_matrix -k 5 (see later).
$ gtftk rm_dup_tss -i simple.gtf | gtftk select_by_key -k feature -v transcript
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001";
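The selection shown above can be reproduced with a short sketch: the TSS is the start coordinate on the + strand and the end coordinate on the - strand, and the help states that the alphanumeric order of transcript_id picks the representative. The function names are hypothetical, and Python's string comparison stands in for "alphanumeric order".

```python
def tss(start, end, strand):
    """TSS coordinate of a transcript given its GTF start, end and strand."""
    return start if strand == "+" else end


def rm_dup_tss(transcripts):
    """transcripts: (gene_id, transcript_id, start, end, strand) tuples."""
    chosen = {}
    for gene, tx, start, end, strand in transcripts:
        key = (gene, tss(start, end, strand))
        # Keep the transcript_id that sorts first for each (gene, TSS) pair.
        if key not in chosen or tx < chosen[key]:
            chosen[key] = tx
    return sorted(chosen.values())


transcripts = [
    ("G0001", "G0001T002", 125, 138, "+"),
    ("G0001", "G0001T001", 125, 138, "+"),
    ("G0006", "G0006T001", 22, 35, "-"),
    ("G0006", "G0006T002", 28, 35, "-"),  # same TSS (35) as G0006T001
]
print(rm_dup_tss(transcripts))  # ['G0001T001', 'G0006T001']
```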
Arguments:
$ gtftk rm_dup_tss -h
Usage: gtftk rm_dup_tss [-i GTF] [-o GTF] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
If several transcripts of a gene share the same TSS, select one transcript per TSS.
Notes:
* The alphanumeric order of transcript_id is used to select the representative of a TSS.
Argument:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_go
Description: Select genes from a GTF file using a Gene Ontology ID (e.g. GO:0050789).
Example: Select genes with transcription factor activity from the GTF. They could be used subsequently to test their epigenetic features (see later).
$ # gtftk select_by_go -s hsapiens -i mini_real.gtf.gz | gtftk select_by_key -k feature -v gene | gtftk tabulate -k gene_id,gene_name -Hun | head -6
Arguments:
$ gtftk select_by_go -h
Usage: gtftk select_by_go [-i GTF] [-o GTF] [-g go_id] (-l | -s species) [-n] [-p1 http_proxy] [-p2 https_proxy] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select lines/genes from a GTF file using a Gene Ontology ID (e.g GO:0097194).
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-g, --go-id The GO ID (with or without "GO:" prefix). (default: GO:0003700)
-l, --list-datasets Do not select lines. Only get a list of available datasets/species. (default: False)
-s, --species The dataset/species. (default: None)
-n, --invert-match Not/invert match. (default: False)
-p1, --http-proxy Use this http proxy (not tested/experimental). (default: )
-p2, --https-proxy Use this https proxy (not tested/experimental). (default: )
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_by_tx_size
Description: Select transcripts based on their size (i.e. the size of the mature/spliced transcript).
Example: Select transcripts whose mature size is at least 14 nucleotides (-m 14).
$ gtftk feature_size -t mature_rna -i simple.gtf | gtftk select_by_tx_size -m 14 | gtftk tabulate -n -k gene_id,transcript_id,feat_size
gene_id transcript_id feat_size
G0001 G0001T002 14
G0001 G0001T001 14
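The mature size used here is the sum of exon lengths, introns excluded. A brief sketch follows, with exon coordinates taken from simple.gtf. The function names are hypothetical. The example above shows that the minimum bound is inclusive; treating the maximum bound as inclusive too is an assumption.

```python
def mature_size(exons):
    """Spliced transcript size from 1-based inclusive (start, end) exons."""
    return sum(end - start + 1 for start, end in exons)


def select_by_tx_size(exons_by_tx, min_size=0, max_size=1000000000):
    """Keep transcripts whose mature size lies within [min_size, max_size]."""
    return [
        tx for tx, exons in exons_by_tx.items()
        if min_size <= mature_size(exons) <= max_size
    ]


exons_by_tx = {
    "G0001T002": [(125, 138)],
    "G0006T001": [(22, 25), (28, 30), (33, 35)],
    "G0009T002": [(3, 14)],
}
print({tx: mature_size(exons) for tx, exons in exons_by_tx.items()})
# {'G0001T002': 14, 'G0006T001': 10, 'G0009T002': 12}
print(select_by_tx_size(exons_by_tx, min_size=14))  # ['G0001T002']
```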
Arguments:
$ gtftk select_by_tx_size -h
Usage: gtftk select_by_tx_size [-i GTF] [-o GTF] [-m min_size] [-M max_size] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select transcript based on their size (i.e size of mature/spliced transcript).
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-m, --min-size Minimum size. (default: 0)
-M, --max-size Maximum size. (default: 1000000000)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
select_most_5p_tx
Description: Select the most 5’ transcript of each gene.
Example: Select the most 5’ transcript of each gene and tabulate the result.
$ gtftk select_most_5p_tx -i simple.gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T002
G0002 G0002T001
G0003 G0003T001
G0004 G0004T002
G0005 G0005T001
G0006 G0006T001
G0007 G0007T001
G0008 G0008T001
G0009 G0009T002
G0010 G0010T001
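"Most 5'" depends on the strand: the smallest start on the + strand, the largest end on the - strand. The sketch below encodes that with a hypothetical function. Keeping the first transcript encountered on a tie is an assumption that happens to agree with the table above, since the help only says that one transcript is selected. The gene GX is invented to show a case without a tie.

```python
def select_most_5p_tx(transcripts):
    """transcripts: (gene_id, transcript_id, start, end, strand) in file order."""
    best = {}
    for gene, tx, start, end, strand in transcripts:
        # Rank so that a smaller value always means "more 5'".
        rank = start if strand == "+" else -end
        if gene not in best or rank < best[gene][0]:
            best[gene] = (rank, tx)
    return {gene: tx for gene, (_, tx) in best.items()}


transcripts = [
    ("G0006", "G0006T001", 22, 35, "-"),
    ("G0006", "G0006T002", 28, 35, "-"),   # same TSS: first encountered is kept
    ("GX", "GXT001", 100, 200, "+"),       # invented gene
    ("GX", "GXT002", 80, 200, "+"),        # starts further 5' on the + strand
]
print(select_most_5p_tx(transcripts))
# {'G0006': 'G0006T001', 'GX': 'GXT002'}
```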
Arguments:
$ gtftk select_most_5p_tx -h
Usage: gtftk select_most_5p_tx [-i GTF] [-o GTF] [-g] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select the most 5' transcript of each gene.
Notes:
* If several transcripts share the same most 5' TSS, only one transcript is selected.
optional arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-g, --keep-gene-lines Add gene lines to the output (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
short_long
Description: Get the shortest or longest transcript of each gene.
Example: Select the shortest transcript of each gene (the default).
$ gtftk short_long -i simple.gtf | gtftk select_by_key -k feature -v transcript| gtftk tabulate -k gene_id,transcript_id
gene_id transcript_id
G0001 G0001T002
G0002 G0002T001
G0003 G0003T001
G0004 G0004T002
G0005 G0005T001
G0006 G0006T002
G0007 G0007T001
G0008 G0008T001
G0009 G0009T002
G0010 G0010T001
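A sketch of the per-gene choice, taking precomputed mature sizes as input (sum of exon lengths, here derived by hand from simple.gtf). The function signature is hypothetical, and keeping the first transcript encountered on a tie is an assumption that agrees with the table above.

```python
def short_long(sizes, longest=False):
    """sizes: (gene_id, transcript_id, mature_size) tuples in file order."""
    by_gene = {}
    for gene, tx, size in sizes:
        by_gene.setdefault(gene, []).append((size, tx))
    # min() and max() both return the first item encountered among ties.
    pick = max if longest else min
    return {gene: pick(txs, key=lambda t: t[0])[1] for gene, txs in by_gene.items()}


sizes = [
    ("G0004", "G0004T002", 8),
    ("G0004", "G0004T001", 8),
    ("G0006", "G0006T001", 10),
    ("G0006", "G0006T002", 6),
]
print(short_long(sizes))                # {'G0004': 'G0004T002', 'G0006': 'G0006T002'}
print(short_long(sizes, longest=True))  # {'G0004': 'G0004T002', 'G0006': 'G0006T001'}
```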
Arguments:
$ gtftk short_long -h
Usage: gtftk short_long [-i GTF] [-o GTF] [-l] [-g] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Select the shortest mature transcript (i.e without introns) for each gene or the longest if the -l
arguments is used.
Notes:
*
Argument:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-l, --longs Take the longest transcript of each gene (default: False)
-g, --keep-gene-lines Add gene lines to the output (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)