Commands from section ‘Editing’
In this section we will require the following datasets:
$ gtftk get_example -q -d simple -f '*'
$ gtftk get_example -q -d mini_real -f "*"
add_prefix
Description: Add a prefix (or suffix) to the values of one attribute (e.g. gene_id).
Example:
$ gtftk add_prefix -i simple.gtf -k transcript_id -t "novel_"| head -2
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "novel_G0001T002";
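To make the effect of -k, -t and -s concrete, the following is a minimal Python sketch of the transformation applied to a single GTF line. It is not the gtftk implementation: the function name `add_affix` is hypothetical, the line is assumed to be tab-separated with attributes in column 9 written as `key "value";`, and the special key `chrom` is assumed to designate the first column.

```python
import re


def add_affix(gtf_line, key, text, suffix=False):
    """Return gtf_line with `text` prepended (or appended) to the value of `key`.

    Hypothetical helper for illustration only; not part of gtftk.
    """
    fields = gtf_line.rstrip("\n").split("\t")
    if key == "chrom":
        # Assumption: 'chrom' refers to the first GTF column (seqid).
        fields[0] = fields[0] + text if suffix else text + fields[0]
    else:
        # Attributes live in column 9 and look like: key "value";
        pattern = re.compile(r'(\b%s ")([^"]*)(")' % re.escape(key))

        def repl(match):
            value = match.group(2)
            value = value + text if suffix else text + value
            return match.group(1) + value + match.group(3)

        fields[8] = pattern.sub(repl, fields[8])
    return "\t".join(fields)


transcript = ('chr1\tgtftk\ttranscript\t125\t138\t.\t+\t.\t'
              'gene_id "G0001"; transcript_id "G0001T002";')
print(add_affix(transcript, "transcript_id", "novel_"))
```

Lines that do not carry the requested key (such as the gene line in the example above) are returned unchanged by this sketch.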
Arguments:
$ gtftk add_prefix -h
Usage: gtftk add_prefix [-i GTF] [-o GTF] [-k KEY] [-t TEXT] [-s] [-f target_feature] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Add a prefix to target values. By default add 'chr' to seqid/chromosome key.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key The name of the attribute for which a prefix/suffix is to be added to the corresponding values (e.g, gene_id, transcript_id...). (default: chrom)
-t, --text The character string to add as a prefix to the values. (default: chr)
-s, --suffix The character string to add as a prefix to the values. (default: False)
-f, --target-feature The name of the target feature. (default: *)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
del_attr
Description: Delete an attribute and its corresponding values.
Example:
$ gtftk del_attr -i simple.gtf -k transcript_id,exon_id
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001";
chr1 gtftk exon 125 138 . + . gene_id "G0001";
chr1 gtftk CDS 125 130 . + . gene_id "G0001"; ccds_id "CDS_G0001T002";
chr1 gtftk transcript 125 138 . + . gene_id "G0001";
chr1 gtftk exon 125 138 . + . gene_id "G0001";
chr1 gtftk CDS 130 132 . + . gene_id "G0001"; ccds_id "CDS_G0001T001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk transcript 180 189 . + . gene_id "G0002";
chr1 gtftk exon 180 189 . + . gene_id "G0002";
chr1 gtftk CDS 180 182 . + . gene_id "G0002"; ccds_id "CDS_G0002T001";
chr1 gtftk gene 50 61 . - . gene_id "G0003";
chr1 gtftk transcript 50 61 . - . gene_id "G0003";
chr1 gtftk exon 50 54 . - . gene_id "G0003";
chr1 gtftk exon 57 61 . - . gene_id "G0003";
chr1 gtftk CDS 50 52 . - . gene_id "G0003"; ccds_id "CDS_G0003T001";
chr1 gtftk gene 65 76 . + . gene_id "G0004";
chr1 gtftk transcript 65 76 . + . gene_id "G0004";
chr1 gtftk exon 65 68 . + . gene_id "G0004";
chr1 gtftk exon 71 71 . + . gene_id "G0004";
chr1 gtftk exon 74 76 . + . gene_id "G0004";
chr1 gtftk CDS 66 68 . + . gene_id "G0004"; ccds_id "CDS_G0004T002";
chr1 gtftk CDS 71 71 . + . gene_id "G0004"; ccds_id "CDS_G0004T002";
chr1 gtftk CDS 74 75 . + . gene_id "G0004"; ccds_id "CDS_G0004T002";
chr1 gtftk transcript 65 76 . + . gene_id "G0004";
chr1 gtftk exon 65 68 . + . gene_id "G0004";
chr1 gtftk exon 71 71 . + . gene_id "G0004";
chr1 gtftk exon 74 76 . + . gene_id "G0004";
chr1 gtftk CDS 65 67 . + . gene_id "G0004"; ccds_id "CDS_G0004T001";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk transcript 33 47 . - . gene_id "G0005";
chr1 gtftk exon 33 35 . - . gene_id "G0005";
chr1 gtftk exon 42 47 . - . gene_id "G0005";
chr1 gtftk CDS 43 45 . - . gene_id "G0005"; ccds_id "CDS_G0005T001";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk transcript 22 35 . - . gene_id "G0006";
chr1 gtftk exon 22 25 . - . gene_id "G0006";
chr1 gtftk exon 28 30 . - . gene_id "G0006";
chr1 gtftk exon 33 35 . - . gene_id "G0006";
chr1 gtftk CDS 22 25 . - . gene_id "G0006"; ccds_id "CDS_G0006T001";
chr1 gtftk CDS 28 30 . - . gene_id "G0006"; ccds_id "CDS_G0006T001";
chr1 gtftk CDS 33 34 . - . gene_id "G0006"; ccds_id "CDS_G0006T001";
chr1 gtftk transcript 28 35 . - . gene_id "G0006";
chr1 gtftk exon 28 30 . - . gene_id "G0006";
chr1 gtftk exon 33 35 . - . gene_id "G0006";
chr1 gtftk CDS 29 30 . - . gene_id "G0006"; ccds_id "CDS_G0006T002";
chr1 gtftk CDS 33 33 . - . gene_id "G0006"; ccds_id "CDS_G0006T002";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk transcript 107 116 . + . gene_id "G0007";
chr1 gtftk exon 107 116 . + . gene_id "G0007";
chr1 gtftk CDS 112 114 . + . gene_id "G0007"; ccds_id "CDS_G0007T001";
chr1 gtftk transcript 107 116 . + . gene_id "G0007";
chr1 gtftk exon 107 116 . + . gene_id "G0007";
chr1 gtftk CDS 110 115 . + . gene_id "G0007"; ccds_id "CDS_G0007T002";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk transcript 210 222 . - . gene_id "G0008";
chr1 gtftk exon 210 214 . - . gene_id "G0008";
chr1 gtftk exon 220 222 . - . gene_id "G0008";
chr1 gtftk CDS 211 213 . - . gene_id "G0008"; ccds_id "CDS_G0008T001";
chr1 gtftk gene 3 14 . - . gene_id "G0009";
chr1 gtftk transcript 3 14 . - . gene_id "G0009";
chr1 gtftk exon 3 14 . - . gene_id "G0009";
chr1 gtftk CDS 5 10 . - . gene_id "G0009"; ccds_id "CDS_G0009T002";
chr1 gtftk transcript 3 14 . - . gene_id "G0009";
chr1 gtftk exon 3 14 . - . gene_id "G0009";
chr1 gtftk CDS 3 8 . - . gene_id "G0009"; ccds_id "CDS_G0009T001";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
chr1 gtftk transcript 176 186 . + . gene_id "G0010";
chr1 gtftk exon 176 186 . + . gene_id "G0010";
chr1 gtftk CDS 184 186 . + . gene_id "G0010"; ccds_id "CDS_G0010T001";
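The selection logic behind -k, -r and -v can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attributes of one feature are represented as a plain dict, the function name `del_attr` is reused here for a hypothetical helper, and the regular expression is assumed to be applied with search semantics (the actual matching mode used by gtftk is not documented here).

```python
import re


def del_attr(attributes, keys, reg_exp=False, invert_match=False):
    """Return a copy of `attributes` without the selected keys.

    Hypothetical helper for illustration only; not the gtftk code.
    """
    if reg_exp:
        # Assumption: a key is selected when the pattern is found in it.
        pattern = re.compile(keys)
        selected = {k for k in attributes if pattern.search(k)}
    else:
        wanted = set(keys.split(","))
        selected = {k for k in attributes if k in wanted}
    if invert_match:
        # With -v, the keys that were NOT selected are deleted instead.
        selected = set(attributes) - selected
    return {k: v for k, v in attributes.items() if k not in selected}


cds = {"gene_id": "G0001", "transcript_id": "G0001T002",
       "ccds_id": "CDS_G0001T002"}
print(del_attr(cds, "transcript_id,exon_id"))
```

Keys listed in -k that are absent from a feature (exon_id on a CDS line, for instance) are simply ignored by this sketch.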
Arguments:
$ gtftk del_attr -h
Usage: gtftk del_attr [-i GTF] [-o GTF] -k KEY [-r] [-v] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Delete one or several attributes from the gtf file.
Notes:
* You may also use 'complex' regexp such as : "(_id)|(_b.*pe)"
* Example: gtftk get_example -d mini_real | gtftk del_attr -k "(^.*_id$|^.*_biotype$)" -r -v
* TODO: currently a segfault is thrown when no keys are left after deletion (libgtftk issue
#98).
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key comma-separated list of attribute names or a regular expression (see -r). (default: None)
-r, --reg-exp The key name is a regular expression. (default: False)
-v, --invert-match Delected keys are those not matching any of the specified key. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
Note
In version 1.0.9, gene_id cannot be deleted. This will be fixed in the next version.
join_attr
Description: Add attributes from a file. This command can be used to import additional key/values into the GTF (e.g. CPAT for coding potential, DESeq for differential analysis…). The imported file can be in one of two formats (2 columns or matrix).
With a 2-column file:
column 1: the value used for joining (transcript_id, gene_id…).
column 2: the corresponding value.
With a matrix (see -m):
rows: the joining keys (transcript_id, gene_id…).
columns: the names of the novel attributes.
Each cell of the matrix is the value of the corresponding attribute.
Example: With a 2-columns file.
$ cat simple.join
G0003 0.2322
G0004 0.999
G0009 0.5555
$ gtftk join_attr -i simple.gtf -k gene_id -j simple.join -n a_score -t gene| gtftk select_by_key -k feature -v gene
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; a_score "0.2322";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; a_score "0.999";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; a_score "0.5555";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Example: With a matrix-like file.
$ cat simple.join_mat
genes S1 S2
G0003 0.2322 0.4
G0004 0.999 0.6
G0009 0.5555 0.7
$ gtftk join_attr -i simple.gtf -k gene_id -j simple.join_mat -m -t gene| gtftk select_by_key -k feature -v gene
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S1 "0.2322"; S2 "0.4";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S1 "0.999"; S2 "0.6";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
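The two input layouts can be compared with a short Python sketch. It is an illustration rather than the gtftk implementation: the helper names (`read_two_columns`, `read_matrix`, `join`) are hypothetical, the join files are assumed to be tab-separated, and the attributes of a feature are modelled as a dict.

```python
import csv
import io


def read_two_columns(handle, new_key):
    """Map each joining value to {new_key: value} (2-column layout)."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: {new_key: row[1]} for row in reader if row}


def read_matrix(handle):
    """Map each row name to {column name: cell value} (matrix layout)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # first cell names the row-name column
    return {row[0]: dict(zip(header[1:], row[1:])) for row in reader if row}


def join(attributes, key, table):
    """Return `attributes` extended with the values found for attributes[key]."""
    return {**attributes, **table.get(attributes.get(key), {})}


two_col = io.StringIO("G0003\t0.2322\nG0004\t0.999\nG0009\t0.5555\n")
matrix = io.StringIO("genes\tS1\tS2\nG0003\t0.2322\t0.4\nG0004\t0.999\t0.6\n")

scores = read_two_columns(two_col, "a_score")
samples = read_matrix(matrix)
print(join({"gene_id": "G0003"}, "gene_id", scores))
print(join({"gene_id": "G0003"}, "gene_id", samples))
print(join({"gene_id": "G0001"}, "gene_id", scores))  # no match: unchanged
```

In the 2-column layout the attribute name has to be supplied separately (the role of -n), whereas in the matrix layout it comes from the header row.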
Arguments:
$ gtftk join_attr -h
Usage: gtftk join_attr [-i GTF] [-o GTF] -k KEY -j JOIN_FILE [-H] [-m] [-n NEW_KEY] [-t target_feature] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Join attributes from a tabulated file.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key-to-join The name of the key used to join (e.g transcript_id). (default: None)
-j, --join-file A two columns file with (i) the value for joining (e.g value for transcript_id), (ii) the value for novel key (e.g the coding potential computed value). (default: None)
-H, --has-header Indicates that the 'join-file' has a header. (default: False)
-m, --matrix 'join-file' expect a matrix with row names as target keys column names as novel key and each cell as value. (default: False)
-n, --new-key The name of the novel key. Mutually exclusive with --matrix. (default: None)
-t, --target-feature The name(s) of the target feature(s). comma-separated. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
join_multi_file
Description: Join attributes from multiple files.
Example: Add key/value to gene features.
$ cat simple.join_mat_2
genes S3 S4
G0003 A B
G0004 C D
G0009 E F
$ cat simple.join_mat_3
genes S5 S6
G0003 0.2322 0.4
G0004 0.999 0.6
G0009 0.5555 0.7
G0009 20 30
G0004 0.999 0.6
$ gtftk join_multi_file -i simple.gtf -k gene_id -t gene -m simple.join_mat_2 simple.join_mat_3| gtftk select_by_key -g
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S3 "A"; S4 "B"; S5 "0.2322"; S6 "0.4";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S3 "C"; S4 "D"; S5 "0.999|0.999"; S6 "0.6|0.6";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S3 "E"; S4 "F"; S5 "0.5555|20"; S6 "0.7|30";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
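In the output above, a row name that occurs several times in a matrix (G0004 and G0009 in simple.join_mat_3) ends up with its values concatenated with "|". A minimal Python sketch of that behaviour follows. The function name `merge_matrices` and the in-memory representation are assumptions made for illustration, not the gtftk code.

```python
from collections import defaultdict


def merge_matrices(matrices):
    """Combine several matrices into {row_name: {column: "v1|v2|..."}}.

    `matrices` is an iterable of (column_names, rows) pairs, each row
    being [row_name, value, value, ...]. Hypothetical helper.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for column_names, rows in matrices:
        for row_name, *values in rows:
            for column, value in zip(column_names, values):
                # Repeated row names accumulate their values in file order.
                collected[row_name][column].append(value)
    return {
        row_name: {column: "|".join(values) for column, values in columns.items()}
        for row_name, columns in collected.items()
    }


mat_2 = (["S3", "S4"], [["G0003", "A", "B"], ["G0004", "C", "D"],
                        ["G0009", "E", "F"]])
mat_3 = (["S5", "S6"], [["G0003", "0.2322", "0.4"], ["G0004", "0.999", "0.6"],
                        ["G0009", "0.5555", "0.7"], ["G0009", "20", "30"],
                        ["G0004", "0.999", "0.6"]])
print(merge_matrices([mat_2, mat_3])["G0009"])
```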
Arguments:
$ gtftk join_multi_file -h
Usage: gtftk join_multi_file [-i GTF] [-o GTF] -k KEY [-t target_feature] [-m matrix_files [matrix_files ...]] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Join attributes from mutiple files.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --key-to-join The name of the key used to join (e.g transcript_id). (default: None)
-t, --target-feature The name(s) of the target feature(s). Comma-separated. (default: None)
-m, --matrix-files A set of matrix files with row names as target keys column names as novel key and each cell as value. (default: None)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
merge_attr
Description: Merge a set of attributes into a destination attribute.
Example: Merge gene_id and transcript_id into a new key associated to transcript features.
$ gtftk merge_attr -i simple.gtf -k transcript_id,gene_id -d txgn_id -s "|" -f transcript | gtftk select_by_key -t
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T002"; txgn_id "G0001T002|G0001";
chr1 gtftk transcript 125 138 . + . gene_id "G0001"; transcript_id "G0001T001"; txgn_id "G0001T001|G0001";
chr1 gtftk transcript 180 189 . + . gene_id "G0002"; transcript_id "G0002T001"; txgn_id "G0002T001|G0002";
chr1 gtftk transcript 50 61 . - . gene_id "G0003"; transcript_id "G0003T001"; txgn_id "G0003T001|G0003";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T002"; txgn_id "G0004T002|G0004";
chr1 gtftk transcript 65 76 . + . gene_id "G0004"; transcript_id "G0004T001"; txgn_id "G0004T001|G0004";
chr1 gtftk transcript 33 47 . - . gene_id "G0005"; transcript_id "G0005T001"; txgn_id "G0005T001|G0005";
chr1 gtftk transcript 22 35 . - . gene_id "G0006"; transcript_id "G0006T001"; txgn_id "G0006T001|G0006";
chr1 gtftk transcript 28 35 . - . gene_id "G0006"; transcript_id "G0006T002"; txgn_id "G0006T002|G0006";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T001"; txgn_id "G0007T001|G0007";
chr1 gtftk transcript 107 116 . + . gene_id "G0007"; transcript_id "G0007T002"; txgn_id "G0007T002|G0007";
chr1 gtftk transcript 210 222 . - . gene_id "G0008"; transcript_id "G0008T001"; txgn_id "G0008T001|G0008";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T002"; txgn_id "G0009T002|G0009";
chr1 gtftk transcript 3 14 . - . gene_id "G0009"; transcript_id "G0009T001"; txgn_id "G0009T001|G0009";
chr1 gtftk transcript 176 186 . + . gene_id "G0010"; transcript_id "G0010T001"; txgn_id "G0010T001|G0010";
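The merge itself amounts to joining the source values, in the order given to -k, with the separator given to -s. A small Python sketch of this, assuming the attributes of a feature are held in a dict and that features lacking one of the source keys are left untouched (an assumption, since that case is not shown in the example), could look as follows. The helper name `merge_attr` is reused for illustration only.

```python
def merge_attr(attributes, src_keys, dest_key, sep="|"):
    """Return `attributes` with dest_key set to the joined source values.

    Hypothetical helper; features missing a source key are returned as-is
    (an assumption, not documented gtftk behaviour).
    """
    if not all(key in attributes for key in src_keys):
        return dict(attributes)
    merged = sep.join(attributes[key] for key in src_keys)
    return {**attributes, dest_key: merged}


transcript = {"gene_id": "G0001", "transcript_id": "G0001T002"}
print(merge_attr(transcript, ["transcript_id", "gene_id"], "txgn_id"))
```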
Arguments:
$ gtftk merge_attr -h
discretize_key
Description: Create a new key by discretizing a numeric key. This can be helpful to create new classes of features on the fly. By default, equally spaced intervals are created. The intervals can also be computed from percentiles (-p), which yields balanced classes and is generally more suitable.
Example: Let's say we have the following matrix giving the expression level of genes (rows) in samples (columns). We could join this information to the GTF and later choose to transform key S1 into a new discretized key S1_d. We may apply particular labels to this factor using -l.
$ cat simple.join_mat
genes S1 S2
G0003 0.2322 0.4
G0004 0.999 0.6
G0009 0.5555 0.7
$ gtftk join_attr -i simple.gtf -j simple.join_mat -k gene_id -m | gtftk discretize_key -k S1 -d S1_d -n 2 -l A,B | gtftk select_by_key -k feature -v gene
|-- 10:51-INFO-discretize_key : Categories: ['A', 'B']
chr1 gtftk gene 125 138 . + . gene_id "G0001";
chr1 gtftk gene 180 189 . + . gene_id "G0002";
chr1 gtftk gene 50 61 . - . gene_id "G0003"; S1 "0.2322"; S2 "0.4"; S1_d "A";
chr1 gtftk gene 65 76 . + . gene_id "G0004"; S1 "0.999"; S2 "0.6"; S1_d "B";
chr1 gtftk gene 33 47 . - . gene_id "G0005";
chr1 gtftk gene 22 35 . - . gene_id "G0006";
chr1 gtftk gene 107 116 . + . gene_id "G0007";
chr1 gtftk gene 210 222 . - . gene_id "G0008";
chr1 gtftk gene 3 14 . - . gene_id "G0009"; S1 "0.5555"; S2 "0.7"; S1_d "A";
chr1 gtftk gene 176 186 . + . gene_id "G0010";
Example: We want to load RNA-seq data in the GTF and discretize the expression values according to deciles (-p and -n set to 10). Classes will be labeled from A to J. The example below shows how balanced these classes will be.
See also
The profile command could be used to assess the epigenetic marks associated with these 10 gene classes.
$ gtftk join_attr -i mini_real.gtf.gz -H -j mini_real_counts_ENCFF630HEX.tsv -k gene_name -n exprs -t gene | gtftk discretize_key -k exprs -p -d exprs_class -n 10 -l A,B,C,D,E,F,G,H,I,J | gtftk tabulate -k exprs_class -Hn | sort | uniq -c
|-- 10:51-INFO-discretize_key : Categories: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
96 A
83 B
89 C
91 D
88 E
88 F
90 G
89 H
89 I
90 J
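The difference between the default mode (equally spaced intervals) and -p (percentiles) can be illustrated with a small pure-Python sketch. It is not the gtftk implementation: the function names are hypothetical, the equal-width version uses left-closed intervals with the maximum clamped into the last class, and the percentile version is a simplified rank-based split rather than a computation of percentile breaks.

```python
def equal_width(values, nb_levels, labels):
    """Assign each value to one of nb_levels equally spaced intervals."""
    low, high = min(values), max(values)
    width = (high - low) / nb_levels
    result = []
    for value in values:
        if width == 0:
            index = 0  # all values identical: a single class
        else:
            # Left-closed intervals; the maximum is clamped into the last class.
            index = min(int((value - low) / width), nb_levels - 1)
        result.append(labels[index])
    return result


def by_rank(values, nb_levels, labels):
    """Assign classes of (nearly) equal size based on the rank of each value.

    Simplification: ties are ordered by position, whereas a break-based
    method keeps equal values in the same class.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for rank, i in enumerate(order):
        result[i] = labels[rank * nb_levels // len(values)]
    return result


s1 = [0.2322, 0.999, 0.5555]  # S1 values of G0003, G0004, G0009
print(equal_width(s1, 2, ["A", "B"]))  # ['A', 'B', 'A']
print(by_rank([1, 2, 3, 100, 200, 300], 2, ["A", "B"]))
```

With equal-width intervals a few extreme values can leave most features in a single class, which is why the percentile mode tends to give more balanced classes on skewed data such as expression counts.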
Arguments:
$ gtftk discretize_key -h
Usage: gtftk discretize_key [-i GTF] [-o GTF] -k src_key -d dest_key -n KEY [-l labels] [-p] [-g] [-u] [-r precision] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Create a new key by discretizing a numeric key. This can be helpful to create new classes on the
fly that can be used subsequently.
Notes:
* if --ft-type is not set the destination key will be assigned to all feature containing the
source key.
* Non-numeric value for source key will be translated into 'NA'.
* The default is to create equally spaced interval. The interval can also be created by
computing the percentiles (-p).
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output file. (default: <stdout>)
-k, --src-key The name of the source key (default: None)
-d, --dest-key The name of the target key. (default: None)
-n, --nb-levels The number of levels/classes to create. (default: 2)
-l, --labels A comma-separated list of labels of size --nb-levels. (default: None)
-p, --percentiles Compute --nb-levels classes using percentiles. (default: False)
-g, --log Compute breaks based on log-scale. (default: False)
-u, --percentiles-of-uniq Compute percentiles based on non-redondant values. (default: False)
-r, --precision The precision used in naming intervals. (default: 2)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)