Commands from section ‘sequence’
In this section we will require the following datasets:
$ gtftk get_example -q -d simple -f '*'
get_tx_seq
Description: Get transcript sequences in FASTA format.
Example: Get transcript sequences in 5’ to 3’ orientation.
$ gtftk get_tx_seq -g simple.fa -i simple.gtf | head -n 4
>transcript|G0001T002|G0001|chr1|125|138
cccccgttacgtag
>transcript|G0001T001|G0001|chr1|125|138
cccccgttacgtag
Note that the header format is flexible: any combination of keys can be exported to the header with -l.
$ gtftk get_tx_seq -i simple.gtf -g simple.fa -l gene_id,transcript_id,feature,chrom,start,end,strand | head -n 2
>G0001|G0001T002|transcript|chr1|125|138|+
cccccgttacgtag
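Because the header is just the requested keys joined by the separator, it can be split back into named fields downstream. The following is a minimal Python sketch, not part of gtftk: `parse_header` is a hypothetical helper, and it assumes that no exported value contains the separator itself (choose another one with -s if it does).

```python
def parse_header(header, labels, sep="|"):
    """Map each field of a FASTA header to the label it was exported under.

    Hypothetical helper: `labels` must be the same list, in the same
    order, as the one passed to -l when the file was produced.
    """
    fields = header.lstrip(">").split(sep)
    if len(fields) != len(labels):
        raise ValueError(
            f"expected {len(labels)} fields, found {len(fields)}: {header!r}"
        )
    return dict(zip(labels, fields))


# Same label list and header line as in the example above.
labels = "gene_id,transcript_id,feature,chrom,start,end,strand".split(",")
record = parse_header(">G0001|G0001T002|transcript|chr1|125|138|+", labels)

print(record["transcript_id"], record["strand"])  # prints: G0001T002 +
```

Note that every value comes back as a string, so coordinates such as `start` and `end` need an explicit `int()` conversion before doing arithmetic on them.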
Arguments:
$ gtftk get_tx_seq -h
Usage: gtftk get_tx_seq [-i GTF] [-o FASTA] -g FASTA [-w] [-s SEP] [-l label] [-f] [-d] [-a assembly] [-c] [-n] [-e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Get transcripts sequences in a flexible fasta format from a GTF file.
Notes:
* The sequences are returned in 5' to 3' orientation.
* If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.
* The first time a genome is used, an index (*.fa.gtftk) will be created in ~/.gtftk.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output FASTA file. (default: <stdout>)
-g, --genome The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
-w, --with-introns Set to true to include intronic regions. (default: False)
-s, --separator To separate info in header. (default: |)
-l, --label A set of key for the header. (default: feature,transcript_id,gene_id,seqid,start,end)
-f, --sleuth-format Produce output in sleuth format (still experimental). (default: False)
-d, --delete-version In case of --sleuth-format, delete gene_id or transcript_id version number (e.g '.2' in ENSG56765.2). (default: False)
-a, --assembly In case of --sleuth-format, an assembly version. (default: GRCm38)
-c, --del-chr When using --sleuth-format delete 'chr' in sequence id. (default: False)
-n, --no-rev-comp Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
-e, --explicit Write explicitly the name of the keys in the header. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
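The -n/--no-rev-comp option listed above disables reverse complementation of sequences from genes on the minus strand. As a reminder of what that operation does, here is a small Python sketch that is independent of gtftk and does not claim to reproduce its implementation. The function name `reverse_complement` is chosen for illustration, and the input string is simply borrowed from the example output above as test data.

```python
# Translation table for the four bases plus N, in lower and upper case.
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence.

    Each base is swapped for its complement, then the string is reversed,
    so the result reads 5' to 3' on the opposite strand.
    """
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("aacg"))            # prints: cgtt
print(reverse_complement("cccccgttacgtag"))
```

Applying the function twice gives back the original sequence, which is a convenient sanity check when comparing output produced with and without -n.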
get_feat_seq
Description: Get feature sequences (e.g. exon, UTR…).
Example:
$ gtftk get_feat_seq -i simple.gtf -g simple.fa -l feature,transcript_id,start -t exon -n | head -10
index file simple.fa.fai not found, generating...
>exon|G0001T002|124
cccccgttacgtag
>exon|G0001T001|124
cccccgttacgtag
>exon|G0002T001|179
ggccttatta
>exon|G0003T001|49
caagc
>exon|G0003T001|56
taatt
Arguments:
$ gtftk get_feat_seq -h
Usage: gtftk get_feat_seq [-i GTF] [-o FASTA] -g FASTA [-s separator] [-l label] [-t feature_type] [-n] [-r] [-u] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]
Description:
Get feature sequences (i.e. column 3) in a flexible fasta format from a GTF file.
Notes:
* The sequences are returned in 5' to 3' orientation.
* If you want to use wildcards, use quotes: e.g. 'foo/bar*.fa'.
* See get_tx_seq for mature RNA sequence.
* If --unique is used if a header was already encountered the record won't be print. Take
care to use unambiguous identifiers in the header.
Arguments:
-i, --inputfile Path to the GTF file. Default to STDIN (default: <stdin>)
-o, --outputfile Output FASTA file. (default: <stdout>)
-g, --genome The genome in fasta format. (default: None)
-s, --separator To separate info in header. (default: |)
-l, --label A set of key for the header that will be extracted from the transcript line. (default: feature,transcript_id,gene_id,seqid,start,end)
-t, --feature-type The feature type (one defined in column 3). (default: exon)
-n, --no-rev-comp Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
-r, --rev-comp-to-header Indicate in the header whether sequence was rev-complemented. (default: False)
-u, --unique Don't write redondant IDS. (default: False)
Command-wise optional arguments:
-h, --help Show this help message and exit.
-V, --verbosity Set output verbosity ([0-3]). (default: 0)
-D, --no-date Do not add date to output file names. (default: False)
-C, --add-chr Add 'chr' to chromosome names before printing output. (default: False)
-K, --tmp-dir Keep all temporary files into this folder. (default: None)
-A, --keep-all Try to keep all temporary files even if process does not terminate normally. (default: False)
-L, --logger-file Stores the arguments passed to the command into a file. (default: None)
-W, --write-message-to-file Store all message into a file. (default: None)
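The note on --unique says that a record is not printed when its header has already been encountered, which is why ambiguous headers can silently drop records. The Python sketch below illustrates that effect on a small in-memory FASTA text. It is an assumption-laden illustration rather than gtftk's code: `read_fasta` and `unique_by_header` are hypothetical helpers, and the headers imitate what a label set such as feature,gene_id might produce for two transcripts of the same gene.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])   # start a new record
        elif records:
            records[-1][1] += line           # append to current sequence
    return [(header, seq) for header, seq in records]


def unique_by_header(records):
    """Keep only the first record seen for each header."""
    seen = set()
    kept = []
    for header, seq in records:
        if header in seen:
            continue                         # header already written: skip
        seen.add(header)
        kept.append((header, seq))
    return kept


# Illustrative input: two records share the header 'exon|G0001'.
fasta = """>exon|G0001
cccccgttacgtag
>exon|G0001
cccccgttacgtag
>exon|G0002
ggccttatta
"""

kept = unique_by_header(read_fasta(fasta))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

In this sketch the second `exon|G0001` record is dropped even though it could belong to a different transcript, so including a key such as transcript_id or start in the header avoids losing records unintentionally.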