Commands from section ‘sequence’

In this section we will require the following datasets:

$ gtftk get_example -q -d simple -f '*'

get_tx_seq

Description: Get transcript sequences in fasta format.

Example: Get transcript sequences in 5’ to 3’ orientation.

$ gtftk get_tx_seq -g simple.fa -i simple.gtf | head -n 4
>transcript|G0001T002|G0001|chr1|125|138
cccccgttacgtag
>transcript|G0001T001|G0001|chr1|125|138
cccccgttacgtag

Note that the header format is flexible: any combination of keys can be exported to the header with -l/--label.

$ gtftk get_tx_seq -i simple.gtf -g simple.fa  -l gene_id,transcript_id,feature,chrom,start,end,strand  | head -n 2
>G0001|G0001T002|transcript|chr1|125|138|+
cccccgttacgtag
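
Because the header is just the requested keys joined by the separator, it can be split back into named fields downstream. The following is a minimal Python sketch of that idea, not part of gtftk: the function name parse_header is hypothetical, and it assumes that no field value contains the separator itself.

```python
def parse_header(header, labels, sep="|"):
    """Split one FASTA header line into a {label: value} dict.

    Hypothetical helper (not a gtftk function). Assumes the header was
    produced with the same labels and separator, and that no value
    contains the separator.
    """
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    fields = text.split(sep)
    if len(fields) != len(labels):
        raise ValueError(
            "expected %d fields, found %d" % (len(labels), len(fields))
        )
    return dict(zip(labels, fields))


# Same key list as passed to -l in the example above.
labels = "gene_id,transcript_id,feature,chrom,start,end,strand".split(",")
record = parse_header(">G0001|G0001T002|transcript|chr1|125|138|+", labels)
print(record["transcript_id"], record["strand"])
```

All values come back as strings, so coordinates such as start and end need an explicit int() conversion before any arithmetic.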

Arguments:

$ gtftk get_tx_seq -h
  Usage: gtftk get_tx_seq [-i GTF] [-o FASTA] -g FASTA [-w] [-s SEP] [-l label] [-f] [-d] [-a assembly] [-c] [-n] [-e] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Get transcripts sequences in a flexible fasta format from a GTF file.

  Notes:
     *  The sequences are returned in 5' to 3' orientation.
     *  If you want to use wildcards, use quotes :e.g. 'foo/bar*.fa'.
     *  The first time a genome is used, an index (*.fa.gtftk) will be created in ~/.gtftk.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output FASTA file. (default: <stdout>)
 -g, --genome                 The genome in fasta format. Accept path with wildcards (e.g. *.fa). (default: None)
 -w, --with-introns           Set to true to include intronic regions. (default: False)
 -s, --separator              To separate info in header. (default: |)
 -l, --label                  A set of key for the header. (default: feature,transcript_id,gene_id,seqid,start,end)
 -f, --sleuth-format          Produce output in sleuth format (still experimental). (default: False)
 -d, --delete-version         In case of --sleuth-format, delete gene_id or transcript_id version number (e.g '.2' in ENSG56765.2). (default: False)
 -a, --assembly               In case of --sleuth-format, an assembly version. (default: GRCm38)
 -c, --del-chr                When using --sleuth-format delete 'chr' in sequence id. (default: False)
 -n, --no-rev-comp            Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
 -e, --explicit               Write explicitly the name of the keys in the header. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)
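
The --delete-version option is described as removing the version suffix of an identifier (e.g. '.2' in ENSG56765.2). How gtftk performs this internally is not shown here; the Python sketch below only illustrates the transformation, under the assumption that a version is a trailing dot followed by digits. The name strip_version is hypothetical.

```python
import re

# Assumption: a version number is a final '.' followed by one or more digits.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(identifier):
    """Return the identifier without its trailing version number, if any."""
    return _VERSION_SUFFIX.sub("", identifier)


print(strip_version("ENSG56765.2"))  # identifier from the help text above
print(strip_version("G0001T002"))    # no version suffix: returned unchanged
```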

get_feat_seq

Description: Get feature sequences (e.g. exon, UTR…).

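Example: Get exon sequences, labelled with feature, transcript_id and start, without reverse-complementing features on the minus strand (-n).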

$ gtftk get_feat_seq -i simple.gtf -g simple.fa  -l feature,transcript_id,start -t  exon -n | head -10
index file simple.fa.fai not found, generating...
>exon|G0001T002|124
cccccgttacgtag
>exon|G0001T001|124
cccccgttacgtag
>exon|G0002T001|179
ggccttatta
>exon|G0003T001|49
caagc
>exon|G0003T001|56
taatt
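
To make the effect of -n/--no-rev-comp concrete, here is a minimal Python sketch of strand-aware orientation. It is an illustration only and not gtftk's implementation: the function names, the rev_comp parameter and the example sequence are all hypothetical, and only the bases a, c, g, t and n (in either case) are handled.

```python
# Translation table for complementing bases, preserving case.
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def oriented(seq, strand, rev_comp=True):
    """Return seq in 5' to 3' orientation for the given strand.

    With rev_comp=False (analogous in spirit to -n), the genomic
    sequence is returned as-is, whatever the strand.
    """
    if strand == "-" and rev_comp:
        return reverse_complement(seq)
    return seq


genomic = "aacg"  # arbitrary illustrative sequence, not taken from simple.fa
print(oriented(genomic, "+"))                  # plus strand: unchanged
print(oriented(genomic, "-"))                  # minus strand: reverse-complemented
print(oriented(genomic, "-", rev_comp=False))  # minus strand, left as-is
```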

Arguments:

$ gtftk get_feat_seq -h
  Usage: gtftk get_feat_seq [-i GTF] [-o FASTA] -g FASTA [-s separator] [-l label] [-t feature_type] [-n] [-r] [-u] [-h] [-V [verbosity]] [-D] [-C] [-K tmp_dir] [-A] [-L logger_file] [-W write_message_to_file]

  Description: 

     Get feature sequences (i.e. column 3) in a flexible fasta format from a GTF file.

  Notes:
     *  The sequences are returned in 5' to 3' orientation.
     *  If you want to use wildcards, use quotes: e.g. 'foo/bar*.fa'.
     *  See get_tx_seq for mature RNA sequence.
     *  If --unique is used if a header was already encountered the record won't be print.  Take
     care to use unambiguous identifiers in the header.

Arguments:
 -i, --inputfile              Path to the GTF file. Default to STDIN (default: <stdin>)
 -o, --outputfile             Output FASTA file. (default: <stdout>)
 -g, --genome                 The genome in fasta format. (default: None)
 -s, --separator              To separate info in header. (default: |)
 -l, --label                  A set of key for the header that will be extracted from the transcript line. (default: feature,transcript_id,gene_id,seqid,start,end)
 -t, --feature-type           The feature type (one defined in column 3). (default: exon)
 -n, --no-rev-comp            Don't reverse complement sequence corresponding to gene on minus strand. (default: False)
 -r, --rev-comp-to-header     Indicate in the header whether sequence was rev-complemented. (default: False)
 -u, --unique                 Don't write redondant IDS. (default: False)

Command-wise optional arguments:
 -h, --help                   Show this help message and exit.
 -V, --verbosity              Set output verbosity ([0-3]). (default: 0)
 -D, --no-date                Do not add date to output file names. (default: False)
 -C, --add-chr                Add 'chr' to chromosome names before printing output. (default: False)
 -K, --tmp-dir                Keep all temporary files into this folder. (default: None)
 -A, --keep-all               Try to keep all temporary files even if process does not terminate normally. (default: False)
 -L, --logger-file            Stores the arguments passed to the command into a file. (default: None)
 -W, --write-message-to-file  Store all message into a file. (default: None)
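
The note on --unique says that a record is skipped when its header has already been encountered, which is why ambiguous headers can silently drop sequences. The Python sketch below illustrates that first-seen-wins behaviour under that reading of the note. It is not gtftk's code, and unique_records is a hypothetical name.

```python
def unique_records(records):
    """Yield (header, sequence) pairs, skipping headers already seen."""
    seen = set()
    for header, seq in records:
        if header in seen:
            continue  # same header already written: drop this record
        seen.add(header)
        yield header, seq


# Two exons of G0003T001 from the example above. If the start coordinate
# is left out of the header, both records share the same header.
ambiguous = [("exon|G0003T001", "caagc"), ("exon|G0003T001", "taatt")]
unambiguous = [("exon|G0003T001|49", "caagc"), ("exon|G0003T001|56", "taatt")]

print(list(unique_records(ambiguous)))    # second exon is lost
print(list(unique_records(unambiguous)))  # both exons are kept
```

Including a coordinate such as start in the label list is one way to keep headers unambiguous when several features belong to the same transcript.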