Annotate or re-annotate a bacterial or archaeal genome using RASTtk.
The KBase annotation apps (Annotate Microbial Contigs and Annotate Microbial Genome) use components from the RAST (Rapid Annotations using Subsystems Technology) toolkit [1,2,3] to annotate a prokaryotic genome, to update the annotations of a genome, or to perform computations on a set of genomes so that they are consistent. The Annotate Microbial Contigs app starts with unannotated microbial sequence in one or more contigs and runs it through an annotation pipeline. The Annotate Microbial Genome app takes an annotated genome as input and re-annotates it, which makes the annotations consistent with other KBase genomes and prepares the imported genome for further analysis by other KBase apps. A Genome object can be generated by uploading a GenBank file, importing a GenBank file from NCBI via FTP, retrieving a Genome-typed object from KBase, or using the output of the Annotate Microbial Contigs app.
The Default Annotation Pipeline
For a typical 2-5 Mbp genome, the default annotation pipeline should take about 5 minutes. Note that the default behavior of this app is to re-annotate only the protein-encoding genes. The default pipeline for this app consists of the following steps:
- Annotate protein-encoding genes with k-mers (version 2)
This is a set of signature k-mers (amino acid 8-mers) built from the annotations in the CoreSEED. The CoreSEED is a database of ~1,000 diverse microbial genomes and is currently the main focus of the RAST manual annotation efforts. Annotating using this k-mer set provides our most stable and best estimate of the core gene functions.
- Annotate remaining hypothetical proteins with k-mers (version 1)
This set of k-mers is built from the FigFam collection in the PubSEED, which is the publicly annotated version of the SEED database and consists of ~12,000 microbial genomes. The "classic" version of RAST on the RAST website (rast.nmpdr.org) uses the FigFam-based k-mers (hence the version 1 designation).
- Find close neighbors and Annotate proteins similarity
Annotates remaining hypothetical proteins that may have been missed by the two k-mer steps above by searching against the genomes of close relatives. The search uses a combination of BLAT [4] and BLASTP [5].
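The k-mer steps above can be illustrated with a toy sketch. It assumes a hypothetical table that maps signature amino acid 8-mers to function names and assigns the function with the most k-mer hits. The real RASTtk k-mer tables, thresholds and tie-breaking rules are not described in this document, so `SIGNATURE_KMERS`, `assign_function` and `min_hits` are illustrative names only.

```python
from collections import Counter

K = 8  # signature k-mers are amino acid 8-mers

# Hypothetical signature table: k-mer -> function (toy data, not from the CoreSEED).
SIGNATURE_KMERS = {
    "MKTAYIAK": "Toy function A",
    "KTAYIAKQ": "Toy function A",
    "GSHMLEDP": "Toy function B",
}


def assign_function(protein, table=SIGNATURE_KMERS, min_hits=2):
    """Return the function with the most signature k-mer hits.

    Falls back to "hypothetical protein" when no function reaches
    min_hits (an assumed threshold, not a RASTtk parameter).
    """
    hits = Counter()
    # Slide a window of K residues along the protein and look each k-mer up.
    for i in range(len(protein) - K + 1):
        function = table.get(protein[i:i + K])
        if function is not None:
            hits[function] += 1
    if not hits:
        return "hypothetical protein"
    function, count = hits.most_common(1)[0]
    return function if count >= min_hits else "hypothetical protein"
```

Proteins left as "hypothetical protein" by a first table could then be passed to a second table, which mirrors the order of the two k-mer steps.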
Other Non-default Options
Note that most of the non-default parameters will re-call features. If you want to re-call rRNA or CDS features, we highly recommend turning on the Resolve overlapping features option so that you do not end up with duplicate feature calls.
- Call rRNAs (default = off)
The RAST rRNA finder calls a custom script that uses a hand-curated and phylogenetically diverse set of representative sequences of the 23S (currently 81 representatives), 16S (currently 120 representatives) and 5S (currently 292 representatives) rRNAs. These sets represent the diversity of curated genomes in the SEED. The rRNAs of a new genome are found using a BLASTN [5] search against the curated set. tRNAs are found using an implementation of tRNAscan-SE [6], which uses a secondary structure-based search to find the tRNA genes.
- Call selenoproteins (default = off)
Selenoproteins are widespread among the sequenced bacterial and archaeal genomes; they occur in ~25% of the genomes in the CoreSEED. Selenoproteins contain the rare amino acid selenocysteine, which is incorporated at an in-frame UGA stop codon [7]. To find these proteins, a hand-curated set of known selenoproteins is used. Potential selenoprotein matches prompt a search for the in-frame stop codon. When a stop codon is found, the protein is annotated as a selenoprotein. This is a custom BLAST-based tool.
- Call pyrrolysylproteins (default = off)
Pyrrolysyl proteins are less common than selenoproteins among the currently sequenced genomes. They have been found to occur in ~1% of the sequenced bacterial and archaeal genomes in the CoreSEED. Similar to selenocysteine, pyrrolysine is incorporated at an in-frame UAG stop codon [7]. We search for pyrrolysyl proteins using the strategy described in the previous step. This is a custom BLAST-based tool.
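The in-frame stop codon check shared by the selenoprotein and pyrrolysyl protein callers can be sketched as follows. This covers only the codon scan; the BLAST comparison against the curated protein set that triggers the scan is not shown, and the function name `in_frame_stops` is hypothetical rather than part of RASTtk.

```python
def in_frame_stops(cds, stop_codon):
    """Return 0-based codon indices of internal, in-frame copies of stop_codon.

    cds is a coding sequence that starts at its start codon. RNA input
    (U) is accepted and converted to DNA (T) before comparison.
    """
    cds = cds.upper().replace("U", "T")
    stop_codon = stop_codon.upper().replace("U", "T")
    n_codons = len(cds) // 3
    # Skip the final codon, which is the gene's own terminator.
    return [
        i for i in range(n_codons - 1)
        if cds[3 * i:3 * i + 3] == stop_codon
    ]
```

Calling it with `"UGA"` corresponds to the selenocysteine case and with `"UAG"` to the pyrrolysine case; a stop codon that occurs out of frame is ignored.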
- Call features repeat region SEED (default = off)
Large repeat regions are often characteristic of horizontal gene transfer and are an indication of insertion sequences and other mobile elements. A custom script performs a BLASTN search of the genome against itself, and reports any region that occurs more than once with > 95% nucleotide identity. These precomputed repeat regions can then be used for comparative analyses and as supporting data for more detailed annotation of mobile elements.
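As a rough illustration of the filtering step, the sketch below reads BLASTN tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) from a genome-against-itself search and keeps hits above 95% identity. The `min_length` cutoff and the function name are assumptions, since the custom script's own parameters are not given here.

```python
def repeat_hits(blast_tab_lines, min_identity=95.0, min_length=100):
    """Filter self-BLASTN tabular hits down to candidate repeat regions.

    min_length is an assumed cutoff; the source only states the
    >95% nucleotide identity criterion.
    """
    hits = []
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = map(int, fields[6:10])
        # Every region trivially matches itself; skip that alignment.
        trivial = qseqid == sseqid and qstart == sstart and qend == send
        if trivial or pident <= min_identity or length < min_length:
            continue
        hits.append((qseqid, qstart, qend, sseqid, sstart, send, pident))
    return hits
```

A self-search reports each repeat pair twice (once in each direction), so a real pipeline would also need to merge reciprocal hits; that step is left out of this sketch.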
- Call insertion sequences (default = off)
The insertion sequence caller uses a reference set of end sequences and transposase proteins from the SEED [2] and ISfinder [8] databases to search the genome for IS elements. A combination of BLASTN for the end regions and BLASTX for the proteins [5] is used to find potential matches. It also looks for novel insertion sequences by searching for inverted repeats.
- Call features strep suis repeat and Call features strep pneumo repeat (default = conditional)
Species in the Streptococcus genus have small interspersed repeats that may modulate gene expression. These repeats can be used for epidemiological typing [9]. RASTtk [3] implements a set of tools created by Croucher et al. [10] specifically designed for finding these elements. This is a conditional command that is run only if the genus is Streptococcus.
- Call CRISPR features (default = off)
CRISPRs (clustered regularly interspaced short palindromic repeats) are a special type of repeat region found in many bacterial and archaeal genomes. This is a custom tool that uses a Perl regular expression-based search to find CRISPR elements.
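The idea of a regular expression-based CRISPR search can be sketched with a backreference: a candidate repeat is captured, and the pattern then requires it to recur after each spacer. This is a Python illustration, not the Perl expression used by the tool; the default repeat and spacer length ranges and the minimum repeat count are assumptions, and only exact repeat copies are matched.

```python
import re


def find_crispr_arrays(seq, repeat_len=(24, 47), spacer_len=(26, 50),
                       min_repeats=3):
    """Return (start, end, repeat) for candidate CRISPR arrays in seq.

    An array is an exact repeat that recurs at least min_repeats times,
    with each pair of copies separated by a spacer within spacer_len.
    All defaults are assumed values for illustration.
    """
    rmin, rmax = repeat_len
    smin, smax = spacer_len
    pattern = re.compile(
        r"(?P<repeat>[ACGT]{%d,%d})(?:[ACGT]{%d,%d}(?P=repeat)){%d,}"
        % (rmin, rmax, smin, smax, min_repeats - 1)
    )
    return [
        (m.start(), m.end(), m.group("repeat"))
        for m in pattern.finditer(seq.upper())
    ]
```

Backtracking patterns like this can be slow on whole genomes, so the sketch is best suited to short test sequences.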
- Call the protein-encoding genes with Prodigal and Glimmer3 (default = off)
In addition to the protein-encoding gene caller provided by default, the Prodigal and Glimmer gene callers are also available. Please refer to Prodigal [11] and Glimmer [12] for more information.
- Retain old annotations for hypotheticals (default = off)
In instances where the pipeline fails to find an annotation for a gene, this option retains the original annotation from the input Genome-typed object.
- Resolve overlapping features (default = off)
Using multiple gene-calling algorithms (such as Prodigal [11] and Glimmer [12]) in addition to the default gene caller can result in overlapping gene calls. This custom tool attempts to minimize overlaps and gaps to provide a set of calls with fewer gene-calling errors. We do not recommend using overlap removal if you are attempting to annotate phage.
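To make the problem concrete, here is a deliberately simple stand-in for overlap resolution: a greedy pass that keeps the highest-scoring calls and drops any call that overlaps an already kept one by more than a tolerance. It is not the RASTtk algorithm, which also tries to minimize gaps; the tuple layout, the score field and `max_overlap` are assumptions, and strand is ignored.

```python
def resolve_overlaps(calls, max_overlap=0):
    """Greedily keep the best-scoring calls that do not overlap too much.

    calls is a list of (start, end, score) tuples with 1-based inclusive
    coordinates. A call is dropped when it shares more than max_overlap
    bases with any call that has already been kept.
    """
    kept = []
    # Visit calls from highest to lowest score.
    for call in sorted(calls, key=lambda c: c[2], reverse=True):
        start, end, _ = call
        if all(
            min(end, k[1]) - max(start, k[0]) + 1 <= max_overlap
            for k in kept
        ):
            kept.append(call)
    return sorted(kept)
```

For example, given calls at 1-300 (score 50), 250-600 (score 80) and 601-900 (score 10), the default tolerance drops the 1-300 call, whereas `max_overlap=60` keeps all three.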
- Call prophage features with PhiSpy (default = off)
To find potential prophage elements, we have implemented PhiSpy [13] in the annotation pipeline. PhiSpy uses heuristic methods to identify specific regions in the genome that may be derived from phages or other mobile elements.
Team members who developed & deployed algorithm in KBase: Thomas Brettin, James Davis, Terry Disz, Robert Edwards, Chris Henry, Gary Olsen, Robert Olson, Ross Overbeek, Bruce Parrello, Gordon Pusch, Roman Sutormin, Fangfang Xia. For questions, please go to http://kbase.us/contact-us
- [1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-75
- [2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206-D214. doi:10.1093/nar/gkt1226. https://academic.oup.com/nar/article/42/D1/D206/1062536
- [3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365. https://www.nature.com/articles/srep08365
- [4] Kent WJ. BLAT: The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656-664. doi:10.1101/gr.229202. https://genome.cshlp.org/content/12/4/656
- [5] Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421
- [6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955-964. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146525/
- [7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793-803. doi:10.1007/s00792-012-0482-8. https://www.ncbi.nlm.nih.gov/pubmed/23015064
- [8] Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 2006;34: D32-D36. doi:10.1093/nar/gkj014. https://academic.oup.com/nar/article/34/suppl_1/D32/1132247
- [9] van Belkum A, Sluijuter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176-1179. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC228977/
- [10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-120
- [11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119
- [12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673-679. doi:10.1093/bioinformatics/btm009. https://academic.oup.com/bioinformatics/article/23/6/673/419055
- [13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406. https://academic.oup.com/nar/article/40/16/e126/1027055
Module Commit: 50b012d9b41b71ba31b30355627cf85f2611bc3e