Annotate Multiple Microbial Genomes


By: chenry, olson


Annotate or re-annotate bacterial or archaeal genomes and/or genome sets using RASTtk.

The KBase annotation app (Re-Annotate Multiple Microbial Genomes) uses components from the RAST (Rapid Annotations using Subsystems Technology) toolkit [1,2,3] to annotate prokaryotic genomes, to update the annotations of genomes, or to perform computations on a set of genomes so that they are consistent. The Annotate Multiple Microbial Genomes app, which takes annotated genomes as input, allows users to re-annotate those genomes so that the annotations are consistent with other KBase genomes and the imported genomes are ready for further analysis by other KBase apps. A Genome object can be generated by uploading a GenBank file, importing a GenBank file from NCBI via FTP, retrieving a Genome-typed object from KBase, or using the output of the Annotate Microbial Contigs app.

The Default Annotation Pipeline

For a typical 2-5 Mbp genome, the default annotation pipeline should take about 5 minutes. Note that by default this app re-annotates only the protein-encoding genes. The default pipeline for this app consists of the following steps:

  1. Annotate protein-encoding genes with k-mers (version 2)
    This is a set of signature k-mers (amino acid 8-mers) built from the annotations in the CoreSEED. The CoreSEED is a database of ~1,000 diverse microbial genomes and is currently the main focus of the RAST manual annotation efforts. Annotating using this k-mer set provides our most stable and best estimate of the core gene functions.
  2. Annotate remaining hypothetical proteins with k-mers (version 1)
    This set of k-mers is built from the FigFam collection in the PubSEED, which is the publicly annotated version of the SEED database and consists of ~12,000 microbial genomes. The "classic" version of RAST on the RAST website uses the FigFam-based k-mers (hence the version 1 designation).
  3. Find close neighbors and Annotate proteins similarity
    Annotates any remaining hypothetical proteins missed in steps 1 and 2 by searching against the genomes of close relatives. The search uses a combination of BLAT [4] and BLASTP [5].
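
As a rough illustration of how steps 1 and 2 fit together, the sketch below builds a table of signature amino acid 8-mers from annotated reference proteins and then assigns a function to a query protein by counting k-mer hits, falling back to a second k-mer set only when the first leaves the protein hypothetical. This is a minimal sketch, not the RASTtk implementation: the function names, the "exactly one function per k-mer" definition of a signature, and the `min_hits` threshold are all assumptions made for the example.

```python
from collections import Counter

K = 8  # amino acid 8-mers, as described in step 1
HYPOTHETICAL = "hypothetical protein"


def build_signature_kmers(annotated_proteins, k=K):
    """Map each k-mer to a function, keeping only k-mers seen with one function.

    annotated_proteins is an iterable of (protein_sequence, function) pairs.
    Treating a k-mer as a "signature" when it occurs with exactly one function
    is an assumption of this sketch.
    """
    kmer_functions = {}
    for sequence, function in annotated_proteins:
        for i in range(len(sequence) - k + 1):
            kmer_functions.setdefault(sequence[i:i + k], set()).add(function)
    return {
        kmer: next(iter(functions))
        for kmer, functions in kmer_functions.items()
        if len(functions) == 1
    }


def annotate(sequence, signature_kmers, k=K, min_hits=2):
    """Return the function with the most k-mer hits, or HYPOTHETICAL.

    min_hits is a hypothetical threshold chosen for illustration.
    """
    votes = Counter()
    for i in range(len(sequence) - k + 1):
        function = signature_kmers.get(sequence[i:i + k])
        if function is not None:
            votes[function] += 1
    if not votes:
        return HYPOTHETICAL
    function, hits = votes.most_common(1)[0]
    return function if hits >= min_hits else HYPOTHETICAL


def annotate_two_pass(sequence, primary_kmers, fallback_kmers):
    """Try the primary k-mer set first, then the fallback set (steps 1 and 2)."""
    function = annotate(sequence, primary_kmers)
    if function == HYPOTHETICAL:
        function = annotate(sequence, fallback_kmers)
    return function
```

In this sketch `primary_kmers` plays the role of the version 2 (CoreSEED) set and `fallback_kmers` the version 1 (FigFam) set; proteins that remain hypothetical after both passes would go on to the similarity search in step 3.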

Other Non-default Options

Note that most of the non-default parameters will re-call features. If you want to re-call rRNA or CDS features, we highly recommend turning on the Resolve overlapping features option so that you do not end up with duplicate feature calls.

  1. Call rRNAs (default = off)
    The RAST rRNA finder calls a custom script that uses a hand-curated and phylogenetically diverse set of representative sequences of the 23S (currently 81 representatives), 16S (currently 120 representatives) and 5S (currently 292 representatives) rRNAs. These sets represent the diversity of curated genomes in the SEED. The rRNAs of a new genome are found using a BLASTN [5] search against the curated set. tRNAs are found using an implementation of tRNAscan-SE [6]. This tool uses a secondary-structure-based search method to find the tRNA genes.
  2. Call selenoproteins (default = off)
    Selenoproteins are widespread among the sequenced bacterial and archaeal genomes. These proteins occur in ~25% of the genomes in the CoreSEED. Selenoproteins contain the rare amino acid selenocysteine, which is incorporated at an in-frame UGA stop codon [7]. To find these proteins, a hand-curated set of known selenoproteins is used. Potential selenoprotein matches prompt a search for the in-frame stop codon. When a stop codon is found, the respective protein is annotated as a selenoprotein. This is a custom BLAST-based tool.
  3. Call pyrrolysylproteins (default = off)
    Pyrrolysyl proteins are less common than selenoproteins among the currently sequenced genomes. They have been found to occur in ~1% of the sequenced bacterial and archaeal genomes in the CoreSEED. Similar to selenocysteine, pyrrolysine is incorporated at a UAG stop codon [7]. We search for pyrrolysyl proteins using the strategy described in the previous step. This is a custom BLAST-based tool.
  4. Call features repeat region SEED (default = off)
    Large repeat regions are often characteristic of horizontal gene transfer and are an indication of insertion sequences and other mobile elements. A custom script performs a BLASTN search of the genome against itself, and reports any region that occurs more than once with > 95% nucleotide identity. These precomputed repeat regions can then be used for comparative analyses and as supporting data for more detailed annotation of mobile elements.
  5. Call insertion sequences (default = off)
    The insertion sequence caller uses a reference set of end sequences and transposase proteins from the SEED [2] and ISfinder [8] databases to search the genome for IS elements. A combination of BLASTN for the end regions and BLASTX for the proteins [5] is used to find potential matches. It also looks for novel insertion sequences by searching for inverted repeats.
  6. Call features strep suis repeat and Call features strep pneumo repeat (default = conditional)
    Species in the Streptococcus genus have small interspersed repeats that may modulate gene expression. These repeats can be used for epidemiological typing [9]. RASTtk [3] includes a set of tools created by Croucher et al. [10] specifically designed for finding these elements. This is a conditional command that runs only if the genus is Streptococcus.
  7. Call CRISPR features (default = off)
    CRISPRs (clustered regularly interspaced short palindromic repeats) are a special type of repeat region found in many bacterial and archaeal genomes. This is a custom tool that uses a Perl regular expression-based search to find CRISPR elements.
  8. Call the protein-encoding genes with Prodigal and Glimmer3 (default = off)
    In addition to the protein-encoding gene caller provided by default, Prodigal and Glimmer gene callers are also available. Please refer to Prodigal [11] and Glimmer [12] for more info.
  9. Retain old annotations for hypotheticals (default = off)
    In instances where the pipeline fails to find an annotation for a gene, this option retains the original annotation from the input Genome-typed object.
  10. Resolve overlapping features (default = off)
    Using multiple gene calling algorithms (such as Prodigal [11] and Glimmer [12]) in addition to the default gene caller can result in overlapping gene calls. This program is a custom tool that attempts to minimize overlaps and gaps to provide a set of calls that has a smaller number of gene calling errors. We do not recommend using overlap removal if you are attempting to annotate phage.
  11. Call prophage features with PhiSpy (default = off)
    To find potential prophage elements we have implemented PhiSpy [13] in the annotation pipeline. PhiSpy uses heuristic methods to identify specific regions in the genome that may be derived from phages or other mobile elements.
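
The selenoprotein and pyrrolysyl protein options above (2 and 3) both rest on finding a stop codon that sits in frame inside a candidate coding region. The sketch below shows only that final check, assuming a candidate region has already been identified by similarity to a known protein and starts in the correct reading frame. The function name and its behavior are illustrative assumptions, not the custom BLAST-based tool itself.

```python
def in_frame_stop_positions(cds, stop_codon):
    """Return 0-based nucleotide offsets of in-frame copies of stop_codon.

    The final codon of the region is skipped, since that is where the normal
    terminating stop is expected. RNA input (U) is converted to DNA (T).
    Out-of-frame matches are ignored because only every third offset is checked.
    """
    cds = cds.upper().replace("U", "T")
    stop_codon = stop_codon.upper().replace("U", "T")
    # Start offset of the last complete codon in the region.
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return [
        i for i in range(0, last_codon_start, 3)
        if cds[i:i + 3] == stop_codon
    ]


# Selenocysteine candidates: in-frame UGA (TGA in DNA).
selenocysteine_sites = in_frame_stop_positions("ATGAAATGACCCTAA", "UGA")

# Pyrrolysine candidates: in-frame UAG (TAG in DNA).
pyrrolysine_sites = in_frame_stop_positions("ATGTAGGGGTAG", "UAG")
```

In the first example the region reads ATG AAA TGA CCC TAA, so the only in-frame TGA is at offset 6; the TGA that overlaps the start codon at offset 1 is out of frame and is not reported.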

Team members who developed & deployed algorithm in KBase: Thomas Brettin, James Davis, Terry Disz, Robert Edwards, Chris Henry, Gary Olsen, Robert Olson, Ross Overbeek, Bruce Parrello, Gordon Pusch, Roman Sutormin, Fangfang Xia. For questions, please go to

Related Publications

App Specification:

Module Commit: 50b012d9b41b71ba31b30355627cf85f2611bc3e