Annotate Multiple Microbial Genomes with RASTtk - v1.073


Annotate or re-annotate bacterial or archaeal genomes and/or genome sets using RASTtk (Rapid Annotations using Subsystems Technology toolkit).

This KBase annotation App (Annotate Multiple Microbial Genomes) uses components from the RAST (Rapid Annotations using Subsystems Technology) toolkit [1,2,3] to annotate prokaryotic genomes, to update the annotations of genomes, or to perform computations on a set of genomes so that their annotations are consistent. The newly generated genomes will have the same names as the input genomes with .RAST appended.

The release versions of the RASTtk component services used in this app are:

kb_seed: tag 20200922
kmer_annotation_figfam: tag 20200922
genome_annotation: tag 20200922

The Annotate Multiple Microbial Genomes App takes genomes as input and allows users to annotate or re-annotate them. This makes the annotations consistent with other KBase genomes and prepares the genomes for further analysis by other KBase Apps, especially the Metabolic Modeling Apps. A Genome object can be generated by uploading a GenBank file, importing a GenBank file from NCBI via FTP, retrieving a Genome-typed object from KBase, or using the output of the Annotate Microbial Assembly App.

A Genome object can be imported or generated with one of the following annotation Apps or their multi-object versions:

The Default Annotation Pipeline
Clicking "Run" will run the default pipeline. For a typical 2-5 Mbp genome, the default annotation pipeline should take about 5 minutes. It is assumed that Genomes already have some annotation. As a result, the default behavior of this App is to use SEED to re-annotate just the protein-encoding genes. The default pipeline for this App consists of the following steps:

DNA/RNA-based predictions
1. Call rRNAs (default = off)
  Predict rRNAs in the genome. This is a custom BLAST-based tool for finding rRNAs.
2. Call tRNAs with tRNAscan (default = off)
  Predict tRNAs in the genome with tRNAscan-SE [6].
3. Call CRISPRs (default = off)
  This is a custom tool that uses a perl regular expression-based search to find CRISPR elements.
4. Find prophage elements with phispy (default = off)
  This will use the phispy program to find prophage elements [13].
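
As an illustration of the regular expression-based idea behind the CRISPR step, here is a minimal Python sketch. It is not the App's tool (which is written in Perl, and whose patterns are not given here). The function name, the repeat and spacer length bounds, and the minimum repeat count are all assumptions chosen for the example. It reports regions where an exact repeat recurs several times, separated by spacers.

```python
import re

def find_crispr_like_arrays(dna, repeat_len=(20, 40), spacer_len=(20, 50), min_repeats=3):
    """Return (start, end, repeat) tuples for regions where an exact repeat
    recurs at least `min_repeats` times, separated by spacers.
    All length bounds are illustrative assumptions, not RASTtk settings."""
    rmin, rmax = repeat_len
    smin, smax = spacer_len
    # Group 1 captures the candidate repeat; each further copy must follow
    # a spacer and match group 1 exactly (backreference \1).
    pattern = re.compile(
        r"([ACGT]{%d,%d})(?:[ACGT]{%d,%d}\1){%d,}"
        % (rmin, rmax, smin, smax, min_repeats - 1)
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(dna.upper())]
```

A backreference-based pattern like this only finds exact repeats and can be slow on whole genomes, so it should be read as a teaching aid rather than a practical CRISPR finder.
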
Gene predictions
1. Call protein-encoding genes with both Prodigal [11] and Glimmer3 [12] (default = off)
  These options will delete all existing genes in the genome object and replace them with the selected predictions.
2. Call selenoproteins and pyrrolysylproteins [7] (default = off)
  These are custom BLAST-based tools.
Repeats
1. Call SEED large repeat regions (default = off)
  This is a BLASTn search within the genome for regions that share greater than 95% nucleotide similarity and are at least 100 bp in length.
2. Find Streptococcus repeat regions [9, 10] (default = off)
  This option should only be used if the genus is Streptococcus.
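
The thresholds described for the large repeat search can be sketched as a filter over BLASTn tabular output. This is a hedged example, not the SEED implementation: it assumes the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), the function name is hypothetical, and skipping the trivial self-alignment is an assumption about how a genome-versus-itself search would be handled.

```python
def filter_repeat_hits(blast_tab_lines, min_identity=95.0, min_length=100):
    """Keep hits with identity > min_identity and alignment length >= min_length.
    Input lines are assumed to be 12-column BLAST tabular records."""
    hits = []
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        # Assumption: a region aligned to itself is not a repeat.
        if qseqid == sseqid and (qstart, qend) == (sstart, send):
            continue
        if pident > min_identity and length >= min_length:
            hits.append({
                "query": qseqid, "subject": sseqid,
                "identity": pident, "length": length,
                "query_range": (qstart, qend), "subject_range": (sstart, send),
            })
    return hits
```
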
Add SEED Functions/Annotation to protein-encoding genes (k-mers needed for Metabolic Modeling)
1. Annotate protein-encoding genes with k-mers (version 2; default = on)
  This is a set of signature k-mers (amino acid 8-mers) built from the annotations in the CoreSEED. The CoreSEED is a database of ~1,000 diverse microbial genomes and is currently the main focus of the RAST manual annotation efforts. Annotating using this k-mer set provides the user with our most stable and best estimate of the core gene functions.
2. Annotate remaining hypothetical proteins with k-mers (version 1; default = on)
  This set of k-mers is built from the FigFam collection [8] in the PubSEED, which is the publicly annotated version of the SEED database that consists of ~12,000 microbial genomes. The "classic" version of RAST on the RAST website (http://rast.nmpdr.org) uses the FigFam-based k-mers (hence the version 1 designation).
3. Annotate remaining hypothetical proteins by protein similarity (default = on)
  We have several non-redundant databases for the most common genera. If the genus name of your organism matches one of these, a search will be performed against the remaining hypothetical proteins to attempt to find a function. The search uses a combination of BLAT [4] and BLAST [5].
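
To make the idea of annotating with signature k-mers more concrete, here is a toy Python sketch. It is an assumption-laden simplification, not the RASTtk k-mer services: it treats an amino acid 8-mer as a "signature" if it occurs under exactly one function in a small reference set, and assigns a query protein the function with the most k-mer hits. The function names, the `min_hits` threshold, and the placeholder string are hypothetical.

```python
from collections import Counter

K = 8  # amino acid k-mer length, as described for the CoreSEED k-mers

def build_signature_kmers(annotated_proteins):
    """annotated_proteins: iterable of (protein_sequence, function) pairs.
    Returns {kmer: function} for k-mers seen with exactly one function."""
    kmer_functions = {}
    for seq, function in annotated_proteins:
        for i in range(len(seq) - K + 1):
            kmer_functions.setdefault(seq[i:i + K], set()).add(function)
    return {k: next(iter(fs)) for k, fs in kmer_functions.items() if len(fs) == 1}

def annotate_protein(seq, signature_kmers, min_hits=2, placeholder="hypothetical protein"):
    """Vote over signature k-mers found in seq; fall back to the placeholder."""
    votes = Counter(
        signature_kmers[seq[i:i + K]]
        for i in range(len(seq) - K + 1)
        if seq[i:i + K] in signature_kmers
    )
    if not votes:
        return placeholder
    function, hits = votes.most_common(1)[0]
    return function if hits >= min_hits else placeholder
```

In this sketch, proteins left as "hypothetical protein" after one k-mer set would be the ones passed on to the next step, mirroring the order of the three options above.
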
Other
1. Perform a basic gene overlap removal (default = off)
  Using multiple gene calling algorithms can result in overlapping gene calls. This program is a custom tool that attempts to minimize overlaps and gaps to provide a set of calls that has a smaller number of gene calling errors. We do not recommend using overlap removal if you are attempting to annotate phage.
2. Retain old annotations for hypotheticals (default = off)
  In instances where the pipeline fails to find an annotation for a gene, this will retain the original annotation from the input Genome-typed object.
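
The "Retain old annotations for hypotheticals" behavior can be pictured with a short Python sketch. The data layout (dictionaries mapping gene IDs to function strings), the function name, and the test for what counts as "no annotation found" are assumptions for illustration; the App's internal representation may differ.

```python
def retain_old_annotations(old_functions, new_functions, placeholder="hypothetical protein"):
    """Merge two {gene_id: function} maps: where the new annotation is empty or
    the placeholder, fall back to the original function if one exists."""
    merged = {}
    for gene_id, new_fn in new_functions.items():
        unannotated = not new_fn or new_fn.strip().lower() == placeholder
        if unannotated and old_functions.get(gene_id):
            merged[gene_id] = old_functions[gene_id]  # keep the input Genome's annotation
        else:
            merged[gene_id] = new_fn
    return merged
```
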

Advanced Annotation Options
If you wish to customize the features in your annotation, click the "show advanced options" link. This will display the full set of available annotation options.
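
For planning a customized run, the documented defaults can be written down as a simple Python mapping. The on/off values below are taken from the option list above, but the key names are hypothetical labels invented for this sketch; they are not the App's actual parameter IDs, which are defined in the App Specification.

```python
# Hypothetical option names; True/False values mirror the documented defaults.
DEFAULT_OPTIONS = {
    "call_rRNAs": False,
    "call_tRNAs_with_tRNAscan": False,
    "call_CRISPRs": False,
    "find_prophage_with_phispy": False,
    "call_protein_genes_prodigal_glimmer3": False,
    "call_selenoproteins_pyrrolysylproteins": False,
    "call_SEED_large_repeats": False,
    "find_streptococcus_repeats": False,
    "annotate_proteins_kmer_v2": True,
    "annotate_hypotheticals_kmer_v1": True,
    "annotate_hypotheticals_by_similarity": True,
    "remove_gene_overlaps": False,
    "retain_old_annotations_for_hypotheticals": False,
}

def with_overrides(**overrides):
    """Return a copy of the defaults with selected options changed.
    Unknown option names raise an error instead of being silently accepted."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError("Unknown option(s): " + ", ".join(sorted(unknown)))
    options = dict(DEFAULT_OPTIONS)
    options.update(overrides)
    return options
```

For example, `with_overrides(call_tRNAs_with_tRNAscan=True)` describes a run that adds tRNA calling to the default protein re-annotation.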

The Results

The Objects section has a table of all the data objects that were created by the App. Click on the name of the data object to open a data viewer cell (below the currently selected cell).
The Summary section gives details about the coding and noncoding features that were created and the average protein length.
The Files section has a downloadable version of the Summary.

GUI Output
The GUI output currently consists of three tabs. The "Overview" tab provides basic information on the annotation job, the "Browse Features" tab allows the user to scroll through the features that were called, and the "Browse Contigs" tab provides information on the contigs in the genome. Users can sort on the various types of features. Note that some features will overlap (e.g., "prophage" and "CDS").

Additional Information
For more information on the steps of the default RASTtk pipeline, please refer to our publication on this (publication forthcoming). For more detailed tutorial information, and to explore the additional functionality of RASTtk not currently available in the Narrative interface, please refer to http://tutorial.theseed.org.

Team members who developed and deployed the algorithm in KBase: Thomas Brettin, James Davis, Terry Disz, Robert Edwards, Chris Henry, Gary Olsen, Robert Olson, Ross Overbeek, Bruce Parrello, Gordon Pusch, Roman Sutormin, and Fangfang Xia. For questions, please contact us.

The authors of RAST request that if you use the results of this annotation in your work, please cite the first three listed publications:

Related Publications

[1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75, https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-75
[2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206-D214. doi:10.1093/nar/gkt1226, https://academic.oup.com/nar/article/42/D1/D206/1062536
[3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365, https://www.nature.com/articles/srep08365
[4] Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656-664. doi:10.1101/gr.229202, https://genome.cshlp.org/content/12/4/656
[5] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389
[6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955-964. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146525/
[7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793-803. doi:10.1007/s00792-012-0482-8, https://www.ncbi.nlm.nih.gov/pubmed/23015064
[8] Meyer F, Overbeek R, Rodriguez A. FIGfams: yet another set of protein families. Nucleic Acids Res. 2009;37: 6643-6654. doi:10.1093/nar/gkp698, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777423/
[9] van Belkum A, Sluijter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176-1179. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC228977/
[10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120, https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-120
[11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119
[12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673-679. doi:10.1093/bioinformatics/btm009, https://academic.oup.com/bioinformatics/article/23/6/673/419055
[13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406, https://academic.oup.com/nar/article/40/16/e126/1027055

App Specification:

https://github.com/kbaseapps/RAST_SDK/tree/7171090d87fccc8b7ecf1a1d02398995dcc2dd45/ui/narrative/methods/reannotate_microbial_genomes

Module Commit: 7171090d87fccc8b7ecf1a1d02398995dcc2dd45