Annotate Microbial Assembly with RASTtk - v1.073

Annotate a bacterial or archaeal assembly using RASTtk (Rapid Annotations using Subsystems Technology toolkit).

This KBase annotation App (Annotate Microbial Assembly) uses components from the RAST (Rapid Annotations using Subsystems Technology) toolkit [1,2,3] to annotate an assembled bacterial or archaeal genome.

The release versions of the RASTtk component services used in this app are:

kb_seed: tag 20200922
kmer_annotation_figfam: tag 20200922
genome_annotation: tag 20200922

The required input is an Assembly object (older Narratives used the term ContigSet object). An Assembly can be generated using any of the Assembly Apps, by importing a FASTA file, or by uploading a nucleotide FASTA (.fna) file directly from NCBI via FTP (Upload File to Staging from Web) and then importing it.

Assemblies have three essential metadata fields that must be completed: scientific name, domain, and genetic code. The default genetic code for bacterial and archaeal genomes is genetic code 11. KBase annotation also supports genetic code 4 for Mycoplasma and genetic code 25 (Candidate Division SR1 and Gracilibacteria). For more information on genetic codes, please refer to this NCBI document. All three fields are required because they set conditional parameters in several of the programs that the pipeline runs. Use existing scientific names whenever possible.
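
As an illustration of the three required fields, the check below is a minimal sketch in Python. The class name, the function name, and the spelled-out domain values are assumptions made for this example and are not KBase identifiers; only the genetic codes (11, 4, and 25) come from the text above.

```python
from dataclasses import dataclass
from typing import List

# Genetic codes named in the text above: 11 (default), 4, and 25.
SUPPORTED_GENETIC_CODES = {4, 11, 25}
# Hypothetical spelled-out domain values for a bacterial/archaeal annotation App.
SUPPORTED_DOMAINS = {"Bacteria", "Archaea"}


@dataclass
class AssemblyMetadata:
    """Hypothetical container for the three required metadata fields."""
    scientific_name: str
    domain: str
    genetic_code: int = 11  # default for bacterial and archaeal genomes


def validate_metadata(meta: AssemblyMetadata) -> List[str]:
    """Return a list of problems; an empty list means all three fields look usable."""
    problems = []
    if not meta.scientific_name.strip():
        problems.append("scientific name is empty")
    if meta.domain not in SUPPORTED_DOMAINS:
        problems.append(f"unsupported domain: {meta.domain!r}")
    if meta.genetic_code not in SUPPORTED_GENETIC_CODES:
        problems.append(f"unsupported genetic code: {meta.genetic_code}")
    return problems
```

For example, `validate_metadata(AssemblyMetadata("Escherichia coli K-12", "Bacteria"))` returns an empty list, because the genetic code falls back to the default of 11.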

The App annotates the Assembly-typed object (a set of contigs) and generates a Genome-typed object with both coding and non-coding features. Because Assembly objects contain only sequence and no annotation, nearly all of the available App options are selected by default. The available annotation features are in the advanced parameters and are discussed in more detail below.

For additional help, view this Tutorial for Annotate Microbial Contigs.

The Default Annotation Pipeline
Clicking "Run" will run the default pipeline. For a typical 2-5 Mbp genome, this should take about 5 minutes. Because this is the first annotation for this assembly, the default pipeline consists of the following steps:

DNA/RNA-based predictions
1. Call rRNAs (default = on)
  Predict rRNAs in the genome. This is a custom BLAST-based tool for finding rRNAs.
2. Call tRNAs with tRNAscan (default = on)
  Predict tRNAs in the genome with tRNAscan-SE [6].
3. Call CRISPRs (default = on)
  This is a custom tool that uses a Perl regular expression-based search to find CRISPR elements.
4. Find prophage elements with PhiSpy (default = off)
  This will use the PhiSpy program to find prophage elements [13].
Gene predictions
1. Call protein-encoding genes with both Prodigal [11] and Glimmer3 [12] (default = on)
2. Call selenoproteins and pyrrolysylproteins [7] (default = on)
  These are custom BLAST-based tools.
Repeats
1. Call SEED large repeat regions (default = on)
  This is a BLASTn search within the genome for regions of at least 100 bp that share greater than 95% nucleotide similarity.
2. Find Streptococcus repeat regions [9, 10] (default = on)
  This step should only be run if the genus is Streptococcus.
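
The thresholds for SEED large repeat regions can be made concrete with a small sketch. It assumes the BLASTn results are available in the standard 12-column tabular format (`-outfmt 6`); the function name is hypothetical, and this is not the RASTtk code itself.

```python
from typing import Iterable, List, Tuple

MIN_IDENTITY = 95.0  # percent identity; hits must be strictly greater than this
MIN_LENGTH = 100     # alignment length in bp; hits must be at least this long

# (query id, query start, query end, subject id, subject start, subject end)
Hit = Tuple[str, int, int, str, int, int]


def filter_repeat_hits(lines: Iterable[str]) -> List[Hit]:
    """Keep BLASTn tabular hits that satisfy the repeat-region thresholds."""
    hits: List[Hit] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        # A genome searched against itself always matches itself; drop that trivial hit.
        if qseqid == sseqid and qstart == sstart and qend == send:
            continue
        if pident > MIN_IDENTITY and length >= MIN_LENGTH:
            hits.append((qseqid, qstart, qend, sseqid, sstart, send))
    return hits
```

Under these assumptions, a 240 bp hit at 97.5% identity is kept, while a 60 bp hit at 99% identity and a 300 bp hit at 92% identity are both discarded.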
Add SEED Functions/Annotation to protein-encoding genes (k-mers needed for Metabolic Modeling)
1. Annotate protein-encoding genes with k-mers (version 2; default = on)
  This is a set of signature k-mers (amino acid 8-mers) built from the annotations in the CoreSEED. The CoreSEED is a database of ~1,000 diverse microbial genomes and is currently the main focus of the RAST manual annotation efforts. Annotating using this k-mer set provides the user with our most stable and best estimate of the core gene functions.
2. Annotate remaining hypothetical proteins with k-mers (version 1; default = on)
  This set of k-mers is built from the FigFam collection [8] in the PubSEED, which is the publicly annotated version of the SEED database that consists of ~12,000 microbial genomes. The "classic" version of RAST on the RAST website (http://rast.nmpdr.org) uses the FigFam-based k-mers (hence the version 1 designation).
3. Annotate remaining hypothetical proteins by protein similarity (default = on)
  We have several non-redundant databases for the most common genera. If the genus name of your organism matches one of these, the remaining hypothetical proteins are searched against that database in an attempt to assign a function. The search uses a combination of BLAT [4] and BLAST [5].
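
To illustrate the idea behind signature k-mers, here is a toy sketch; it is not the RASTtk implementation. It treats an amino acid 8-mer as a signature when it occurs in proteins of only one function, then assigns a query protein the function with the most signature hits. The function names, the `min_hits` threshold, and the example sequences are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Set

K = 8  # amino acid 8-mers, as described above


def build_signature_kmers(annotated: Dict[str, str]) -> Dict[str, str]:
    """Map each 8-mer to a function, keeping only 8-mers seen with a single function.

    `annotated` maps protein sequence -> function (a toy stand-in for a curated set).
    """
    seen: Dict[str, Set[str]] = {}
    for seq, function in annotated.items():
        for i in range(len(seq) - K + 1):
            seen.setdefault(seq[i:i + K], set()).add(function)
    return {kmer: next(iter(funcs)) for kmer, funcs in seen.items() if len(funcs) == 1}


def annotate(seq: str, signatures: Dict[str, str], min_hits: int = 2) -> Optional[str]:
    """Return the function with the most signature k-mer hits, or None if too few."""
    votes: Counter = Counter()
    for i in range(len(seq) - K + 1):
        function = signatures.get(seq[i:i + K])
        if function is not None:
            votes[function] += 1
    if not votes:
        return None
    function, hits = votes.most_common(1)[0]
    return function if hits >= min_hits else None


# Toy reference set: two made-up proteins with made-up function labels.
reference = {
    "MKTAYIAKQRQISFVKSHFSRQ": "Function A",
    "MSDNELQKLLEEAGYTVERLPG": "Function B",
}
signatures = build_signature_kmers(reference)
```

With this toy reference, `annotate("AAAKTAYIAKQRQISAAA", signatures)` returns `"Function A"`, and a query with no signature hits returns `None`, which corresponds to a protein that stays hypothetical and is passed on to the next step.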
Other
1. Perform a basic gene overlap removal (default = on)
  Using multiple gene-calling algorithms can result in overlapping gene calls. This program is a custom tool that attempts to minimize overlaps and gaps to provide a set of calls with fewer gene-calling errors. We do not recommend using overlap removal if you are attempting to annotate phage.
2. Retain old annotations for hypotheticals (default = off)
  In instances where the pipeline fails to find an annotation for a gene, this will retain the original annotation from the input Genome-typed object.
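
The overlap problem addressed by the gene overlap removal step can be pictured with a simple greedy sketch. This is a deliberately simplified stand-in, not the custom RASTtk tool: it only prefers longer calls and drops calls that overlap an already accepted one, and it ignores gaps entirely. The `GeneCall` type, the `max_overlap` parameter, and the coordinates are hypothetical.

```python
from typing import List, NamedTuple


class GeneCall(NamedTuple):
    start: int   # 0-based, inclusive
    end: int     # exclusive
    caller: str  # which gene caller produced this call


def overlap(a: GeneCall, b: GeneCall) -> int:
    """Number of positions shared by two calls (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def remove_overlaps(calls: List[GeneCall], max_overlap: int = 0) -> List[GeneCall]:
    """Greedily keep the longest calls whose overlap with kept calls is small enough."""
    kept: List[GeneCall] = []
    for call in sorted(calls, key=lambda c: c.end - c.start, reverse=True):
        if all(overlap(call, other) <= max_overlap for other in kept):
            kept.append(call)
    return sorted(kept, key=lambda c: c.start)


calls = [
    GeneCall(0, 900, "prodigal"),
    GeneCall(100, 400, "glimmer3"),   # nested inside the first call
    GeneCall(950, 1500, "glimmer3"),
]
resolved = remove_overlaps(calls)
```

Here `resolved` contains the 0-900 and 950-1500 calls. A rule like this would also discard genuinely overlapping genes, which is consistent with the advice above to turn overlap removal off when annotating phage.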

Advanced Annotation Options
If you wish to customize the features in your annotation, click the "show advanced options" link. This will display the full set of available annotation options. The "Call features prophage phispy" option is unchecked because it is slower.
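
The on/off defaults listed for the pipeline steps can be summarized as a small configuration sketch. The keys are hypothetical labels chosen for readability, not the App's actual parameter IDs; the True/False values follow the defaults stated in the step list above.

```python
from typing import Dict, List, Optional

# Hypothetical step labels; values mirror the documented defaults.
DEFAULT_STEPS: Dict[str, bool] = {
    "call_rRNAs": True,
    "call_tRNAs_trnascan": True,
    "call_CRISPRs": True,
    "find_prophage_phispy": False,  # off by default because it is slower
    "call_genes_prodigal_glimmer3": True,
    "call_selenoproteins_pyrrolysylproteins": True,
    "call_SEED_repeat_regions": True,
    "find_streptococcus_repeats": True,
    "annotate_kmer_v2": True,
    "annotate_hypotheticals_kmer_v1": True,
    "annotate_hypotheticals_similarity": True,
    "remove_gene_overlaps": True,
    "retain_old_annotations_for_hypotheticals": False,
}


def enabled_steps(overrides: Optional[Dict[str, bool]] = None) -> List[str]:
    """Return the steps that would run after applying user overrides to the defaults."""
    steps = dict(DEFAULT_STEPS)
    for name, value in (overrides or {}).items():
        if name not in steps:
            raise KeyError(f"unknown step: {name}")
        steps[name] = value
    return [name for name, on in steps.items() if on]
```

For a phage-oriented run, for instance, one might call `enabled_steps({"remove_gene_overlaps": False, "find_prophage_phispy": True})`, which corresponds to unchecking overlap removal and checking the PhiSpy option in the advanced options.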

The Results

The Objects section has a table of all the data objects that were created by the App. Click on the name of the data object to open a data viewer cell (below the currently selected cell).
The Summary section gives details about the coding and noncoding features that were created and the average protein length.

GUI Output
The GUI output currently consists of three tabs. The "Overview" tab provides basic information on the annotation job, the "Browse Features" tab allows the user to scroll through the features that were called, and the "Browse Contigs" tab provides information on the contigs in the genome. Users can sort on the various types of features. Note that some features will overlap (e.g., "prophage" and "CDS").

Additional Information
For more information on the steps of the default RASTtk pipeline, please refer to our publication on this (publication forthcoming). For more detailed tutorial information and to explore the additional functionality of RASTtk not currently available in the Narrative interface, please refer to http://tutorial.theseed.org.

Team members who developed and deployed the algorithm in KBase: Thomas Brettin, James Davis, Terry Disz, Robert Edwards, Chris Henry, Gary Olsen, Robert Olson, Ross Overbeek, Bruce Parrello, Gordon Pusch, Roman Sutormin, and Fangfang Xia. For questions, please contact us.

The authors of RAST request that you cite the first three publications listed below if you use the results of this annotation in your work:

Related Publications

[1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75 , https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-75
[2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206-D214. doi:10.1093/nar/gkt1226 , https://academic.oup.com/nar/article/42/D1/D206/1062536
[3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365 , https://www.nature.com/articles/srep08365
[4] Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656-664. doi:10.1101/gr.229202 , https://genome.cshlp.org/content/12/4/656
[5] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389 , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146917/
[6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955-964. , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146525/
[7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793-803. doi:10.1007/s00792-012-0482-8 , https://www.ncbi.nlm.nih.gov/pubmed/23015064
[8] Meyer F, Overbeek R, Rodriguez A. FIGfams: yet another set of protein families. Nucleic Acids Res. 2009;37: 6643-6654. doi:10.1093/nar/gkp698 , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777423/
[9] van Belkum A, Sluijter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176-1179. , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC228977/
[10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120 , https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-120
[11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119 , https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119
[12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673-679. doi:10.1093/bioinformatics/btm009 , https://academic.oup.com/bioinformatics/article/23/6/673/419055
[13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406 , https://academic.oup.com/nar/article/40/16/e126/1027055

App Specification:

https://github.com/kbaseapps/RAST_SDK/tree/7171090d87fccc8b7ecf1a1d02398995dcc2dd45/ui/narrative/methods/annotate_contigset

Module Commit: 7171090d87fccc8b7ecf1a1d02398995dcc2dd45