In KBase, you can assemble microbial Next-Generation Sequencing (NGS) short reads into contigs and then run an automatic annotation pipeline on the assembled contigs. The pipeline calls genes and other genomic features and assigns biological functions, producing an annotated Genome object that can be used in other analyses. The Assembly & Annotation tutorial and the interactive Narrative tutorial are good ways to learn about this functionality.
KBase provides pipelines for assembling microbial Next-Generation Sequencing (NGS) short reads and generating annotated genomes from these assemblies. The starting point for assembly in KBase is a set of single- or paired-end reads. KBase supports the upload of read libraries generated by a variety of sequencing technologies, including Illumina, PacBio CLR, PacBio CCS, IonTorrent, and Oxford Nanopore. You can upload reads files from your computer or from an online source (FTP, HTTP, Dropbox, or Box), or transfer microbial reads from the Joint Genome Institute.
KBase currently integrates 11 different genome assembly apps. Ten of these apps are simple wrappers around existing assembly algorithms: A5, A6, IDBA-UD, Kiki, MaSuRCA, MEGAHIT, MiniASM, Ray, SPAdes, and Velvet. Please see the table at the bottom of this page for more information about each of these assembly apps.
The Assemble Contigs from Reads app lets users compare the output quality of several different assembly programs. Users select from several assembly “recipes” that combine multiple assemblers with other utilities to produce the best available assembly. The default option is the “Automatic Assembly” recipe, which runs three different assemblers (Velvet, SPAdes, and IDBA-UD), uses BayesHammer for error correction, and uses an ARAST quality score to choose an assembly that is suitable for most downstream analyses. The second recipe is the “Fast Pipeline”, which is similar to the “Automatic Assembly” recipe except that it uses the A6 assembler instead of IDBA-UD. Lastly, the “Smart Pipeline” recipe runs the same assemblers as the Automatic Assembly recipe, using KmerGenie to select the best k-mer length for assembly and BayesHammer for error correction. However, instead of using the ARAST quality score to select the best assembly, this recipe uses ALE scores to sort the assemblies by quality and then merges the two best assemblies into a single assembly using GAM-NGS.
The Assemble Contigs from Reads app produces an Assembly object in addition to an Assembly Report (see screenshot at right) that provides statistics about the assembly job. The Assembly Report includes information about the performance of each assembly algorithm that was tested. For example, the N50 row shows the contig length at which the contigs of that length or longer together contain at least half the total length of the assembly. This number functions as a length-weighted median and can be used as a measure of assembly quality: a larger N50 generally means that less of the assembly is made up of short contigs, so the assembly is more contiguous. The Automatic Assembly recipe determines which assembly to use for annotation by comparing the ARAST quality score for each assembly. In general, the assembly with the fewest contigs, the largest N50 value, and the longest contig will have the best ARAST quality score. Based on the summary statistics in the Assembly Report, a user may want to reassemble their reads with a specific assembler that performed well in the Automatic Assembly recipe.
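The N50 statistic described above can be illustrated with a short sketch. This is not KBase or ARAST code; the function names `n50` and `assembly_summary` are hypothetical, and the sketch assumes you already have a list of contig lengths in base pairs. It uses the common convention that N50 is the length of the contig at which the running total, taken from the longest contig downward, first reaches at least half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp).

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


def assembly_summary(contig_lengths):
    """Collect the simple statistics mentioned in the text (hypothetical helper)."""
    lengths = list(contig_lengths)
    return {
        "num_contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "largest_contig": max(lengths),
    }


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    example = [80, 70, 50, 40, 30, 20, 10]
    print(assembly_summary(example))
```

In this made-up example the total length is 300, and the two longest contigs (80 + 70) already cover 150, so the N50 is 70. Comparing such summaries across assemblers is one way to read the Assembly Report, although the actual ARAST quality score may weigh these values differently.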
KBase’s annotation pipeline assigns biological functions derived from RAST (Rapid Annotations using Subsystems Technology). The resulting annotated Genome can be exported in GenBank or FASTA format or used as input for other KBase apps. Input to the annotation apps can be either an Assembly (assembled contigs) generated by one of the assembly apps, or an already-annotated genome (for example, from GenBank) that you want to reannotate in KBase for downstream analyses such as metabolic modeling. Each app applies two different algorithms, Prodigal and Glimmer3, to predict gene locations within the contigs. The gene sequences are then passed to the functional annotation pipeline, which is based on the RAST toolkit.
The output of the annotation apps is an annotated Genome object, which is displayed in a tabular genome viewer (see below) that shows information about the Genome as well as a list of contigs and the genes that were called on each contig. From this table, you can bring up a landing page for the Genome with additional information about data provenance, publications related to the Genome, and biological information derived from the assembly and annotation process. In the Genes tab, you can explore the different biological functions mapped to the Genome. To see more details about an entry under the Contigs and Genes tabs, you can open an expanded view of it: