Assembly and Annotation in KBase

Assemble & Annotate

In KBase, you can quickly assemble microbial Next-Generation Sequencing (NGS) short reads into contigs and then run an automatic annotation pipeline on the assembled contigs. The pipeline calls genes and other genomic features and assigns biological functions, producing an annotated Genome object that can be used in other analyses. The Assembly & Annotation tutorial and the interactive Narrative tutorial are good ways to learn about this functionality.


KBase provides pipelines for assembling microbial Next-Generation Sequencing (NGS) short reads and generating annotated genomes from these assemblies. The starting point for assembly in KBase is a set of single- or paired-end reads. KBase now supports the upload of read libraries generated from a variety of sequencing technologies, including Illumina, PacBio CLR, PacBio CCS, IonTorrent, and Oxford Nanopore. You can upload read files from your computer or an online site (FTP, HTTP, Dropbox, or Box), or transfer microbial reads from the Joint Genome Institute.

KBase currently integrates 11 different genome assembly apps. Ten of these apps are simple wrappers around existing assemblers: A5, A6, IDBA-UD, Kiki, MaSuRCA, MEGAHIT, MiniASM, Ray, SPAdes, and Velvet. The eleventh, Assemble Contigs from Reads, is described next. Please see the Assembly Apps list at the bottom of this page for more information about each of these apps.

The Assemble Contigs from Reads app lets users compare the quality of outputs from several different assembly programs. Users select one of several assembly “recipes”, each of which combines multiple assemblers with other utilities to produce the best assembly it can. The default option is the “Automatic Assembly” recipe, which runs three different assemblers (Velvet, SPAdes, and IDBA-UD), uses BayesHammer for error correction, and chooses an assembly that is suitable for most downstream analyses based on an ARAST quality score. The second recipe is the “Fast Pipeline”, which is similar to the Automatic Assembly recipe except that it uses the A6 assembler instead of IDBA-UD. Lastly, the “Smart Pipeline” recipe runs the same assemblers as the Automatic Assembly recipe, using KmerGenie to select the best k-mer length for assembly and BayesHammer for error correction. However, instead of using the ARAST quality score to select the best assembly, this recipe uses ALE scores to sort the assemblies by quality, and then merges the two best assemblies into a single assembly using GAM-NGS.
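The three recipes can be summarized side by side as plain data. The sketch below is only a reading aid, not KBase's actual configuration format: the `Recipe` class, its field names, and the `recipes_using` helper are hypothetical, and the Fast Pipeline's error-correction and ranking entries are assumed to match the Automatic Assembly recipe because the text describes it only as "similar".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Recipe:
    """Hypothetical summary of one assembly recipe described in the text."""
    assemblers: Tuple[str, ...]
    error_correction: Optional[str]
    kmer_selection: Optional[str]
    ranking: str
    merge_tool: Optional[str]


RECIPES = {
    "Automatic Assembly": Recipe(
        assemblers=("Velvet", "SPAdes", "IDBA-UD"),
        error_correction="BayesHammer",
        kmer_selection=None,
        ranking="ARAST quality score",
        merge_tool=None,
    ),
    # Assumption: everything except the assembler list is inherited from
    # the Automatic Assembly recipe ("similar to ... except").
    "Fast Pipeline": Recipe(
        assemblers=("Velvet", "SPAdes", "A6"),
        error_correction="BayesHammer",
        kmer_selection=None,
        ranking="ARAST quality score",
        merge_tool=None,
    ),
    "Smart Pipeline": Recipe(
        assemblers=("Velvet", "SPAdes", "IDBA-UD"),
        error_correction="BayesHammer",
        kmer_selection="KmerGenie",
        ranking="ALE score",
        merge_tool="GAM-NGS",  # merges the two best-ranked assemblies
    ),
}


def recipes_using(assembler):
    """Return the names of recipes that run the given assembler, sorted."""
    return sorted(name for name, r in RECIPES.items() if assembler in r.assemblers)
```

For example, `recipes_using("IDBA-UD")` lists the two recipes that include IDBA-UD, which makes the single difference between the Automatic Assembly recipe and the Fast Pipeline easy to see.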

[Screenshot: Assembly Report]

The Assemble Contigs from Reads app produces an Assembly object in addition to an Assembly Report (see screenshot at right) that provides statistics about the assembly job. The Assembly Report includes information about the performance of each assembly algorithm that was tested. For example, the N50 row shows the contig length at which the contigs of that length or longer together contain half the total length of the assembly. This number functions as a length-weighted median and can be used as a measure of assembly quality: a larger N50 generally indicates a more contiguous assembly, because less of the assembly is made up of short contigs. The Automatic Assembly recipe determines which assembly to use for annotation by comparing the ARAST quality score for each assembly. In general, the assembly with the fewest contigs, the largest N50 value, and the longest contig will have the best ARAST quality score. Based on the summary statistics in the Assembly Report, a user may want to reassemble their reads with a specific assembler that performed well under the Automatic Assembly recipe.
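To make the N50 definition concrete, here is a minimal sketch that computes it from a list of contig lengths, along with the other summary statistics mentioned above. The function names and the toy contig lengths are illustrative assumptions; this is not the code KBase or ARAST uses, and it does not attempt to reproduce the ARAST quality score.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def summarize(contig_lengths):
    """Collect the statistics the text mentions for comparing assemblies."""
    return {
        "num_contigs": len(contig_lengths),
        "n50": n50(contig_lengths),
        "largest_contig": max(contig_lengths),
        "total_length": sum(contig_lengths),
    }


# Toy contig lengths, made up for illustration (not real KBase output).
example = [80, 70, 50, 40, 30, 20, 10]
stats = summarize(example)
```

In this toy case the total length is 300, and the two longest contigs (80 and 70) together reach 150, exactly half of the total, so the N50 is 70.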


KBase’s annotation pipeline assigns biological functions derived from RAST (Rapid Annotations using Subsystems Technology). The resulting annotated Genome can be exported in GenBank or FASTA formats or used as input for other KBase apps. Input to the annotation apps can be either the Assembly (assembled contigs) generated by one of the assembly apps, or an already-annotated genome (for example, from GenBank) that you want to reannotate in KBase in order to do downstream analyses such as metabolic modeling. Each app takes the input and applies two different algorithms, Prodigal and Glimmer3, to predict gene locations within the contigs. Next, the gene sequences are passed into the functional annotation pipeline, which is based on the RAST toolkit.

The output of the annotation apps is an annotated Genome object, which is displayed in a tabular genome viewer that shows information about the Genome as well as a list of contigs and the genes that were called on each contig. From this table, you can bring up a landing page for the Genome with additional information about data provenance, publications related to the Genome, and biological information derived from the assembly and annotation process. In the Genes tab, you can explore the different biological functions mapped to the Genome. To see more details about an entry under the Contigs and Genes tabs, you can open an expanded view of it.



Assembly Apps

  • Assemble Contigs from Reads – runs several different assembly programs and lets users compare the quality of outputs (see above for more information).
  • Assemble with A5 – A5-miseq is good for high-quality microbial genome assembly and does so without the need for parameter tuning on the part of the user. It is an integrated meta-assembly pipeline that cleans reads, performs error correction, assembles contigs, performs scaffolding and then performs misassembly correction before constructing the final scaffold.
  • Assemble with A6 – A6 is an Argonne-modified version of the original A5 microbial assembly pipeline. A6’s modifications over A5 include a bug fix in detecting Phred64 quality coding and replacing IDBA with IDBA-UD for improved assembly accuracy and stability.
  • Assemble with IDBA-UD – IDBA-UD is an iterative graph-based assembler for single-cell and standard short read data and is good for data of highly uneven sequencing depth. This assembler uses an iterative approach for selecting k-mer size that compensates for the information loss associated with single k-mer based de Bruijn graphs, making IDBA-UD one of the more accurate microbial assemblers.
  • Assemble with Kiki – Kiki is a fast, parallel microbial and metagenomic assembler that uses a hybrid of the overlap-layout-consensus strategy and greedy contig extension. Compared to de Bruijn graph-based methods, this approach loses less information because it does not need to chop reads into shorter k-mers.
  • Assemble with MaSuRCA – MaSuRCA is a short read assembler that combines the benefits of the de Bruijn graph and overlap-layout-consensus assembly approaches. The main concept is the creation of super-reads that contain the sequence information present in the original reads; these super-reads are then extended in both directions using an efficient k-mer lookup table. MaSuRCA is one of a smaller set of assemblers biologists use for eukaryotic assembly.
  • Assemble with MEGAHIT – MEGAHIT is a single-node assembler for large and complex metagenomics NGS reads. It makes use of a succinct de Bruijn graph (SdBG) to achieve low-memory assembly, making it fast and especially suitable for assembly of small metagenomes, metatranscriptomes or low-coverage data in general.
  • Assemble with MiniASM – MiniASM is an ultra-fast overlap-layout-consensus based de novo assembler for noisy long reads. It has been shown to assemble ~50X microbial PacBio reads into a draft assembly of a small number of contigs in a matter of minutes. MiniASM derives this performance from a locality-sensitive hashing based overlapper implemented in minimap.
  • Assemble with Ray – Ray is a parallel, graph-based microbial and metagenomic assembler. Ray improves on the standard de Bruijn graph-based algorithm in three ways: it continues contig-building at the unitigs using greedy heuristics to extend paths, it keeps track of the reads from which the k-mers came and of the read pairs from paired-end reads, and it uses a repeat removal algorithm inspired by SPAdes.
  • Assemble with SPAdes – SPAdes is a single-cell and standard assembler based on paired de Bruijn graphs, considered to be one of the most accurate microbial assemblers. SPAdes employs a multisized de Bruijn graph, detects and removes bubbles and chimeric reads, estimates insert distance from paired k-mers, and computes contigs from the paired assembly graph.
  • Assemble with Velvet – Velvet is a classic de Bruijn graph-based assembler that works by efficiently manipulating de Bruijn graphs through simplification and compression. It eliminates errors and resolves repeats by first using an error correction algorithm that merges sequences together. Repeats are then removed from the sequence via the repeat solver, which separates paths that share local overlaps.
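Several of the assemblers above are described as de Bruijn graph based, which involves chopping reads into k-mers. The following is a toy sketch of that idea only, assuming error-free reads on a single strand: each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The function name and the example reads are made up for illustration, and none of the listed assemblers is implemented this simply (real tools also handle sequencing errors, reverse complements, coverage, and graph simplification).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer found in a read adds an edge from
    its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Duplicate edges are collapsed, so coverage is not recorded.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        # Reads shorter than k yield no k-mers and are skipped implicitly.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (illustrative only).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
```

Here the two reads overlap by four bases, and following the edges from `AC` through `CG`, `GT`, and `TC` to `CA` spells out the merged sequence ACGTCA. Choosing k is a trade-off that the descriptions above allude to: shorter k-mers tolerate lower coverage but lose more of the long-range information in each read, which is why IDBA-UD iterates over several k values and Kiki avoids k-mers altogether.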

Assembly and Annotation Resources in KBase

Annotation Apps