Generated March 31, 2022

Genome Extraction from Shotgun Metagenome Sequence Data

This Tutorial will guide the user through the process of obtaining high-quality genomes and phylogenetic placement from a metagenome assembly.

Authors: Dylan Chivian (DCChivian@lbl.gov), Mikayla Clark (clarkmm1@ornl.gov), and Sean Jungbluth (sjungbluth@lbl.gov)

Workflow steps: Read Hygiene, Classify Taxonomy, Assemble, Compare Contigs, Bin Contigs, Optimize Bins, Bin Quality Assessment, Extract Individual Assemblies, Annotate Genomes, Classify Taxonomy, Find Relatives, Functional Profiling

Overview

KBase has powerful tools for metabolic modeling and comparative phylogenomics of microbial genomes that can be used for developing mechanistic understanding of functional interactions between species in microbial ecosystems. Essential to this process is obtaining high-quality genomes to annotate, either via cultivation or genome extraction from metagenome assembly. KBase has a suite of microbiome analysis Apps meant to be used in concert. After assembly and binning, high-quality bins are annotated and can then be used in Comparative Phylogenomics analyses (see Narrative here) and Metabolic Reconstruction and Community Interaction Modeling (see Narrative here).

Below we present the processing of two related Compost Enrichment Metagenomes (37A & 37B) [Ionic Liquids Impact the Bioenergy Feedstock-Degrading Microbiome and Transcription of Enzymes Relevant to Polysaccharide Hydrolysis] from the Joint BioEnergy Institute (JBEI).

Main Lessons from this Narrative

  • Learn how to perform Quality Control of read libraries
  • Learn how to measure taxonomic population structure of environmental shotgun reads
  • Learn how to assemble metagenomes
  • Learn how to compare assembly quality
  • Learn how to bin metagenomic scaffolds into putative lineages (metagenome-assembled genomes, or MAGs)
  • Learn how to assess MAG quality and extract to individual genome assembly objects
  • Learn how to annotate genes to obtain KBase genomes which can be used by KBase analysis Apps
  • Learn how to place MAGs into reference species tree

A Word on Timing

Although many KBase Apps process data quickly, large data sets still take time to analyze. Queue times for metagenomic data sets can be lengthy during periods of high server traffic. Once a job leaves the queue, complex algorithms typically finish in hours to days.

As an example, the table below displays the queue time, run time, and average run time for a selection of Apps used in this tutorial.

Note: Queue and run times are from within this Narrative while the average run time is calculated from all jobs run across KBase with that particular app.

App                           Queue Time   Run Time   Average Run Time
Kaiju                         1s           1h 37m     1h 3m
Assemble with metaSPAdes      11h 8m       7h 13m     9h
Assemble Reads with MEGAHIT   1s           6h 2m      2h 58m
Assemble with IDBA-UD         3h 19m       7h 41m     4h 41m
MaxBin2 Contig Binning        2s           4h 19m     1h 13m

Description of Apps

  • FastQC 2 allows users to check the quality of raw sequence data generated by high throughput sequencing pipelines. The results legends differentiate among normal (green tick), slightly abnormal (orange exclamation), and very unusual (red cross) reads.
  • Trimmomatic 3 performs a variety of useful trimming tasks for paired- or single-end Illumina reads improving the overall quality of the data.
  • Kaiju 4,5 generates fast and sensitive taxonomic classification for metagenomic reads by comparing sequences to databases of known microbial proteins. It also generates an interactive metagenomic visualization chart.
  • metaSPAdes 6 assembles metagenomic reads using the SPAdes assembler.
  • MEGAHIT 7 assembles metagenomic reads using the MEGAHIT assembler.
  • IDBA-UD 8 assembles paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
  • Compare Assembled Contig Distributions allows the user to view distributions of contig characteristics for different assembly runs.
  • MaxBin2 Contig Binning 9,10 uses nucleotide composition information, source strain abundance, and phylogenetic marker genes to perform binning through an Expectation-Maximization algorithm.
  • Assess Genome Quality with CheckM 11 provides a set of tools for assessing the quality of genomes or metagenomes. It also generates estimates of genome completeness and contamination, and allows the user to filter the bins by these quality scores.
  • Extract Bins as Assemblies from BinnedContigs extracts bins from a BinnedContig dataset as Assembly objects.
  • Annotate Microbial Assembly with RASTtk 12 annotates a bacterial or archaeal assembly using the RAST (Rapid Annotations using Subsystems Technology) pipeline.
  • Build GenomeSet allows the user to group Genomes into a GenomeSet.
  • GTDB-Tk Classify 13 provides a taxonomic placement of the Genomes into the GTDB protein phylogenetic marker derived Species Tree.
  • Insert Set of Genomes into Species Tree 14 constructs a phylogenetic tree combining the GenomeSet provided by the user with a set of closely related genomes from the KBase list of species.
  • Annotate and Distill Assemblies with DRAM 15 identifies and summarizes functional markers in Genomes to assess pathway completeness. (Note: This app is in Beta, and therefore, the apps panel must be put into Beta rather than Released to add it to the narrative.)

1. Read Hygiene

We begin by importing the sets of paired-end reads in FASTQ format for metagenomes 37A and 37B. The import App creates a PairedEndLibrary object that we can then run through FastQC and Trimmomatic to determine and improve the quality of the reads, respectively.

Note: Running FastQC a second time after the reads had been run through Trimmomatic showed marked improvement in their quality. For this reason, the trimmed reads will be used for further analysis.
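
The trimmed library names created in this section (for example 37A_Trimm_headcrop5_crop140) suggest that the first 5 bases of each read were removed and that reads were capped at 140 bases. The sketch below illustrates those two operations on a single read. It is only an illustration: the function names are hypothetical, the order of the two steps is an assumption, and the App itself runs Trimmomatic rather than this code.

```python
from typing import Tuple


def head_crop(seq: str, qual: str, n: int) -> Tuple[str, str]:
    """Remove the first n bases and their quality characters."""
    return seq[n:], qual[n:]


def crop(seq: str, qual: str, length: int) -> Tuple[str, str]:
    """Keep at most `length` bases, counted from the start of the read."""
    return seq[:length], qual[:length]


def trim_read(seq: str, qual: str, head: int = 5, max_len: int = 140) -> Tuple[str, str]:
    """Apply a head crop and then a length cap (order assumed for illustration)."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    seq, qual = head_crop(seq, qual, head)
    return crop(seq, qual, max_len)


if __name__ == "__main__":
    read = "ACGTN" * 30          # hypothetical 150-base read
    quality = "I" * len(read)    # hypothetical uniform quality string
    trimmed_seq, trimmed_qual = trim_read(read, quality)
    print(len(trimmed_seq), len(trimmed_qual))
```

With these assumed settings, a 150-base read loses its first 5 bases and is then cut to 140 bases; reads already shorter than the cap are only head-cropped.
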
Import a FASTQ/SRA file into your Narrative as a Reads data object
This app completed without errors in 41m 45s.
Objects
Created Object Name Type Description
37A_6437.3.44325.CTTGTA.adnq.fastq.gz_reads PairedEndLibrary Imported Reads
Links
Import a FASTQ/SRA file into your Narrative as a Reads data object
This app completed without errors in 36m 44s.
Objects
Created Object Name Type Description
37B_6385.3.43508.GATCAG.adnq.fastq.gz_reads PairedEndLibrary Imported Reads
Links
A quality control application for high throughput sequence data.
This app completed without errors in 57m 2s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • 37A_6437.3.44325.CTTGTA.adnq.fastq.gz_reads_33233_8_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • 37A_6437.3.44325.CTTGTA.adnq.fastq.gz_reads_33233_8_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 41m 52s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • 37B_6385.3.43508.GATCAG.adnq.fastq.gz_reads_33233_4_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • 37B_6385.3.43508.GATCAG.adnq.fastq.gz_reads_33233_4_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
Trim paired- or single-end Illumina reads with Trimmomatic.
This app completed without errors in 1h 32m 6s.
Objects
Created Object Name Type Description
37A_Trimm_headcrop5_crop140.PElib_paired PairedEndLibrary Trimmed Reads
37A_Trimm_headcrop5_crop140.PElib_unpaired_fwd SingleEndLibrary Trimmed Unpaired Forward Reads
37A_Trimm_headcrop5_crop140.PElib_unpaired_rev SingleEndLibrary Trimmed Unpaired Reverse Reads
Trim paired- or single-end Illumina reads with Trimmomatic.
This app completed without errors in 1h 2m 37s.
Objects
Created Object Name Type Description
37B_Trimm_headcrop5_crop140.PELib_paired PairedEndLibrary Trimmed Reads
37B_Trimm_headcrop5_crop140.PELib_unpaired_fwd SingleEndLibrary Trimmed Unpaired Forward Reads
37B_Trimm_headcrop5_crop140.PELib_unpaired_rev SingleEndLibrary Trimmed Unpaired Reverse Reads
A quality control application for high throughput sequence data.
This app completed without errors in 47m 17s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • 37A_Trimm_headcrop5_crop140.PElib_paired_33233_61_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • 37A_Trimm_headcrop5_crop140.PElib_paired_33233_61_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 38m 42s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • 37B_Trimm_headcrop5_crop140.PELib_paired_33233_57_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • 37B_Trimm_headcrop5_crop140.PELib_paired_33233_57_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report

2. Classify Taxonomy

Our downstream analysis will generate Species Trees of the organisms present in the compost samples. Classifying the reads with Kaiju first is useful because it predicts the microbial composition from protein similarities, independent of genome assembly and annotation. The Species Trees can later be compared against this prediction.

Taxonomic Classification of Shotgun Metagenomic Read data
This app completed without errors in 1h 44m 29s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip

3. Assemble

Now that we have cleaned the reads, we can move on to the next step: assembling the reads into contiguous sequences (contigs), which form the scaffolding from which genomes are later extracted. KBase offers several metagenomic assemblers and a tool for comparing their output (similar to QUAST). We will run three assembly Apps below.

Note: metaSPAdes only accepts a single library as input, so the App Merge Multiple ReadsLib to One ReadsLib was used to combine the reads from 37A and 37B. Because the combined 37AB reads produced the largest contig and the most contigs over 100,000 bp, it will be used as input for the MEGAHIT and IDBA-UD assemblers.
Note: We will run the MEGAHIT assembler twice using different parameters but the same input data. The first run will use "meta-large" as its preset. This is a setting catered towards large and complex assemblies. The second run will use the "meta-sensitive" preset. This parameter generates a more sensitive assembly but runs slower.
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 15h 19m 47s.
Objects
Created Object Name Type Description
37A_metaSPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: mm_clark:narrative_1528825054112/37A_metaSPAdes.contigs Assembled into 16927 contigs. Avg Length: 8328.566550481479 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 16752 -- 2000.0 to 97323.0 bp 121 -- 97323.0 to 192646.0 bp 28 -- 192646.0 to 287969.0 bp 13 -- 287969.0 to 383292.0 bp 3 -- 383292.0 to 478615.0 bp 1 -- 478615.0 to 573938.0 bp 4 -- 573938.0 to 669261.0 bp 0 -- 669261.0 to 764584.0 bp 1 -- 764584.0 to 859907.0 bp 4 -- 859907.0 to 955230.0 bp
Links
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 12h 13m 16s.
Objects
Created Object Name Type Description
37B_metaSPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: mm_clark:narrative_1528825054112/37B_metaSPAdes.contigs Assembled into 15971 contigs. Avg Length: 9701.227537411558 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 15831 -- 2000.0 to 125927.3 bp 96 -- 125927.3 to 249854.6 bp 30 -- 249854.6 to 373781.9 bp 6 -- 373781.9 to 497709.2 bp 2 -- 497709.2 to 621636.5 bp 3 -- 621636.5 to 745563.8 bp 0 -- 745563.8 to 869491.1 bp 1 -- 869491.1 to 993418.4 bp 0 -- 993418.4 to 1117345.7 bp 2 -- 1117345.7 to 1241273.0 bp
Links
Merge Multiple Reads Libraries into One Reads Library
This app completed without errors in 1h 12m 19s.
Objects
Created Object Name Type Description
37AB_trimm_headcrop5_crop140.PELib_paired PairedEndLibrary 37A and 37B trimmed and merged
Summary
NUM READS LIBRARIES COMBINED INTO ONE READS LIBRARY: 2
Output from Merge Multiple ReadsLibs to One ReadsLib - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 1d 13h 57m 20s.
Objects
Created Object Name Type Description
37AB_metaSPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: mm_clark:narrative_1528825054112/37AB_metaSPAdes.contigs Assembled into 27845 contigs. Avg Length: 9230.545519841982 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 27688 -- 2000.0 to 141523.7 bp 116 -- 141523.7 to 281047.4 bp 29 -- 281047.4 to 420571.10000000003 bp 4 -- 420571.10000000003 to 560094.8 bp 3 -- 560094.8 to 699618.5 bp 1 -- 699618.5 to 839142.2000000001 bp 3 -- 839142.2000000001 to 978665.9000000001 bp 0 -- 978665.9000000001 to 1118189.6 bp 0 -- 1118189.6 to 1257713.3 bp 1 -- 1257713.3 to 1397237.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 2h 42m 54s.
Objects
Created Object Name Type Description
37AB_MEGAHIT_metalarge.contigs Assembly Assembled contigs
Summary
ContigSet saved to: mm_clark:narrative_1528825054112/37AB_MEGAHIT_metalarge.contigs Assembled into 29290 contigs. Avg Length: 8958.983202458177 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 29041 -- 2000.0 to 107079.0 bp 186 -- 107079.0 to 212158.0 bp 36 -- 212158.0 to 317237.0 bp 15 -- 317237.0 to 422316.0 bp 5 -- 422316.0 to 527395.0 bp 1 -- 527395.0 to 632474.0 bp 2 -- 632474.0 to 737553.0 bp 0 -- 737553.0 to 842632.0 bp 2 -- 842632.0 to 947711.0 bp 2 -- 947711.0 to 1052790.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 2h 54m 48s.
Objects
Created Object Name Type Description
37AB_MEGAHIT_metasensitive.contigs Assembly Assembled contigs
Summary
ContigSet saved to: mm_clark:narrative_1528825054112/37AB_MEGAHIT_metasensitive.contigs Assembled into 29070 contigs. Avg Length: 9066.479532163743 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 28774 -- 2000.0 to 95869.9 bp 209 -- 95869.9 to 189739.8 bp 45 -- 189739.8 to 283609.69999999995 bp 22 -- 283609.69999999995 to 377479.6 bp 11 -- 377479.6 to 471349.5 bp 4 -- 471349.5 to 565219.3999999999 bp 1 -- 565219.3999999999 to 659089.2999999999 bp 2 -- 659089.2999999999 to 752959.2 bp 0 -- 752959.2 to 846829.1 bp 2 -- 846829.1 to 940699.0 bp
Links
Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 7h 41m 36s.
Objects
Created Object Name Type Description
37AB_IDBA-UD.contigs Assembly Assembled contigs
Summary
Assembly saved to: mm_clark:narrative_1528825054112/37AB_IDBA-UD.contigs Assembled into 23276 contigs. Avg Length: 8888.39615913 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 22716 -- 2000.0 to 43954.7 bp 371 -- 43954.7 to 85909.4 bp 99 -- 85909.4 to 127864.1 bp 38 -- 127864.1 to 169818.8 bp 26 -- 169818.8 to 211773.5 bp 11 -- 211773.5 to 253728.2 bp 4 -- 253728.2 to 295682.9 bp 4 -- 295682.9 to 337637.6 bp 3 -- 337637.6 to 379592.3 bp 4 -- 379592.3 to 421547.0 bp
Links

4. Compare Contigs

Now that we have six sets of contigs generated from our reads using various assemblers, we can examine them side-by-side to determine the one of highest quality. Using the best assembly will produce more accurate results in downstream analyses.

The table below summarizes the most important values generated by running Compare Assembled Contig Distributions. A good assembly has a high N50, a low L50, and few contigs. The best value in each category is marked with asterisks.

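N50: the length of the shortest contig in the smallest set of longest contigs that together contain at least 50% of the total assembly length.

L50: the number of contigs in that set, that is, the smallest number of contigs whose summed length reaches at least 50% of the total assembly length.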
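
As a minimal sketch of how N50 and L50 can be computed from a list of contig lengths: sort the contigs from longest to shortest and accumulate lengths until half of the total assembly length is reached. The function name and the example lengths are hypothetical, and this is not the code used by the Compare Assembled Contig Distributions App.

```python
from typing import List, Tuple


def n50_l50(contig_lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly defines N50 (its length) and L50 (how many contigs so far).
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: the running sum always reaches half the total")


if __name__ == "__main__":
    example = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths (bp)
    print(n50_l50(example))
```

For the hypothetical lengths above, the total is 300 bp, so the two longest contigs (80 + 70 = 150 bp) are enough to reach half of the assembly: N50 is 70 and L50 is 2.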

Assembly                     Number of Contigs   Longest Contig (bp)   N50       L50      Contigs >= 10^6 bp   Sum Length of Contigs >= 10^6 bp (bp)
37AB_metaSPAdes              27508               1395711               *23793*   *1871*   *2*                  *2791420*
37AB_MEGAHIT_metalarge       32090               1501073               14663     3350     1                    1501073
37AB_MEGAHIT_metasensitive   31055               *1501101*             16426     3008     1                    1501101
37AB_IDBA-UD                 *23276*             421547                15910     2717     0                    0
Note: The Assembly from running metaSPAdes on the combined library generated the highest quality contigs (37AB_metaSPAdes.contigs). Therefore, it will be used for further analysis.
View distributions of contig characteristics for different assemblies
This app completed without errors in 8m 38s.
Summary
ASSEMBLY STATS for 37A_metaSPAdes.contigs
  Len longest contig: 940029 bp
  N50 (L50): 23179 (967)
  N75 (L75): 4913 (4776)
  N90 (L90): 2758 (10599)
  Num contigs >= 1000000 bp: 0
  Num contigs >= 100000 bp: 162
  Num contigs >= 10000 bp: 2129
  Num contigs >= 1000 bp: 16576
  Num contigs >= 500 bp: 16576
  Num contigs >= 1 bp: 16576
  Len contigs >= 1000000 bp: 0 bp
  Len contigs >= 100000 bp: 34022723 bp
  Len contigs >= 10000 bp: 86871366 bp
  Len contigs >= 1000 bp: 139444103 bp
  Len contigs >= 500 bp: 139444103 bp
  Len contigs >= 1 bp: 139444103 bp
ASSEMBLY STATS for 37B_metaSPAdes.contigs
  Len longest contig: 1241273 bp
  N50 (L50): 25790 (990)
  N75 (L75): 6686 (4063)
  N90 (L90): 3042 (9403)
  Num contigs >= 1000000 bp: 2
  Num contigs >= 100000 bp: 188
  Num contigs >= 10000 bp: 2733
  Num contigs >= 1000 bp: 15751
  Num contigs >= 500 bp: 15751
  Num contigs >= 1 bp: 15751
  Len contigs >= 1000000 bp: 2384295 bp
  Len contigs >= 100000 bp: 40194065 bp
  Len contigs >= 10000 bp: 104646915 bp
  Len contigs >= 1000 bp: 153948991 bp
  Len contigs >= 500 bp: 153948991 bp
  Len contigs >= 1 bp: 153948991 bp
ASSEMBLY STATS for 37AB_metaSPAdes.contigs
  Len longest contig: 1395711 bp
  N50 (L50): 23793 (1871)
  N75 (L75): 6077 (7642)
  N90 (L90): 2998 (16930)
  Num contigs >= 1000000 bp: 2
  Num contigs >= 100000 bp: 294
  Num contigs >= 10000 bp: 4567
  Num contigs >= 1000 bp: 27508
  Num contigs >= 500 bp: 27508
  Num contigs >= 1 bp: 27508
  Len contigs >= 1000000 bp: 2791420 bp
  Len contigs >= 100000 bp: 57207896 bp
  Len contigs >= 10000 bp: 168877596 bp
  Len contigs >= 1000 bp: 256629856 bp
  Len contigs >= 500 bp: 256629856 bp
  Len contigs >= 1 bp: 256629856 bp
ASSEMBLY STATS for 37AB_MEGAHIT_metalarge.contigs
  Len longest contig: 1501073 bp
  N50 (L50): 14663 (3350)
  N75 (L75): 5247 (11039)
  N90 (L90): 2868 (21229)
  Num contigs >= 1000000 bp: 1
  Num contigs >= 100000 bp: 194
  Num contigs >= 10000 bp: 5366
  Num contigs >= 1000 bp: 32090
  Num contigs >= 500 bp: 32090
  Num contigs >= 1 bp: 32090
  Len contigs >= 1000000 bp: 1501073 bp
  Len contigs >= 100000 bp: 35667574 bp
  Len contigs >= 10000 bp: 153312343 bp
  Len contigs >= 1000 bp: 258307395 bp
  Len contigs >= 500 bp: 258307395 bp
  Len contigs >= 1 bp: 258307395 bp
ASSEMBLY STATS for 37AB_MEGAHIT_metasensitive.contigs
  Len longest contig: 1501101 bp
  N50 (L50): 16426 (3008)
  N75 (L75): 5431 (10198)
  N90 (L90): 2893 (20226)
  Num contigs >= 1000000 bp: 1
  Num contigs >= 100000 bp: 183
  Num contigs >= 10000 bp: 5301
  Num contigs >= 1000 bp: 31055
  Num contigs >= 500 bp: 31055
  Num contigs >= 1 bp: 31055
  Len contigs >= 1000000 bp: 1501101 bp
  Len contigs >= 100000 bp: 36201788 bp
  Len contigs >= 10000 bp: 158541746 bp
  Len contigs >= 1000 bp: 258865169 bp
  Len contigs >= 500 bp: 258865169 bp
  Len contigs >= 1 bp: 258865169 bp
ASSEMBLY STATS for 37AB_IDBA-UD.contigs
  Len longest contig: 421547 bp
  N50 (L50): 15910 (2717)
  N75 (L75): 6126 (8080)
  N90 (L90): 3191 (15203)
  Num contigs >= 1000000 bp: 0
  Num contigs >= 100000 bp: 150
  Num contigs >= 10000 bp: 4835
  Num contigs >= 1000 bp: 23276
  Num contigs >= 500 bp: 23276
  Num contigs >= 1 bp: 23276
  Len contigs >= 1000000 bp: 0 bp
  Len contigs >= 100000 bp: 25040764 bp
  Len contigs >= 10000 bp: 130110156 bp
  Len contigs >= 1000 bp: 206886309 bp
  Len contigs >= 500 bp: 206886309 bp
  Len contigs >= 1 bp: 206886309 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

5. Bin Contigs

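Having assembled the contigs, the next step is to cluster them into bins, each of which corresponds to a putative population genome. To accomplish this, we will use MaxBin2 Contig Binning, and then bin the same assembly with MetaBAT2 and CONCOCT so that the three sets of bins can be combined in the following section.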
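
Binning tools rely in part on nucleotide composition, which is commonly summarized as tetranucleotide (4-mer) frequencies of each contig. The sketch below computes such a frequency vector for one sequence. It is a simplified illustration: the function name is hypothetical, k = 4 is chosen for the example, reverse complements are not merged, and it does not reproduce the internals of any KBase binning App.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Return the relative frequency of every A/C/G/T k-mer in seq."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A, C, G, T are reported; windows containing
    # other symbols (such as N) are ignored.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT")  # hypothetical toy contig
    print(len(freqs), freqs["ACGT"])
```

Contigs from the same genome tend to have similar composition vectors, so a binner can combine this signal with depth of coverage to group contigs.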

Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 4h 22m 55s.
Objects
Created Object Name Type Description
37AB_metaSPAdes_MaxBin2-0.8prob-107markers.BinnedContigs BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • maxbin_result.zip - File(s) generated by MaxBin2 App
Output from Bin Contigs using MaxBin2 - v2.2.4
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Bin metagenomic contigs
This app completed without errors in 4h 52m 54s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • metabat_result.zip - Files generated by MetaBAT2 App
Output from MetaBAT2 Contig Binning - v1.7
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage and nucleotide composition
This app completed without errors in 6h 9m 7s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • concoct_result.zip - Files generated by CONCOCT App

6. Optimize Binned Contigs by Consensus

Having binned the contigs with three methods, the next step is to improve the quality of the binning using consensus assignments. To accomplish this, we will use DAS-Tool to develop consensus Binned Contigs from the BinnedContigs output by MaxBin2, MetaBAT2, and CONCOCT. This yields 40 bins with at least 50% completeness (using the DAS-Tool set of Single Copy Genes).

Optimize bacterial or archaeal genome bins using a dereplication, aggregation and scoring strategy
This app completed without errors in 28m 29s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • das_tool_result.zip - Files generated by kb_das_tool App
The viewer for the data in this Cell is available at the original Narrative here: https://narrative.kbase.us/narrative/33233

7. Bin Quality Assessment

Quality control is a necessary step at every level of analysis to ensure the highest quality outcome and to avoid error propagation. We first examine the quality of the bins with Assess Genome Quality with CheckM to determine which thresholds to use to capture the High Quality (HQ) bins in the next step, Filter Bins by Quality with CheckM.

Note: From examination of the CheckM Table, we found that 23 of the 40 bins meet the very strict thresholds of >= 95% completeness and <= 2% contamination, and 36 of the 40 meet the still quite high thresholds of >= 90% completeness and <= 5% contamination. For the purposes of this study, we chose to continue analysis on the latter set of 36 bins at 90%/5%. Researchers should decide which thresholds are appropriate for their own study.
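
The threshold filtering described above can be sketched as a simple comparison per bin. The bin names and quality values below are hypothetical placeholders, not results from this Narrative, and the function is an illustration rather than the implementation of Filter Bins by Quality with CheckM.

```python
from typing import Dict, List, Tuple


def filter_bins(
    quality: Dict[str, Tuple[float, float]],
    min_completeness: float,
    max_contamination: float,
) -> List[str]:
    """Return the names of bins that pass both thresholds.

    `quality` maps a bin name to (completeness %, contamination %).
    """
    return sorted(
        name
        for name, (completeness, contamination) in quality.items()
        if completeness >= min_completeness and contamination <= max_contamination
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    example = {
        "Bin.A": (98.2, 0.5),
        "Bin.B": (92.0, 3.1),
        "Bin.C": (75.4, 1.0),
        "Bin.D": (96.0, 6.2),
    }
    print(filter_bins(example, 95.0, 2.0))  # strict 95%/2% thresholds
    print(filter_bins(example, 90.0, 5.0))  # relaxed 90%/5% thresholds
```

Relaxing the thresholds can only keep the same bins or add more, which is why the 90%/5% set contains the 95%/2% set.
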
Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 1h 7m 13s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM
Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes. Creates a new BinnedContigs object with High Quality bins that pass user-defined thresholds for Completeness and Contamination.
This app completed without errors in 1h 1m 35s.
Objects
Created Object Name Type Description
37AB-metaSPAdes-MaxBin2_MetaBAT2_CONCOCT_DasTool-HQ_95-2.BinnedContigs BinnedContigs HQ BinnedContigs 37AB-metaSPAdes-MaxBin2_MetaBAT2_CONCOCT_DasTool-HQ_95-2.BinnedContigs
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM
The viewer for the data in this Cell is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes. Creates a new BinnedContigs object with High Quality bins that pass user-defined thresholds for Completeness and Contamination.
This app completed without errors in 1h 17m 25s.
Objects
Created Object Name Type Description
37AB-metaSPAdes-MaxBin2_MetaBAT2_CONCOCT_DasTool-HQ_90-5.BinnedContigs BinnedContigs HQ BinnedContigs 37AB-metaSPAdes-MaxBin2_MetaBAT2_CONCOCT_DasTool-HQ_90-5.BinnedContigs
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM
The viewer for the data in this Cell is available at the original Narrative here: https://narrative.kbase.us/narrative/33233

8. Extract Individual Assemblies

To use the desired 36 high quality bins in downstream Apps, we need them to be in the form of Assembly objects. This is achieved by running Extract Bins as Assemblies from BinnedContigs.

Note: The default is to extract ALL bins. Since we have already filtered the BinnedContigs object to just the High Quality set, we can use the default. To extract only some of the bins, we would need to specify the ones we want.
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 1h 26m 23s.
Objects
Created Object Name Type Description
37AB-metaSPAdes-MaxBin2_MetaBAT2_CONCOCT_BowTie2_DASTool-HQ_90-5.AssemblySet AssemblySet Assembly set of extracted assemblies
Bin.052.fasta_assembly Assembly Assembly object of extracted contigs
Bin.063.fasta_assembly Assembly Assembly object of extracted contigs
Bin.013.fasta_assembly Assembly Assembly object of extracted contigs
Bin.047.fasta_assembly Assembly Assembly object of extracted contigs
Bin.025.fasta_assembly Assembly Assembly object of extracted contigs
Bin.062.fasta_assembly Assembly Assembly object of extracted contigs
Bin.037.fasta_assembly Assembly Assembly object of extracted contigs
Bin.067.fasta_assembly Assembly Assembly object of extracted contigs
Bin.058.fasta_assembly Assembly Assembly object of extracted contigs
Bin.039.fasta_assembly Assembly Assembly object of extracted contigs
Bin.028.fasta_assembly Assembly Assembly object of extracted contigs
Bin.044.fasta_assembly Assembly Assembly object of extracted contigs
Bin.005.fasta_assembly Assembly Assembly object of extracted contigs
Bin.078.fasta_assembly Assembly Assembly object of extracted contigs
Bin.020.fasta_assembly Assembly Assembly object of extracted contigs
Bin.053.fasta_assembly Assembly Assembly object of extracted contigs
Bin.043.fasta_assembly Assembly Assembly object of extracted contigs
Bin.029.fasta_assembly Assembly Assembly object of extracted contigs
Bin.061.fasta_assembly Assembly Assembly object of extracted contigs
Bin.098.fasta_assembly Assembly Assembly object of extracted contigs
Bin.003.fasta_assembly Assembly Assembly object of extracted contigs
Bin.018.fasta_assembly Assembly Assembly object of extracted contigs
Bin.004.fasta_assembly Assembly Assembly object of extracted contigs
Bin.042.fasta_assembly Assembly Assembly object of extracted contigs
Bin.076.fasta_assembly Assembly Assembly object of extracted contigs
Bin.011.fasta_assembly Assembly Assembly object of extracted contigs
Bin.041.fasta_assembly Assembly Assembly object of extracted contigs
Bin.059.fasta_assembly Assembly Assembly object of extracted contigs
Bin.056.fasta_assembly Assembly Assembly object of extracted contigs
Bin.057.fasta_assembly Assembly Assembly object of extracted contigs
Bin.014.fasta_assembly Assembly Assembly object of extracted contigs
Bin.021.fasta_assembly Assembly Assembly object of extracted contigs
Bin.080.fasta_assembly Assembly Assembly object of extracted contigs
Bin.077.fasta_assembly Assembly Assembly object of extracted contigs
Bin.033.fasta_assembly Assembly Assembly object of extracted contigs
Bin.065.fasta_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 33233/625/1, 33233/626/1, 33233/416/2, 33233/627/1, 33233/628/1, 33233/431/2, 33233/629/1, 33233/630/1, 33233/631/1, 33233/632/1, 33233/633/1, 33233/634/1, 33233/409/2, 33233/635/1, 33233/636/1, 33233/637/1, 33233/638/1, 33233/639/1, 33233/640/1, 33233/641/1, 33233/407/2, 33233/642/1, 33233/408/2, 33233/643/1, 33233/644/1, 33233/414/2, 33233/645/1, 33233/430/2, 33233/429/2, 33233/646/1, 33233/417/2, 33233/647/1, 33233/648/1, 33233/649/1, 33233/427/2, 33233/650/1 Generated Assembly Set: 33233/651/1

9. Annotate Genomes

Since we now have the high quality bins in Assembly object form (and collected into an AssemblySet object), we will use Annotate Multiple Microbial Assemblies with RASTtk to turn them into annotated Genomes using the RAST (Rapid Annotations using Subsystems Technology) pipeline.

Note: If you wish to just do a limited number of annotations, you can run them separately with the Annotate Microbial Assembly App.

Once the high quality bins have all been annotated, the annotation App creates a GenomeSet, which will be used as input for the downstream analyses.

Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 2h 19m 19s.
Objects
Created Object Name Type Description
Bin.047.fasta_assembly.RAST Genome Annotated genome
Bin.025.fasta_assembly.RAST Genome Annotated genome
Bin.077.fasta_assembly.RAST Genome Annotated genome
Bin.037.fasta_assembly.RAST Genome Annotated genome
Bin.033.fasta_assembly.RAST Genome Annotated genome
Bin.078.fasta_assembly.RAST Genome Annotated genome
Bin.014.fasta_assembly.RAST Genome Annotated genome
Bin.044.fasta_assembly.RAST Genome Annotated genome
Bin.053.fasta_assembly.RAST Genome Annotated genome
Bin.020.fasta_assembly.RAST Genome Annotated genome
Bin.058.fasta_assembly.RAST Genome Annotated genome
Bin.067.fasta_assembly.RAST Genome Annotated genome
Bin.028.fasta_assembly.RAST Genome Annotated genome
Bin.039.fasta_assembly.RAST Genome Annotated genome
Bin.065.fasta_assembly.RAST Genome Annotated genome
Bin.059.fasta_assembly.RAST Genome Annotated genome
Bin.062.fasta_assembly.RAST Genome Annotated genome
Bin.005.fasta_assembly.RAST Genome Annotated genome
Bin.029.fasta_assembly.RAST Genome Annotated genome
Bin.004.fasta_assembly.RAST Genome Annotated genome
Bin.043.fasta_assembly.RAST Genome Annotated genome
Bin.011.fasta_assembly.RAST Genome Annotated genome
Bin.003.fasta_assembly.RAST Genome Annotated genome
Bin.013.fasta_assembly.RAST Genome Annotated genome
Bin.057.fasta_assembly.RAST Genome Annotated genome
Bin.056.fasta_assembly.RAST Genome Annotated genome
Bin.041.fasta_assembly.RAST Genome Annotated genome
Bin.052.fasta_assembly.RAST Genome Annotated genome
Bin.080.fasta_assembly.RAST Genome Annotated genome
Bin.063.fasta_assembly.RAST Genome Annotated genome
Bin.021.fasta_assembly.RAST Genome Annotated genome
Bin.018.fasta_assembly.RAST Genome Annotated genome
Bin.098.fasta_assembly.RAST Genome Annotated genome
Bin.076.fasta_assembly.RAST Genome Annotated genome
Bin.042.fasta_assembly.RAST Genome Annotated genome
Bin.061.fasta_assembly.RAST Genome Annotated genome
37AB-metaSPAdes-DASTool-HQ_90-5.GenomeSet GenomeSet Genome Set
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 72 contigs containing 3555162 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3672 new features were called, of which 357 are non-coding.
Output genome has the following feature types:
	Coding gene                     3315 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat         157 
	Non-coding crispr_spacer         156 
	Non-coding rna                    43 
Overall, the genes have 1814 distinct functions. 
The genes include 1750 genes with a SEED annotation ontology across 916 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.047.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 27 contigs containing 2818397 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2628 new features were called, of which 32 are non-coding.
Output genome has the following feature types:
	Coding gene                     2596 
	Non-coding rna                    32 
Overall, the genes have 1354 distinct functions. 
The genes include 1273 genes with a SEED annotation ontology across 781 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.025.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 128 contigs containing 4423359 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3762 new features were called, of which 45 are non-coding.
Output genome has the following feature types:
	Coding gene                     3717 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat           7 
	Non-coding crispr_spacer           6 
	Non-coding repeat                  2 
	Non-coding rna                    29 
Overall, the genes have 1716 distinct functions. 
The genes include 1797 genes with a SEED annotation ontology across 859 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.077.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 8 contigs containing 3009503 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3025 new features were called, of which 49 are non-coding.
Output genome has the following feature types:
	Coding gene                     2976 
	Non-coding repeat                  2 
	Non-coding rna                    47 
Overall, the genes have 1729 distinct functions. 
The genes include 1580 genes with a SEED annotation ontology across 979 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.037.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 664 contigs containing 4711183 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4830 new features were called, of which 68 are non-coding.
Output genome has the following feature types:
	Coding gene                     4762 
	Non-coding repeat                 44 
	Non-coding rna                    24 
Overall, the genes have 1699 distinct functions. 
The genes include 2025 genes with a SEED annotation ontology across 881 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.033.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 57 contigs containing 3885231 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4110 new features were called, of which 527 are non-coding.
Output genome has the following feature types:
	Coding gene                     3583 
	Non-coding crispr_array            3 
	Non-coding crispr_repeat         233 
	Non-coding crispr_spacer         230 
	Non-coding repeat                 12 
	Non-coding rna                    49 
Overall, the genes have 1323 distinct functions. 
The genes include 2209 genes with a SEED annotation ontology across 815 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.078.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 69 contigs containing 2896910 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2798 new features were called, of which 34 are non-coding.
Output genome has the following feature types:
	Coding gene                     2764 
	Non-coding repeat                  2 
	Non-coding rna                    32 
Overall, the genes have 1272 distinct functions. 
The genes include 1291 genes with a SEED annotation ontology across 715 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.014.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 62 contigs containing 4602386 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4072 new features were called, of which 36 are non-coding.
Output genome has the following feature types:
	Coding gene                     4036 
	Non-coding rna                    36 
Overall, the genes have 1585 distinct functions. 
The genes include 1883 genes with a SEED annotation ontology across 885 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.044.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 65 contigs containing 2921546 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2983 new features were called, of which 93 are non-coding.
Output genome has the following feature types:
	Coding gene                     2890 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          25 
	Non-coding crispr_spacer          24 
	Non-coding repeat                  2 
	Non-coding rna                    41 
Overall, the genes have 1916 distinct functions. 
The genes include 1510 genes with a SEED annotation ontology across 1033 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.053.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 225 contigs containing 4363961 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4884 new features were called, of which 346 are non-coding.
Output genome has the following feature types:
	Coding gene                     4538 
	Non-coding crispr_array            3 
	Non-coding crispr_repeat         144 
	Non-coding crispr_spacer         141 
	Non-coding repeat                 16 
	Non-coding rna                    42 
Overall, the genes have 2572 distinct functions. 
The genes include 1868 genes with a SEED annotation ontology across 1157 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.020.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 498 contigs containing 4388328 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4214 new features were called, of which 99 are non-coding.
Output genome has the following feature types:
	Coding gene                     4115 
	Non-coding repeat                 71 
	Non-coding rna                    28 
Overall, the genes have 1780 distinct functions. 
The genes include 1836 genes with a SEED annotation ontology across 873 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.058.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 37 contigs containing 4214068 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4295 new features were called, of which 352 are non-coding.
Output genome has the following feature types:
	Coding gene                     3943 
	Non-coding crispr_array            4 
	Non-coding crispr_repeat         151 
	Non-coding crispr_spacer         147 
	Non-coding repeat                 11 
	Non-coding rna                    39 
Overall, the genes have 2037 distinct functions. 
The genes include 1856 genes with a SEED annotation ontology across 1124 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.067.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 326 contigs containing 3069604 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3116 new features were called, of which 33 are non-coding.
Output genome has the following feature types:
	Coding gene                     3083 
	Non-coding repeat                  2 
	Non-coding rna                    31 
Overall, the genes have 1877 distinct functions. 
The genes include 1566 genes with a SEED annotation ontology across 1036 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.028.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 107 contigs containing 3853049 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3907 new features were called, of which 60 are non-coding.
Output genome has the following feature types:
	Coding gene                     3847 
	Non-coding repeat                 21 
	Non-coding rna                    39 
Overall, the genes have 1940 distinct functions. 
The genes include 1876 genes with a SEED annotation ontology across 1082 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.039.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 138 contigs containing 4543190 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3793 new features were called, of which 66 are non-coding.
Output genome has the following feature types:
	Coding gene                     3727 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat           3 
	Non-coding crispr_spacer           2 
	Non-coding repeat                 28 
	Non-coding rna                    32 
Overall, the genes have 1713 distinct functions. 
The genes include 1833 genes with a SEED annotation ontology across 848 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.065.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 164 contigs containing 3174826 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3035 new features were called, of which 46 are non-coding.
Output genome has the following feature types:
	Coding gene                     2989 
	Non-coding repeat                  9 
	Non-coding rna                    37 
Overall, the genes have 1442 distinct functions. 
The genes include 1580 genes with a SEED annotation ontology across 825 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.059.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 45 contigs containing 2474027 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2241 new features were called, of which 37 are non-coding.
Output genome has the following feature types:
	Coding gene                     2204 
	Non-coding rna                    37 
Overall, the genes have 1131 distinct functions. 
The genes include 1099 genes with a SEED annotation ontology across 671 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.062.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 107 contigs containing 3093537 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2941 new features were called, of which 46 are non-coding.
Output genome has the following feature types:
	Coding gene                     2895 
	Non-coding repeat                  2 
	Non-coding rna                    44 
Overall, the genes have 1669 distinct functions. 
The genes include 1565 genes with a SEED annotation ontology across 992 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.005.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 61 contigs containing 2796035 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2710 new features were called, of which 47 are non-coding.
Output genome has the following feature types:
	Coding gene                     2663 
	Non-coding repeat                  8 
	Non-coding rna                    39 
Overall, the genes have 1888 distinct functions. 
The genes include 1385 genes with a SEED annotation ontology across 1104 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.029.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 84 contigs containing 1834279 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 1841 new features were called, of which 43 are non-coding.
Output genome has the following feature types:
	Coding gene                     1798 
	Non-coding repeat                  8 
	Non-coding rna                    35 
Overall, the genes have 1070 distinct functions. 
The genes include 1022 genes with a SEED annotation ontology across 674 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.004.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 402 contigs containing 4749785 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 5070 new features were called, of which 53 are non-coding.
Output genome has the following feature types:
	Coding gene                     5017 
	Non-coding repeat                 12 
	Non-coding rna                    41 
Overall, the genes have 2756 distinct functions. 
The genes include 1976 genes with a SEED annotation ontology across 1103 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.043.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 74 contigs containing 2343119 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2372 new features were called, of which 45 are non-coding.
Output genome has the following feature types:
	Coding gene                     2327 
	Non-coding repeat                  6 
	Non-coding rna                    39 
Overall, the genes have 1708 distinct functions. 
The genes include 1314 genes with a SEED annotation ontology across 1029 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.011.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 221 contigs containing 3717720 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3698 new features were called, of which 172 are non-coding.
Output genome has the following feature types:
	Coding gene                     3526 
	Non-coding crispr_array            2 
	Non-coding crispr_repeat          66 
	Non-coding crispr_spacer          64 
	Non-coding repeat                  6 
	Non-coding rna                    34 
Overall, the genes have 1662 distinct functions. 
The genes include 1779 genes with a SEED annotation ontology across 868 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.003.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 48 contigs containing 3936667 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3528 new features were called, of which 63 are non-coding.
Output genome has the following feature types:
	Coding gene                     3465 
	Non-coding repeat                 26 
	Non-coding rna                    37 
Overall, the genes have 1607 distinct functions. 
The genes include 1617 genes with a SEED annotation ontology across 877 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.013.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 60 contigs containing 3633522 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3398 new features were called, of which 213 are non-coding.
Output genome has the following feature types:
	Coding gene                     3185 
	Non-coding crispr_array            2 
	Non-coding crispr_repeat          84 
	Non-coding crispr_spacer          82 
	Non-coding repeat                  8 
	Non-coding rna                    37 
Overall, the genes have 1560 distinct functions. 
The genes include 1670 genes with a SEED annotation ontology across 889 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.057.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 27 contigs containing 4247528 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3765 new features were called, of which 40 are non-coding.
Output genome has the following feature types:
	Coding gene                     3725 
	Non-coding rna                    40 
Overall, the genes have 1689 distinct functions. 
The genes include 1765 genes with a SEED annotation ontology across 898 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.056.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 87 contigs containing 1907535 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 1999 new features were called, of which 102 are non-coding.
Output genome has the following feature types:
	Coding gene                     1897 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          38 
	Non-coding crispr_spacer          37 
	Non-coding rna                    26 
Overall, the genes have 1161 distinct functions. 
The genes include 1125 genes with a SEED annotation ontology across 773 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.041.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 75 contigs containing 3259393 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2923 new features were called, of which 40 are non-coding.
Output genome has the following feature types:
	Coding gene                     2883 
	Non-coding rna                    40 
Overall, the genes have 1216 distinct functions. 
The genes include 1538 genes with a SEED annotation ontology across 721 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.052.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 34 contigs containing 2315542 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2279 new features were called, of which 44 are non-coding.
Output genome has the following feature types:
	Coding gene                     2235 
	Non-coding repeat                  6 
	Non-coding rna                    38 
Overall, the genes have 1213 distinct functions. 
The genes include 1126 genes with a SEED annotation ontology across 705 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.080.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 377 contigs containing 2383018 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2615 new features were called, of which 143 are non-coding.
Output genome has the following feature types:
	Coding gene                     2472 
	Non-coding crispr_array            2 
	Non-coding crispr_repeat          30 
	Non-coding crispr_spacer          28 
	Non-coding repeat                 50 
	Non-coding rna                    33 
Overall, the genes have 1051 distinct functions. 
The genes include 1410 genes with a SEED annotation ontology across 700 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.063.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 29 contigs containing 5288769 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 5355 new features were called, of which 285 are non-coding.
Output genome has the following feature types:
	Coding gene                     5070 
	Non-coding crispr_array            4 
	Non-coding crispr_repeat         110 
	Non-coding crispr_spacer         106 
	Non-coding repeat                 18 
	Non-coding rna                    47 
Overall, the genes have 2255 distinct functions. 
The genes include 2512 genes with a SEED annotation ontology across 1167 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.021.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 47 contigs containing 2340301 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2450 new features were called, of which 104 are non-coding.
Output genome has the following feature types:
	Coding gene                     2346 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          33 
	Non-coding crispr_spacer          32 
	Non-coding rna                    38 
Overall, the genes have 1128 distinct functions. 
The genes include 1234 genes with a SEED annotation ontology across 795 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.018.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 41 contigs containing 3018906 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3161 new features were called, of which 489 are non-coding.
Output genome has the following feature types:
	Coding gene                     2672 
	Non-coding crispr_array            4 
	Non-coding crispr_repeat         222 
	Non-coding crispr_spacer         218 
	Non-coding repeat                  2 
	Non-coding rna                    43 
Overall, the genes have 1690 distinct functions. 
The genes include 1487 genes with a SEED annotation ontology across 1012 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.098.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 108 contigs containing 1972189 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2038 new features were called, of which 35 are non-coding.
Output genome has the following feature types:
	Coding gene                     2003 
	Non-coding rna                    35 
Overall, the genes have 1158 distinct functions. 
The genes include 1134 genes with a SEED annotation ontology across 751 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.076.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 98 contigs containing 4844810 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4244 new features were called, of which 45 are non-coding.
Output genome has the following feature types:
	Coding gene                     4199 
	Non-coding repeat                  2 
	Non-coding rna                    43 
Overall, the genes have 1826 distinct functions. 
The genes include 2102 genes with a SEED annotation ontology across 922 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.042.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 31 contigs containing 3906929 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3707 new features were called, of which 203 are non-coding.
Output genome has the following feature types:
	Coding gene                     3504 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          83 
	Non-coding crispr_spacer          82 
	Non-coding repeat                  4 
	Non-coding rna                    33 
Overall, the genes have 1465 distinct functions. 
The genes include 1693 genes with a SEED annotation ontology across 852 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.061.fasta_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • annotation_report.37AB-metaSPAdes-DASTool-HQ_90-5.GenomeSet - Microbial Annotation Report
v1 - KBaseSearch.GenomeSet-2.1
The viewer for the data in this Cell is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
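
The per-genome RAST reports above follow a regular layout, so they can be tabulated and sanity-checked outside the Narrative. The following is a minimal sketch, not part of the RAST App: it assumes each report block ends with its "<name> succeeded!" line, and the function name parse_rast_log is hypothetical. The sample text is abridged from two of the blocks above.

```python
import re

# Two report blocks abridged from the log above (assumed layout: each block
# is closed by the "<assembly name> succeeded!" line).
SAMPLE_LOG = """\
The RAST algorithm was applied to annotating a genome sequence comprised of 87 contigs containing 1907535 nucleotides.
In addition to the remaining original 0 coding features and 0 non-coding features, 1999 new features were called, of which 102 are non-coding.
Output genome has the following feature types:
    Coding gene                     1897
    Non-coding crispr_array            1
    Non-coding crispr_repeat          38
    Non-coding crispr_spacer          37
    Non-coding rna                    26
Bin.041.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 75 contigs containing 3259393 nucleotides.
In addition to the remaining original 0 coding features and 0 non-coding features, 2923 new features were called, of which 40 are non-coding.
Output genome has the following feature types:
    Coding gene                     2883
    Non-coding rna                    40
Bin.052.fasta_assembly succeeded!
"""


def parse_rast_log(text):
    """Return one dict per report block (hypothetical helper, not a KBase API)."""
    records, current = [], {"features": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.search(r"comprised of (\d+) contigs containing (\d+) nucleotides", line)
        if m:
            current["contigs"] = int(m.group(1))
            current["nucleotides"] = int(m.group(2))
            continue
        m = re.search(r"(\d+) new features were called, of which (\d+) are non-coding", line)
        if m:
            current["new_features"] = int(m.group(1))
            current["non_coding"] = int(m.group(2))
            continue
        m = re.match(r"((?:Coding|Non-coding) \S+)\s+(\d+)$", line)
        if m:
            current["features"][m.group(1)] = int(m.group(2))
            continue
        m = re.match(r"(\S+) succeeded!$", line)
        if m:
            current["name"] = m.group(1)
            records.append(current)
            current = {"features": {}}
    return records


def is_consistent(rec):
    """Check that the feature-type table adds up to the reported totals."""
    non_coding = sum(n for kind, n in rec["features"].items()
                     if kind.startswith("Non-coding"))
    coding = rec["features"].get("Coding gene", 0)
    return (non_coding == rec["non_coding"]
            and coding + non_coding == rec["new_features"])


records = parse_rast_log(SAMPLE_LOG)
for rec in records:
    print(rec["name"], rec["contigs"], rec["features"]["Coding gene"], is_consistent(rec))
```

The consistency check simply confirms that the coding and non-coding counts in each feature-type table sum to the "new features" total reported on the line above it.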

10a. Taxonomic Classification of MAGs

Our new GenomeSet can be used as input for GTDB-Tk Classify, which will give us a phylogenetically-based taxonomic classification of the MAGs. This approach uses single-copy phylogenetic markers to place each Genome into the GTDB species tree.
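
GTDB classifications are written as semicolon-separated lineage strings with rank prefixes (d__, p__, c__, o__, f__, g__, s__). As a rough sketch of how such a string could be unpacked when post-processing the classification table, assuming that format, the helper names below are hypothetical and the example lineage is illustrative only, not a result from this Narrative.

```python
# Rank prefixes used in GTDB lineage strings, in order from domain to species.
GTDB_RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb_lineage(lineage):
    """Split a GTDB lineage string into {rank: name}; empty ranks map to None."""
    ranks = {}
    for field in lineage.split(";"):
        field = field.strip()
        if not field:
            continue
        prefix, _, name = field.partition("__")
        if prefix not in GTDB_RANKS:
            raise ValueError(f"unexpected rank prefix in {field!r}")
        ranks[GTDB_RANKS[prefix]] = name or None
    return ranks


def lowest_assigned_rank(ranks):
    """Return (rank, name) for the most specific rank that has a name."""
    for rank in reversed(list(GTDB_RANKS.values())):
        if ranks.get(rank):
            return rank, ranks[rank]
    return None


# Illustrative lineage only: the species field is left empty, as it would be
# for a genome that could not be assigned to a named species.
example = "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__"
print(lowest_assigned_rank(parse_gtdb_lineage(example)))
```

Reporting the lowest assigned rank is a convenient way to see at a glance which MAGs are placed only to genus or family level.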

Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver R06-RS202
This app completed without errors in 1h 15m 36s.
Objects
Created Object Name Type Description
Bin.033.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.062.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.059.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.011.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.005.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.004.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.013.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.003.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.014.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.056.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.047.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.025.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.077.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.037.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.078.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.044.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.053.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.020.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.058.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.067.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.028.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.039.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.065.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.029.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.043.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.057.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.041.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.052.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.080.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.063.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.021.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.018.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.098.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.076.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.042.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
Bin.061.fasta_assembly.RAST Genome Taxonomy and taxon_assignment updated with GTDB
37AB-metaSPAdes-DASTool-HQ_90-5.GenomeSet GenomeSet Taxonomy and taxon_assignment updated with GTDB

10b. Find Relatives with Species Tree

Our new GenomeSet can be used as input for Insert Set of Genomes Into Species Tree, which will give us an initial phylogenetic placement of the bins.

Note: Make sure that 'Copy public genomes to your workspace' is unchecked because we are not ready to determine which genomes from RefSeq we want to include in downstream comparisons yet.

The current implementation of Insert Set of Genomes Into Species Tree tends to overemphasize proximal genomes at the expense of phylogenetic diversity. Future versions will remedy this shortcoming. In the meantime, we work around it manually to remove excessive genome attractors: we split the bins into clades based on the initial tree, which we will call Clades A-K, and use Build GenomeSet to group the bins into a GenomeSet for each clade.

    Clade A: bins 059, 042, 003, 057
    Clade B: bin 080
    Clade C: bin 052
    Clade D: bins 077, 065, 061, 044, 058, 062, 014, 013, 025, 033, 056
    Clade E: bin 063
    Clade F: bin 078
    Clade G: bins 043, 004, 047, 041, 076
    Clade H: bins 029, 011
    Clade I: bins 067, 028, 098, 005
    Clade J: bin 018
    Clade K: bins 021, 037, 053, 039, 020
To get more proximal RefSeq genomes for each clade, we will rerun Insert Set of Genomes Into Species Tree for each of the eleven clades.
Note: This time we do check 'Copy public genomes into workspace' to allow for later comparison of genomes within the clade.
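
When splitting bins by hand like this, it is easy to drop or duplicate a bin. The clade assignment listed above can be written down as a small lookup table and checked before building the per-clade GenomeSets. This is only a bookkeeping sketch (the GenomeSets themselves are built with the Build GenomeSet App); the variable and function names are hypothetical, and the Genome object naming follows the Bin.NNN.fasta_assembly.RAST pattern shown in section 10a.

```python
# Clade membership as listed in the text above.
CLADES = {
    "A": ["059", "042", "003", "057"],
    "B": ["080"],
    "C": ["052"],
    "D": ["077", "065", "061", "044", "058", "062", "014", "013", "025", "033", "056"],
    "E": ["063"],
    "F": ["078"],
    "G": ["043", "004", "047", "041", "076"],
    "H": ["029", "011"],
    "I": ["067", "028", "098", "005"],
    "J": ["018"],
    "K": ["021", "037", "053", "039", "020"],
}


def genome_names(bins):
    """Map bin numbers to the annotated Genome object names used in 10a."""
    return [f"Bin.{b}.fasta_assembly.RAST" for b in bins]


all_bins = [b for bins in CLADES.values() for b in bins]
duplicates = sorted({b for b in all_bins if all_bins.count(b) > 1})

print("clades:", len(CLADES), "bins:", len(all_bins), "duplicates:", duplicates)
for clade, bins in CLADES.items():
    print(f"Clade_{clade}.GenomeSet", len(bins), genome_names(bins))
```

The per-clade sizes printed here can be compared against the "genomes in output set" line that Build GenomeSet reports for each clade.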
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 11m 13s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Bins.003-098_plus36_RefSeq_prox.SpeciesTree.newick
  • Bins.003-098_plus36_RefSeq_prox.SpeciesTree-labels.newick
  • Bins.003-098_plus36_RefSeq_prox.SpeciesTree.png
  • Bins.003-098_plus36_RefSeq_prox.SpeciesTree.pdf
Allows users to create a GenomeSet object.
This app completed without errors in 12s.
Objects
Created Object Name Type Description
Clade_A.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_A.GenomeSet: 4
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 60s.
Objects
Created Object Name Type Description
Clade_B.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_B.GenomeSet: 1
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 7s.
Objects
Created Object Name Type Description
Clade_C.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_C.GenomeSet: 1
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 22s.
Objects
Created Object Name Type Description
Clade_D.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_D.GenomeSet: 11
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 7s.
Objects
Created Object Name Type Description
Clade_E.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_E.GenomeSet: 1
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 7s.
Objects
Created Object Name Type Description
Clade_F.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_F.GenomeSet: 1
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 11s.
Objects
Created Object Name Type Description
Clade_G.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_G.GenomeSet: 5
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 7s.
Objects
Created Object Name Type Description
Clade_H.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_H.GenomeSet: 2
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 11s.
Objects
Created Object Name Type Description
Clade_I.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_I.GenomeSet: 4
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 15s.
Objects
Created Object Name Type Description
Clade_J.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_J.GenomeSet: 1
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Allows users to create a GenomeSet object.
This app completed without errors in 13s.
Objects
Created Object Name Type Description
Clade_K.GenomeSet GenomeSet KButil_Build_GenomeSet
Summary
genomes in output set Clade_K.GenomeSet: 5
Output from Build GenomeSet - v1.0.1
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/33233
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 5m 10s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_A_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_A_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_A_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_A_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 3m 33s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_B_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_B_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_B_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_B_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 6m 44s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_C_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_C_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_C_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_C_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 7m 43s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_D_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_D_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_D_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_D_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 5m 2s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_E_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_E_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_E_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_E_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 3m 13s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_F_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_F_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_F_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_F_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 6m 9s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_G_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_G_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_G_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_G_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 4m 5s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_H_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_H_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_H_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_H_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 4m 4s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_I_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_I_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_I_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_I_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 4m 21s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_J_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_J_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_J_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_J_plus_20_RefSeq_prox.SpeciesTree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 4m 36s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • Clade_K_plus_20_RefSeq_prox.SpeciesTree.newick
  • Clade_K_plus_20_RefSeq_prox.SpeciesTree-labels.newick
  • Clade_K_plus_20_RefSeq_prox.SpeciesTree.png
  • Clade_K_plus_20_RefSeq_prox.SpeciesTree.pdf

10b2. Place Genomes into Phylogenetic Context with Phylum Exemplars

Run the Build Microbial SpeciesTree App to include Phylum Exemplars in the Species Tree.

Note: You can add Phylum exemplars for just Bacterial, just Archaeal, or both. Since our MAGs do not include Archaea, we will add just Bacterial exemplars.
Note: We could have also included the GenomeSets with the RefSeq proximal genomes as "additional genomes" to place into the global species tree.
Build Species Tree for your Microbial Genomes, optionally including Tree Skeleton of Phylum Exemplars
This app completed without errors in 34m 44s.
Objects
Created Object Name Type Description
37AB-MAGs_with_phylum_skeleton.SpeciesTree Tree 37AB MAGs in Phylum Context
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • 37AB-MAGs_with_phylum_skeleton.SpeciesTree.newick
  • 37AB-MAGs_with_phylum_skeleton.SpeciesTree-labels.newick
  • 37AB-MAGs_with_phylum_skeleton.SpeciesTree.png
  • 37AB-MAGs_with_phylum_skeleton.SpeciesTree.pdf
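
The downloadable Newick files can be inspected outside KBase, for example to list which leaves of a tree are MAGs and which are reference genomes. The sketch below is a deliberately simple leaf extractor, not a full Newick parser: it assumes unquoted leaf labels, and it assumes that MAG leaves are labeled with names starting with "Bin." (the actual labels in the exported files may differ). The toy tree and the reference label in it are made up for illustration.

```python
import re


def newick_leaves(newick):
    """Return leaf labels from a Newick string (unquoted labels only).

    A leaf label is taken to be any name that directly follows "(" or ",";
    internal node labels and support values follow ")" and are skipped.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def split_mags(leaves, mag_prefix="Bin."):
    """Separate MAG leaves from other leaves by a name prefix (assumption)."""
    mags = [name for name in leaves if name.startswith(mag_prefix)]
    others = [name for name in leaves if not name.startswith(mag_prefix)]
    return mags, others


# Toy tree with hypothetical labels, branch lengths, and support values.
toy_tree = "((Bin.059:0.10,Bin.042:0.12)0.98:0.05,(GCF_ref1:0.20,Bin.003:0.15)0.87:0.07);"
leaves = newick_leaves(toy_tree)
mags, references = split_mags(leaves)
print(mags, references)
```

For anything beyond quick inspection, a dedicated tree library would be a safer choice, since real Newick files may contain quoted labels or comments that this pattern does not handle.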

10c. Functional Classification of MAGs

We will also want to get a sense of the metabolic roles of our MAGs in the Microbial Community. This can be accomplished by searching for functional marker genes. The Annotate and Distill Assemblies with DRAM App performs this search and summarization of functional markers.

Note: Currently, the DRAM App can run on either Assemblies or Genomes. In this example, we will provide it the AssemblySet of Bins as input. Both versions of the App produce Genome objects with the DRAM annotations included with the genes in the object.
Annotate your assembly with DRAM. Annotations will then be distilled to create an interactive functional summary per assembly.
This app completed without errors in 21h 5m 29s.
Objects
Created Object Name Type Description
Bin.003.fasta_assembly_DRAM Genome Annotated Genome
Bin.004.fasta_assembly_DRAM Genome Annotated Genome
Bin.005.fasta_assembly_DRAM Genome Annotated Genome
Bin.011.fasta_assembly_DRAM Genome Annotated Genome
Bin.013.fasta_assembly_DRAM Genome Annotated Genome
Bin.014.fasta_assembly_DRAM Genome Annotated Genome
Bin.018.fasta_assembly_DRAM Genome Annotated Genome
Bin.020.fasta_assembly_DRAM Genome Annotated Genome
Bin.021.fasta_assembly_DRAM Genome Annotated Genome
Bin.025.fasta_assembly_DRAM Genome Annotated Genome
Bin.028.fasta_assembly_DRAM Genome Annotated Genome
Bin.029.fasta_assembly_DRAM Genome Annotated Genome
Bin.033.fasta_assembly_DRAM Genome Annotated Genome
Bin.037.fasta_assembly_DRAM Genome Annotated Genome
Bin.039.fasta_assembly_DRAM Genome Annotated Genome
Bin.041.fasta_assembly_DRAM Genome Annotated Genome
Bin.042.fasta_assembly_DRAM Genome Annotated Genome
Bin.043.fasta_assembly_DRAM Genome Annotated Genome
Bin.044.fasta_assembly_DRAM Genome Annotated Genome
Bin.047.fasta_assembly_DRAM Genome Annotated Genome
Bin.052.fasta_assembly_DRAM Genome Annotated Genome
Bin.053.fasta_assembly_DRAM Genome Annotated Genome
Bin.056.fasta_assembly_DRAM Genome Annotated Genome
Bin.057.fasta_assembly_DRAM Genome Annotated Genome
Bin.058.fasta_assembly_DRAM Genome Annotated Genome
Bin.059.fasta_assembly_DRAM Genome Annotated Genome
Bin.061.fasta_assembly_DRAM Genome Annotated Genome
Bin.062.fasta_assembly_DRAM Genome Annotated Genome
Bin.063.fasta_assembly_DRAM Genome Annotated Genome
Bin.065.fasta_assembly_DRAM Genome Annotated Genome
Bin.067.fasta_assembly_DRAM Genome Annotated Genome
Bin.076.fasta_assembly_DRAM Genome Annotated Genome
Bin.077.fasta_assembly_DRAM Genome Annotated Genome
Bin.078.fasta_assembly_DRAM Genome Annotated Genome
Bin.080.fasta_assembly_DRAM Genome Annotated Genome
Bin.098.fasta_assembly_DRAM Genome Annotated Genome
37AB-metaSPAdes-DASTool-HQ_90-5-DRAM.GenomeSet GenomeSet DRAM annotations of 37AB HQ Bins
Summary
Here are the results from your DRAM run.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • annotations.tsv - DRAM annotations in a tab-separated table format
  • genes.fna - Genes as nucleotides predicted by DRAM with brief annotations
  • genes.faa - Genes as amino acids predicted by DRAM with brief annotations
  • genes.gff - GFF file of all DRAM annotations
  • rrnas.tsv - Tab separated table of rRNAs as detected by barrnap
  • trnas.tsv - Tab separated table of tRNAs as detected by tRNAscan-SE
  • genbank.tar.gz - Compressed folder of output genbank files
  • product.tsv - DRAM product in tabular format
  • metabolism_summary.xlsx - DRAM metabolism summary tables
  • genome_stats.tsv - DRAM genome statistics table
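
Once downloaded, the tab-separated annotations table can be summarized per MAG with a few lines of code. The sketch below tallies rows per genome and annotation rank. It assumes the table has a "fasta" column naming the source assembly and a "rank" column holding DRAM's annotation confidence letter; check these column names against the header of your own file, since they may differ between DRAM versions. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def tally_by_genome(handle, genome_col="fasta", value_col="rank"):
    """Count rows per (genome, value) pair in a tab-separated table.

    genome_col and value_col are assumptions about the DRAM annotations.tsv
    header; pass different names if your file uses other columns.
    """
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[genome_col]][row[value_col]] += 1
    return counts


# Invented sample rows standing in for a downloaded annotations.tsv.
SAMPLE_TSV = (
    "gene_id\tfasta\trank\n"
    "g1\tBin.003\tA\n"
    "g2\tBin.003\tC\n"
    "g3\tBin.003\tC\n"
    "g4\tBin.004\tE\n"
)

counts = tally_by_genome(io.StringIO(SAMPLE_TSV))
for genome, ranks in sorted(counts.items()):
    print(genome, dict(ranks))
```

To run this on the real file, replace the io.StringIO sample with open("annotations.tsv", newline="").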

10c2. HMMER profiling and gene set extraction

In addition to profiling functional suites of interest, HMMER scans of the MAGs can capture explicit gene sets into FeatureSet objects for downstream analysis.

Other, more targeted, gene family model annotation approaches yield functional assignments that can be used to profile collections of genomes. HMM-based gene family suites on KBase that can be scanned with HMMER include the dbCAN2 collection of hidden Markov models (HMMs), built from the Carbohydrate Active Enzyme (CAZy) database. These approaches additionally output the genes found for each gene family as FeatureSet objects, which can in turn be used in phylogenomics (e.g., the “Build a Gene Tree” App) and other comparative analysis tools in KBase. For example, with these Compost MAGs, where lignocellulose degradation gene families are one of the primary areas of interest, we can scan with the dbCAN2 models developed for the CAZy domain families.

Search for matches to dbCAN HMMs of CAZy carbohydrate active enzyme families using HMMER 3
This app completed without errors in 11m 8s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • HMMER_dbCAN_Search.TAB.zip
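
HMMER's per-target tabular output (the --tblout format of hmmsearch) is whitespace-delimited, with the sequence name in the first column, the profile name in the third, and the full-sequence E-value in the fifth. Assuming the tables inside the downloaded zip follow that format (not verified here), hits could be grouped by CAZy family as sketched below. The function name, the sample rows, and the E-value cutoff are all hypothetical and are not the thresholds used by the KBase App.

```python
from collections import defaultdict

# Invented rows in hmmsearch --tblout layout: target, accession, query,
# accession, full-sequence E-value, score, bias, then best-domain and
# domain-count columns, ending with a free-text description.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
gene_0001 - GH5 - 1.2e-45 155.3 0.1 1.5e-45 155.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
gene_0002 - GH5 - 3.0e-02 12.1 0.0 4.0e-02 11.8 0.0 1.0 1 0 0 1 1 0 0 hypothetical protein
gene_0003 - CBM50 - 8.8e-20 70.4 0.3 9.1e-20 70.3 0.3 1.0 1 0 0 1 1 1 1 hypothetical protein
"""


def hits_by_family(text, max_evalue=1e-15):
    """Group target genes by HMM family, keeping hits at or below max_evalue.

    max_evalue is an illustrative cutoff only.
    """
    families = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(None, 18)  # 18 fixed columns, then description
        target, family, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            families[family].append(target)
    return dict(families)


family_hits = hits_by_family(SAMPLE_TBLOUT)
print(family_hits)
```

Each value in the resulting dictionary is the list of genes that would populate one per-family gene set, analogous to the FeatureSet objects described above.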

Summarize GenomeSet

GenomeSets of MAGs, and other Genome types, can be summarized in a table, which is useful for selecting a subset for additional analyses. The table reports standard genome characteristics, such as total length in base pairs, number of contigs, N50 statistic, G+C%, and a count of protein-encoding genes (CDS). tRNA and rRNA counts are also included, which is useful for determining whether a given MAG meets the MIMAG standard [41]. Completeness and contamination scores are also available for this purpose. Unfortunately, when the read libraries used in assembly consist only of current short-read sequencing technologies, in conjunction with the metagenome assemblers described here, the duplication of the ribosomal RNA genes usually precludes assembly of the longer ribosomal RNA genes. It is therefore common that a MAG with high completeness and low contamination still lacks the 16S gene required to meet the MIMAG standard. Nonetheless, for analyses based on the protein-coding complement of the MAG, it may still prove useful.
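
To make the MIMAG criteria concrete, the published draft-quality tiers can be expressed as a short function over the columns of the summary table: high-quality drafts require more than 90% completeness, less than 5% contamination, the 5S, 16S, and 23S rRNA genes, and at least 18 tRNAs; medium-quality drafts require at least 50% completeness and less than 10% contamination. This is a sketch of those thresholds only; the function and argument names are hypothetical and do not correspond to column names in GenomeSet_summary.tsv.

```python
def mimag_tier(completeness, contamination, has_5s, has_16s, has_23s, trna_count):
    """Classify a MAG into a MIMAG draft-quality tier.

    completeness and contamination are percentages (e.g. from CheckM);
    the rRNA flags and tRNA count come from the genome annotation.
    """
    has_all_rrnas = has_5s and has_16s and has_23s
    if completeness > 90 and contamination < 5 and has_all_rrnas and trna_count >= 18:
        return "high-quality draft"
    if completeness >= 50 and contamination < 10:
        return "medium-quality draft"
    if contamination < 10:
        return "low-quality draft"
    return "below MIMAG draft thresholds"


# A MAG with excellent completeness and contamination scores but no
# assembled 16S gene does not reach the high-quality tier.
print(mimag_tier(95.0, 2.0, has_5s=True, has_16s=False, has_23s=True, trna_count=40))
print(mimag_tier(95.0, 2.0, has_5s=True, has_16s=True, has_23s=True, trna_count=40))
```

The first call illustrates the situation described above: a short-read MAG can pass the completeness and contamination thresholds and still be classified as medium quality because the 16S gene failed to assemble.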

Taxonomy and a basic bioelement active gene family summary can also be included in the summary table for a more thorough overview of the Genomes in the GenomeSet. If the Genome objects have already been taxonomically classified with GTDB-Tk, then that taxonomy is included in the table. Options are provided for adding CheckM completeness and contamination scores, as well as bioelement enzyme gene family summaries using the MicroTrait Bioelement HMM suite (Karaoz U and Brodie EL. "microTrait: a toolset for a trait-based representation of microbial genomes". https://github.com/ukaraoz/microtrait ).

View Genome summaries within a GenomeSet
This app completed without errors in 1h 22m 31s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/33233
  • GenomeSet_summary.tsv
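
Two of the standard characteristics reported in the summary table, N50 and G+C%, can be computed directly from the contigs of an assembly. The Python below is a minimal sketch using the usual definitions (N50 as the length of the contig at which the running total of length-sorted contigs first reaches half the assembly length); it is not the code the KBase App runs, and the function names are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the length L such that contigs of
    length >= L together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def gc_percent(sequences):
    """Return G+C content as a percentage of all bases in the sequences."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    gc = sum(seq.upper().count("G") + seq.upper().count("C")
             for seq in sequences)
    return 100.0 * gc / total


# Toy values for illustration only.
print(n50([100, 200, 300, 400]))     # 300
print(gc_percent(["GGCC", "AATT"]))  # 50.0
```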

Summary and Future Directions

This Narrative Tutorial covers how to generate annotated genomes and species predictions from raw metagenomic reads. Taxonomic abundance can be generated based on protein similarity from the raw reads using Kaiju or from annotated genome similarity to reference genomes through the creation of species trees.

Genome extraction and species prediction are just the beginning of how metagenomic samples can be analyzed within KBase. Annotated genomes can be used for metabolic modeling, comparative phylogenomics, functional profiling, and more.

Reference Literature

  1. Wu YW, Higgins B, Yu C, Reddy AP, Ceballos S, Joh LD, Simmons BA, Singer SW, VanderGheynst JS. Ionic Liquids Impact the Bioenergy Feedstock-Degrading Microbiome and Transcription of Enzymes Relevant to Polysaccharide Hydrolysis. mSystems. 2016 Dec 13;1(6). pii: e00120-16. eCollection 2016 Nov-Dec. doi:10.1128/mSystems.00120-16 https://www.ncbi.nlm.nih.gov/pubmed/27981239
  2. Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc and https://github.com/s-andrews/FastQC
  3. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170 http://www.ncbi.nlm.nih.gov/pubmed/24695404
  4. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7: 11257. doi:10.1038/ncomms11257 http://www.ncbi.nlm.nih.gov/pubmed/27071849
  5. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385 http://www.ncbi.nlm.nih.gov/pubmed/21961884
  6. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017 May;27(5):824-834. doi: 10.1101/gr.213959.116. https://www.ncbi.nlm.nih.gov/pubmed/28298430
  7. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31: 1674–1676. doi:10.1093/bioinformatics/btv033 http://www.ncbi.nlm.nih.gov/pubmed/25609793
  8. Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420–1428. doi:10.1093/bioinformatics/bts174 https://www.ncbi.nlm.nih.gov/pubmed/22495754
  9. Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32: 605–607. doi:10.1093/bioinformatics/btv638 https://www.ncbi.nlm.nih.gov/pubmed/26515820
  10. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26 https://microbiomejournal.biomedcentral.com/articles/10.1186/2049-2618-2-26
  11. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25: 1043–1055. doi:10.1101/gr.186072.114 http://genome.cshlp.org/content/25/7/1043.long
  12. Brettin T, Davis J, Disz T et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 5, 8365 (2015). doi: 10.1038/srep08365 https://www.nature.com/articles/srep08365
  13. Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, Volume 36, Issue 6, 15 March 2020, Pages 1925–1927, doi: 10.1093/bioinformatics/btz848 https://academic.oup.com/bioinformatics/article/36/6/1925/5626182
  14. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. Poon AFY, editor. PLoS ONE. 2010;5: e9490. doi:10.1371/journal.pone.0009490 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/
  15. Shaffer M, Borton MA, et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Research, Volume 48, Issue 16, 18 September 2020, Pages 8883–8900, doi: 10.1093/nar/gkaa621 https://academic.oup.com/nar/article/48/16/8883/5884738

Known Issues & Update History

Update History

09-16-2020

  • Added MetaBAT2, CONCOCT, DAS-Tool, Filter Contigs with CheckM, GTDB-Tk, Build Microbial SpeciesTree, and DRAM Apps.
  • Reran most Apps with latest version.

02-26-2018

  • Initial draft of tutorial built

Feedback & Helpdesk

Was this Narrative helpful? Please provide feedback on this tutorial: https://forms.gle/Di3riXpoF1FLnk7J7

If you have a question about one of our apps, need to report a bug or have another system-related query, please join our Help Board and post a ticket. Learn about how to do this here: http://kbase.us/help-board.

Released Apps

  1. Annotate and Distill Assemblies with DRAM
    • DRAM source code
    • DRAM documentation
    • DRAM publication
  2. Annotate Multiple Microbial Assemblies with RASTtk - v1.073
    • [1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75
    • [2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206–D214. doi:10.1093/nar/gkt1226
    • [3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365
    • [4] Kent WJ. BLAT – The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656–664. doi:10.1101/gr.229202
    • [5] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389
    • [6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955–964.
    • [7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793–803. doi:10.1007/s00792-012-0482-8
    • [8] Meyer F, Overbeek R, Rodriguez A. FIGfams: yet another set of protein families. Nucleic Acids Res. 2009;37: 6643–6654. doi:10.1093/nar/gkp698
    • [9] van Belkum A, Sluijter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176–1179.
    • [10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120
    • [11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119
    • [12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673–679. doi:10.1093/bioinformatics/btm009
    • [13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406
  3. Assemble Reads with IDBA-UD - v1.1.3
    • Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420–1428. doi:10.1093/bioinformatics/bts174
  4. Assemble Reads with MEGAHIT v1.2.9
    • Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31: 1674–1676. doi:10.1093/bioinformatics/btv033
  5. Assemble Reads with metaSPAdes - v3.15.3
    • Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27: 824–834. doi:10.1101/gr.213959.116
    • Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. 2020 Jun;70(1):e102. doi: 10.1002/cpbi.102.
  6. Assess Genome Quality with CheckM - v1.0.18
    • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114
    • CheckM source:
    • Additional info:
  7. Assess Read Quality with FastQC - v0.11.9
    • FastQC source: Bioinformatics Group at the Babraham Institute, UK.
  8. Bin Contigs using MaxBin2 - v2.2.4
    • Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32: 605–607. doi:10.1093/bioinformatics/btv638
    • Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26
    • Maxbin2 source:
    • Maxbin source:
  9. Build GenomeSet - v1.7.6
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  10. Classify Microbes with GTDB-Tk - v1.7.0
    • Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks, GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database, Bioinformatics, Volume 36, Issue 6, 15 March 2020, Pages 1925–1927. DOI: https://doi.org/10.1093/bioinformatics/btz848
    • Parks, D., Chuvochina, M., Waite, D. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018). DOI: https://doi.org/10.1038/nbt.4229
    • Parks DH, Chuvochina M, Chaumeil PA, Rinke C, Mussig AJ, Hugenholtz P. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol. 2020;10.1038/s41587-020-0501-8. DOI:10.1038/s41587-020-0501-8
    • Rinke C, Chuvochina M, Mussig AJ, Chaumeil PA, Davín AA, Waite DW, Whitman WB, Parks DH, and Hugenholtz P. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat Microbiol. 2021 Jul;6(7):946-959. DOI:10.1038/s41564-021-00918-8
    • Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11:538. Published 2010 Oct 30. doi:10.1186/1471-2105-11-538
    • Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9(1):5114. Published 2018 Nov 30. DOI:10.1038/s41467-018-07641-9
    • Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. Published 2010 Mar 8. DOI:10.1186/1471-2105-11-119
    • Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3):e9490. Published 2010 Mar 10. DOI:10.1371/journal.pone.0009490 link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/
    • Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195. DOI:10.1371/journal.pcbi.1002195
  11. Classify Taxonomy of Metagenomic Reads with Kaiju - v1.7.3
    • Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7: 11257. doi:10.1038/ncomms11257
    • Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385
    • Kaiju Homepage:
    • Kaiju DBs from:
    • Github for Kaiju:
    • Krona homepage:
    • Github for Krona:
  12. Compare Assembled Contig Distributions - v1.1.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  13. Extract Bins as Assemblies from BinnedContigs - v1.0.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  14. Filter Bins by Quality with CheckM - v1.0.18
    • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114
    • CheckM source:
    • Additional info:
  15. Import FASTQ/SRA File as Reads from Staging Area
    no citations
  16. Insert Set of Genomes Into SpeciesTree - v2.2.0
    • Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One. 2010;5: e9490. doi:10.1371/journal.pone.0009490
  17. Merge Reads Libraries - v1.0.1
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  18. MetaBAT2 Contig Binning - v1.7
    • Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3: e1165. doi:10.7717/peerj.1165
    • MetaBAT2 source:
  19. Search with dbCAN2 HMMs of CAZy families - v10
    • Eddy SR. Accelerated Profile HMM Searches. PLOS Computational Biology. 2011;7: e1002195. doi:10.1371/journal.pcbi.1002195
    • Huang L, Zhang H, Wu P, Entwistle S, Li X, Yohe T, Yi H, Yang Z, Yin Y. dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation. Nucleic Acids Research. 2018;46: D516-D521. doi:10.1093/nar/gkx894
    • HMMER v3.3 source:
  20. Trim Reads with Trimmomatic - v0.36
    • Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170

Apps in Beta

  1. Bin Contigs using CONCOCT - v1.1
    • Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C. Binning metagenomic contigs by coverage and composition. Nature Methods. 2014;11: 1144-1146. doi:10.1038/nmeth.3103
    • CONCOCT source:
  2. Build Microbial SpeciesTree - v1.6.0
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  3. Optimize Bacterial or Archaeal Binned Contigs using DAS Tool - v1.1.2
    • Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology. 2018; 3(7): 836-843. doi:10.1038/s41564-018-0171-1
    • DAS_Tool source:
    • Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119
    • Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421
    • Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12: 59-60. doi:10.1038/nmeth.3176
    • Pullseq:
    • R: A Language and Environment for Statistical Computing:
    • Ruby: A Programmer's Best Friend:
  4. Summarize GenomeSet - v1.8.0
    no citations