Overview

KBase has powerful tools for extracting microbial genomes from metagenomes and for performing phylogenomic analysis and metabolic modeling. These tools can be used to predict key media ingredients for isolating uncultured members of microbiomes. This process depends on high-quality genomes extracted from metagenomic assemblies, and KBase also has a tool for assessing genome quality.

Below we identify growth factors for a myxobacterium ("slime bacterium") yet to be isolated from the rhizosphere of Miscanthus × giganteus (a hybrid "silvergrass"), cultivated at the Kellogg Biological Station in Michigan. Data was transferred with Globus from [JGI-IMG].

Silver and Gold Narrative Set

This narrative and the "KBase Gold Case Study: Can you find Delftia?" make up the Silver and Gold Narrative Set for teaching metagenomics concepts to students in the BIT 477/577 course at North Carolina State University.

Determining Media Formulation Requirements for Isolation of Microbiome Constituents

This tutorial guides the user through extracting and annotating high-quality genomes from metagenomic data, performing phylogenomic analysis, building a metabolic model, and using these results to predict nutrient requirements for growth and isolation of the corresponding microbes.

Author: Jason M. Whitham (jmwhitha@ncsu.edu)

Main Lessons from this Narrative

Learn how to perform quality control of read libraries
Learn how to predict taxonomic population structure of environmental shotgun reads
Learn how to assemble metagenomes
Learn how to compare quality of assemblies
Learn how to bin metagenomic contigs into putative lineages (metagenome-assembled genomes, or MAGs)
Learn how to assess MAG quality and extract them
Learn how to annotate genes of extracted MAGs
Learn how to place MAGs into a species tree
Learn how to build metabolic models with extracted annotated MAGs and investigate metabolic pathways

Breakdown of Narrative Sections

Read Hygiene
Classify Taxonomy
Assemble Contigs
Compare Contigs
Bin Contigs
Bin Quality Assessment
Extract Individual Assemblies
Annotate Genomes
Find Relatives
Build a Metabolic Model without Gapfilling
Build a Metabolic Model with Gapfilling

A Word on Timing

KBase has a finite number of servers shared by many users, which sometimes results in lengthy queue times. Furthermore, some computations take a long time even when applications are allocated generous amounts of RAM and processors. These limitations prevent us from running several steps in this narrative to completion within a single class period. Lessons will therefore be much like a cooking show: the audience learns how to prepare the dish, sees the food go into the oven, and is shown a fully cooked product a moment later. As on a cooking show, we won't make you "wait for the bake" during class, but we will tell you the expected wait times for when you "try the recipe" yourself.

A precise time for each step cannot be provided since queue and processing times vary. The table below is meant to give you a sense of whether to check your narrative after sending a couple of emails, cooking a meal, going on a day hike, or returning from a long weekend of visiting friends or family. Overall, it will probably take a couple of weeks to complete the whole narrative from beginning to end.

Applications	Typical Wait (Order of Magnitude)
Read Import, Trimming, Quality Check, and Subsampling	Hours
Taxonomy Classification	Hours
Contig Assembly	Days
Assembly Comparison	Minutes
Binning Contigs	Hours
Quality Assessment of Bins	Minutes
Bin Extraction	Minutes
Microbial Assembly Annotation	Minutes
Genome Insertion into Species Tree	Minutes
Metabolic Model Build	Minutes

Links to KBase Applications for More Information

1. Read Hygiene

Read hygiene means checking the quality of your data and removing errors where possible. This is important because the colloquialism "junk in, junk out" is true. Before we can check data quality, we need to get data.

I have already imported paired-end reads in FASTQ format. An import application was automatically chosen when I selected the file format in the KBase data import staging area. To get to the staging area, click the arrow pointing to the right in the DATA panel and then click the IMPORT tab. Here, I selected the format "fastq reads" from the drop-down and clicked the upload arrow directly beside it. The same steps can be used to import your own data into the narrative.

If you are interested in using your own data for this narrative, you will first need to load your data into the staging area. If your dataset is large, follow the guide to transferring large datasets with Globus. You can also obtain datasets from KBase or from the Joint Genome Institute.

The remaining applications in this tutorial are preconfigured. Each application has its own View Configure tab. As we go through this narrative, practice inserting the same applications below the ones already in the narrative by clicking the arrow pointing to the right in the APPS panel and searching for the application by name. Practice configuring applications by copying the configurations from the prepopulated narrative applications. This experience will help you become familiar with the KBase platform.

Once the reads are uploaded, you can check the quality of the paired-end reads with the FastQC application and improve their quality with the Trimmomatic application. Run FastQC a second time after the reads are processed with Trimmomatic to verify the improvement. You may be surprised by what you find!
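
To make the idea of quality trimming concrete, here is a minimal sketch of a sliding-window trim of the kind Trimmomatic offers. It is a simplified illustration, not Trimmomatic's actual implementation: the function names, the default window size of 4, the threshold of 15, and the Phred+33 encoding are all assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 encoding assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_avg_quality=15):
    """Cut a read at the start of the first window whose mean quality is too low.

    Reads shorter than the window are returned unchanged in this sketch.
    """
    scores = phred_scores(quality_string)
    for start in range(len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_avg_quality:
            # Keep only the bases before the failing window.
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


if __name__ == "__main__":
    # 'I' encodes Phred 40 (high quality); '#' encodes Phred 2 (low quality).
    trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
    print(trimmed_seq, trimmed_qual)
```

Running FastQC before and after trimming, as described above, is the practical way to confirm that low-quality read tails like the one in this example have been removed.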

2. Classify Taxonomy

The Kaiju application predicts microbial composition from similarities between the protein sequences encoded by the input reads and a database of protein sequences. That is what you will do in this section of the narrative. Later in the narrative, you will generate a species tree, which predicts microbial phylogeny based on your assembled, extracted, and annotated genome sequences. You can then compare the phylogenetic and taxonomic predictions.

You will assemble the reads in the following step. Unfortunately, KBase assembly applications currently have an upper limit of between 180,263,840 and 240,351,788 paired reads, depending on complexity. KBase developers are working on this problem but haven't yet implemented a solution. For now, the Randomly Subsample Reads application enables us to subsample our reads so that they have a similar composition but are not too deep for the assemblers. [Split Reads into Subsets] is another option. You will use Kaiju to verify that the composition of the reads is similar before and after subsampling.
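
The subsample-then-verify logic can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, reads are held in memory as lists, and the per-taxon read counts stand in for what a classifier such as Kaiju would report. It is not how the KBase applications are implemented.

```python
import random


def subsample_pairs(forward_reads, reverse_reads, n_pairs, seed=42):
    """Randomly keep n_pairs read pairs, keeping each forward read with its mate."""
    if len(forward_reads) != len(reverse_reads):
        raise ValueError("forward and reverse libraries must have the same length")
    if n_pairs > len(forward_reads):
        raise ValueError("cannot sample more pairs than are available")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep = sorted(rng.sample(range(len(forward_reads)), n_pairs))
    return [forward_reads[i] for i in keep], [reverse_reads[i] for i in keep]


def relative_abundance(counts):
    """Convert per-taxon read counts (non-empty, positive total) to fractions."""
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()}


def max_abundance_shift(counts_before, counts_after):
    """Largest absolute change in relative abundance for any taxon."""
    before = relative_abundance(counts_before)
    after = relative_abundance(counts_after)
    taxa = set(before) | set(after)
    return max(abs(before.get(t, 0.0) - after.get(t, 0.0)) for t in taxa)


if __name__ == "__main__":
    fwd = [f"read{i}/1" for i in range(100)]
    rev = [f"read{i}/2" for i in range(100)]
    sub_fwd, sub_rev = subsample_pairs(fwd, rev, 10)
    print(sub_fwd[:3], sub_rev[:3])
    # Hypothetical classifier counts before and after subsampling.
    print(max_abundance_shift({"A": 60, "B": 40}, {"A": 5, "B": 5}))
```

A small maximum shift suggests the subsample has a composition similar to the full library; what counts as "small" is a judgment call for your dataset.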

3. Assemble Contigs

KBase offers several commonly used metagenomic assemblers. You will assemble reads with metaSPAdes, MEGAHIT, and IDBA-UD. It's good to try multiple assemblers since each uses a different algorithm, and no single assembler consistently performs better than the others. Application settings can also be tweaked to improve one output metric at the expense of another. So that the assemblies can be compared, you will configure all assemblers to use a minimum contig length of 1000 bp.
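
To show what a minimum contig length setting does, the sketch below parses FASTA-formatted contigs and discards those shorter than a threshold. The function names are hypothetical and the assemblers apply this filter internally, so this is only an illustration of the effect of the setting.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Keep only contigs that are at least min_length bases long."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    fasta = [">contig1", "ACGT" * 300, ">contig2", "ACGT" * 100, "ACGT" * 100]
    for header, seq in filter_contigs(read_fasta(fasta)):
        print(header, len(seq))
```

Using the same threshold for every assembler keeps statistics such as contig count and total assembled bases comparable across assemblies.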

4. Compare Contigs

KBase has a convenient application for comparing assemblies called Compare Assembled Contig Distribution. Use this application to see which assembly is best for downstream analysis. You are looking for assemblies with more assembled bases and longer contigs, since these are the key factors that affect the quality of the genome(s) you extract from the metagenome (more about this in the next step). The table below summarizes important mathematical values that quantify these key factors.

Value	Definition	Further Explanation
N50	The sequence length of the shortest contig at 50% of the total assembly length.	With contigs ordered from longest to shortest, the contigs at least as long as the N50 contig contain about half of all assembled bases, and the shorter contigs contain the other half.
L50	The smallest number of contigs whose summed length makes up 50% of the total assembly length.	A count of contigs, not the length of a contig or set of contigs.
Nx	The sequence length of the shortest contig at x% of the total assembly length.	Common Nx values are N50, N75, and N90.
Lx	The smallest number of contigs whose summed length makes up x% of the total assembly length.	Common Lx values are L50, L75, and L90.
NG50	The sequence length of the shortest contig at 50% of the known genome length.	An estimated genome length is sometimes used.
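
The definitions in the table translate directly into a short calculation. The sketch below is an illustration with a hypothetical function name; passing `reference_length` switches from Nx/Lx (based on total assembly length) to NGx/LGx (based on a known or estimated genome length).

```python
def nx_lx(contig_lengths, x=50, reference_length=None):
    """Return (Nx, Lx) for a list of contig lengths.

    If reference_length is given, the threshold is x% of that length instead
    of x% of the total assembly length (NGx / LGx). Returns (None, None) when
    the assembly is too short to reach the threshold.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    ordered = sorted(contig_lengths, reverse=True)  # longest contig first
    base = reference_length if reference_length is not None else sum(ordered)
    threshold = base * x / 100
    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            # 'length' is the shortest contig needed to reach the threshold,
            # and 'count' is how many contigs it took to get there.
            return length, count
    return None, None


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # total assembly length = 290
    print(nx_lx(lengths))                        # N50, L50
    print(nx_lx(lengths, x=90))                  # N90, L90
    print(nx_lx(lengths, reference_length=400))  # NG50, LG50
```

For the example lengths, the two longest contigs (80 + 70 = 150) are the first to reach half of the 290 assembled bases, so N50 is 70 and L50 is 2.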

5. Bin Contigs

Having assembled the contigs, the next step is to separate them into bins based on patterns including contig abundance and tetramer (tetranucleotide) frequency. Contigs with similar abundances and tetramer frequencies will theoretically be from the same microbial genome. That is why these bins of contigs are also known as metagenome-assembled genomes (MAGs).

MaxBin2 and MetaBAT2 are two commonly used binning programs with different algorithms. Rather than using just one, test both to see which produces better bins. Neither outperforms the other every time or in every metric. Use a minimum contig length of 1500 bp for a fair comparison, since that is the lowest MetaBAT2 will allow. Optional: Try a minimum contig length of 1000 bp with MaxBin2 to see if bin statistics are improved.
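
The tetramer frequency signal that binners rely on can be illustrated with a short sketch. This is an assumption-laden simplification: the function names are hypothetical, reverse-complement tetramers are not merged, and neither MaxBin2 nor MetaBAT2 is claimed to compute distances this way.

```python
from itertools import product
from math import sqrt


def tetramer_frequencies(sequence):
    """Return the relative frequency of each of the 256 possible tetramers."""
    sequence = sequence.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid tetramers")
    return {kmer: count / total for kmer, count in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Distance between two tetramer profiles; smaller means more similar."""
    return sqrt(sum((freq_a[k] - freq_b[k]) ** 2 for k in freq_a))


if __name__ == "__main__":
    contig_a = tetramer_frequencies("ACGTACGTACGTACGT")
    contig_b = tetramer_frequencies("ACGTACGTTCGTACGT")
    contig_c = tetramer_frequencies("GGGGCCCCGGGGCCCC")
    print(euclidean_distance(contig_a, contig_b))
    print(euclidean_distance(contig_a, contig_c))
```

Contigs with a small distance between their profiles (and similar read coverage) are candidates for the same bin. Real binners combine this composition signal with abundance in a statistical model rather than using a raw distance.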

6. Bin Quality Assessment

The MaxBin2 and MetaBAT2 applications report figures such as the number of bins and the number of binned contigs, but these do not tell you the quality of the generated bins. CheckM is a widely used application for assessing bin quality, and it will help you find a high-quality bin for downstream analyses.
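
Bin quality is usually summarized as completeness and contamination estimated from marker genes expected to occur once per genome. The sketch below shows the basic idea only. It is a deliberate simplification and not CheckM's algorithm (CheckM uses lineage-specific, collocated marker sets); the function names, marker names, and the 90% / 5% thresholds are assumptions for illustration.

```python
def bin_quality(marker_copies, expected_markers):
    """Estimate (completeness %, contamination %) from single-copy marker counts.

    marker_copies maps marker name -> number of copies found in the bin.
    expected_markers lists the markers expected exactly once per genome.
    """
    n_expected = len(expected_markers)
    if n_expected == 0:
        raise ValueError("expected_markers must not be empty")
    present = sum(1 for m in expected_markers if marker_copies.get(m, 0) >= 1)
    # Every copy beyond the first suggests sequence from another genome.
    extra = sum(max(0, marker_copies.get(m, 0) - 1) for m in expected_markers)
    return 100 * present / n_expected, 100 * extra / n_expected


def is_high_quality(completeness, contamination,
                    min_completeness=90.0, max_contamination=5.0):
    """Apply assumed thresholds for calling a bin high quality."""
    return completeness >= min_completeness and contamination <= max_contamination


if __name__ == "__main__":
    expected = ["marker1", "marker2", "marker3", "marker4"]  # hypothetical names
    found = {"marker1": 1, "marker2": 2, "marker3": 1}
    completeness, contamination = bin_quality(found, expected)
    print(completeness, contamination, is_high_quality(completeness, contamination))
```

In this toy example one marker is missing and one is duplicated, so the bin would be neither complete nor clean enough to pass the assumed thresholds.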

7. Extract Individual Assemblies

Extract Bins as Assemblies from BinnedContigs performs the simple task of creating a KBase object from a specified bin or bins for input into downstream applications.

8. Annotate Genomes

Whether a genome is fragmented into many contigs or is a contiguous circular chromosome, its genes can and must be annotated by the Rapid Annotation using Subsystem Technology (RAST) pipeline before a metabolic model can be built in KBase. To do this, submit the extracted high-quality MAG to the Annotate Multiple Microbial Assemblies application.

9. Find Relatives

Our mystery microbe is a myxobacterium

Phylogenomic analysis places the MAG between one cluster with Stigmatella aurantiaca, Hyalangium minutum, Cystobacter fuscus, Archangium gephyra, Corallococcus coralloides, Myxococcus stipitatus, Myxococcus xanthus, and Myxococcus fulvus and another cluster with Vulgatibacter incomptus, Anaeromyxobacter sp. Fw109-5, and Anaeromyxobacter dehalogenans. Placement between these evolutionary clusters helps us anticipate that the corresponding microbe will have shared, similar, or intermediate phenotypes relative to the microbes in these clusters.

Looking back at our taxonomic classification of reads, we find that a large portion of unassembled reads are classified as the myxobacterium Sorangium cellulosum. This myxobacterium does not even appear as a close relative of our myxobacterium in our phylogenomic analysis. Furthermore, Sorangium cellulosum has one of the largest bacterial genomes sequenced to date, at 13,033,779 base pairs. Bin analysis with CheckM and annotation with RAST suggest the genome of our myxobacterium is nearly complete at approximately 5 million base pairs. Taxonomic classification of shotgun reads was useful for verifying that read subsampling preserved the distribution of the original set, so that the subsample could be used for downstream analysis, but it was not an accurate way of characterizing the species present in the microbiome.

10. Build a Metabolic Model without Gapfilling

Below are a couple of formulations used by the Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH (Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH) for growth of myxobacteria. Either vitamin B12 (cobalamin) or the derivative cyanocobalamin must be included, because myxobacteria cannot biosynthesize vitamin B12. Verify this by building a genome-scale metabolic model with the Build Metabolic Model application with the gapfilling option deselected. Navigate to the KEGG porphyrin and chlorophyll metabolism pathway map to see whether the pathway is missing.

11. Build a Metabolic Model with Gapfilling

Draft MAGs are often made up of tens or hundreds of contigs. While the contigs of high-quality MAGs will contain most of the core, universal genes, some genes will be missing. If gapfilling is not used, the absence of metabolic genes from the contigs will show up as gaps in the metabolic pathways of the model.

Gapfilling adds the reactions that were missing from a pathway back to the metabolic model. In general, an optimization algorithm identifies the minimal set of reactions that must be added to a model so that it can produce all biomass components. Details can be found [here]. Use gapfilling to see what else is predicted to be necessary for growth of the myxobacterium.
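
The "minimal set of reactions" idea can be illustrated with a toy example. The sketch below is not the optimization used by KBase: it replaces flux-based optimization with a simple reachability test and a brute-force search over candidate reactions, and every reaction ID, metabolite name, and function name in it is hypothetical.

```python
from itertools import combinations


def producible(reactions, media):
    """Return all metabolites reachable from the media through the reactions.

    reactions maps reaction ID -> (list of substrates, list of products).
    A reaction can fire once all of its substrates are available.
    """
    available = set(media)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


def gapfill(model_reactions, candidate_reactions, media, biomass):
    """Find the smallest set of candidate reactions that makes biomass producible.

    Tries sets of increasing size, so the first hit is minimal. This search is
    exponential and only suitable for toy examples. Returns None if no set works.
    """
    for size in range(len(candidate_reactions) + 1):
        for added in combinations(sorted(candidate_reactions), size):
            trial = dict(model_reactions)
            trial.update({rid: candidate_reactions[rid] for rid in added})
            if set(biomass) <= producible(trial, media):
                return list(added)
    return None


if __name__ == "__main__":
    model = {"r1": (["glucose"], ["pyruvate"])}
    candidates = {
        "g1": (["pyruvate"], ["acetyl-CoA"]),
        "g2": (["acetyl-CoA"], ["citrate"]),
        "g3": (["glucose"], ["spermidine"]),
    }
    print(gapfill(model, candidates, media=["glucose"], biomass=["citrate"]))
```

In this toy network, two candidate reactions are needed to connect the medium to the biomass component, and the unrelated third candidate is left out. Each reaction that gapfilling adds points to either a gene missed during assembly or annotation, or a nutrient the organism may need from its medium.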

Gapfilling Reactions Revealed Factors for Growth

Succinate dehydrogenase is missing in the TCA cycle. TCA cycle compounds - citrate, malate, succinate, and others - were found to stimulate growth of Myxococcus xanthus [19].

A biosynthesis step for the polyamine spermidine was gapfilled in this model. Spermidine at 125 µg/mL was found to be stimulatory for Myxococcus xanthus [19].

Gapfilling is not always helpful, though. Valine, leucine, and isoleucine are building blocks of proteins and therefore critical to biological processes. Biosynthesis pathways for these amino acids are missing in the model and were not gapfilled. These amino acids must be added to media in a purified form or in a complex form like yeast extract.

There are several gapfilled reactions for biosynthesis of the coenzyme ubiquinone, which is involved in respiration in many organisms. No supplementation of media is required, though, since myxobacteria generally use other quinones, including MK-8 [20].

Summary and Future Directions

This narrative tutorial covers how to use shotgun metagenomic data to predict the media ingredients necessary for isolation and growth of microbiome members whose MAGs are high-quality.

Once a microbe is isolated, KBase applications including flux balance analysis can be used in conjunction with growth experiments to refine media formulations for various purposes, including optimization of growth and of fermentation product yields.

References

[1] Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[2] Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163 https://www.nature.com/articles/nbt.4163
[3] Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170 http://www.ncbi.nlm.nih.gov/pubmed/24695404
[4] Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7: 11257. doi:10.1038/ncomms11257 http://www.ncbi.nlm.nih.gov/pubmed/27071849
[5] Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385 http://www.ncbi.nlm.nih.gov/pubmed/21961884
[6] Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Research. 2017;27: 824–834. doi:10.1101/gr.213959.116 https://www.ncbi.nlm.nih.gov/pubmed/28298430
[7] Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31: 1674–1676. doi:10.1093/bioinformatics/btv033 http://www.ncbi.nlm.nih.gov/pubmed/25609793
[8] Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420–1428. doi:10.1093/bioinformatics/bts174 https://www.ncbi.nlm.nih.gov/pubmed/22495754
[9] Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32: 605–607. doi:10.1093/bioinformatics/btv638 https://www.ncbi.nlm.nih.gov/pubmed/26515820
[10] Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26 https://microbiomejournal.biomedcentral.com/articles/10.1186/2049-2618-2-26
[11] Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3: e1165. doi:10.7717/peerj.1165 https://doi.org/10.7717/peerj.1165
[12] Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25: 1043–1055. doi:10.1101/gr.186072.114 http://genome.cshlp.org/content/25/7/1043.long
[13] Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. Poon AFY, editor. PLoS ONE. 2010;5: e9490. doi:10.1371/journal.pone.0009490 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/
[14] Henry CS, DeJongh M, Best AA, Frybarger PM, Linsay B, Stevens RL. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nature Biotechnology. 2010;28: 977–982. doi:10.1038/nbt.1672 https://www.ncbi.nlm.nih.gov/pubmed/20802497
[15] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research. 2014;42: D206–D214. doi:10.1093/nar/gkt1226 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965101/
[16] Latendresse M. Efficiently gap-filling reaction networks. BMC Bioinformatics. 2014;15: 225. doi:10.1186/1471-2105-15-225 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-225
[17] Dreyfuss JM, Zucker JD, Hood HM, Ocasio LR, Sachs MS, Galagan JE. Reconstruction and Validation of a Genome-Scale Metabolic Model for the Filamentous Fungus Neurospora crassa Using FARM. PLOS Computational Biology. 2013;9: e1003126. doi:10.1371/journal.pcbi.1003126 https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003126
[18] Mahadevan R, Schilling CH. The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering. 2003;5: 264–276. https://www.ncbi.nlm.nih.gov/pubmed/14642354
[19] Bretscher AP, Kaiser D. Nutrition of Myxococcus xanthus, a fruiting myxobacterium. Journal of Bacteriology. 1978;133: 763–768. https://jb.asm.org/content/133/2/763.short
[20] Yamamoto E, Muramatsu H, Nagai K. Vulgatibacter incomptus gen. nov., sp. nov. and Labilithrix luteola gen. nov., sp. nov., two myxobacteria isolated from soil in Yakushima Island, and the description of Vulgatibacteraceae fam. nov., Labilitrichaceae fam. nov. and Anaeromyxobacteraceae fam. nov. International Journal of Systematic and Evolutionary Microbiology. 2014;64: 3360–3368. doi:10.1099/ijs.0.063198-0 https://pubmed.ncbi.nlm.nih.gov/25048208/

Created Object Name	Type	Description
trimmed_SilvergrassMBReads_paired	PairedEndLibrary	Trimmed Reads
trimmed_SilvergrassMBReads_unpaired_fwd	SingleEndLibrary	Trimmed Unpaired Forward Reads
trimmed_SilvergrassMBReads_unpaired_rev	SingleEndLibrary	Trimmed Unpaired Reverse Reads

Created Object Name	Type	Description
SilvergrassMB_Annotated_MAG_Gapfilled_MetabolicModel	FBAModel	FBAModel-12 SilvergrassMB_Annotated_MAG_Gapfilled_MetabolicModel
SilvergrassMB_Annotated_MAG_Gapfilled_MetabolicModel.gf.0	FBA	FBA-13 SilvergrassMB_Annotated_MAG_Gapfilled_MetabolicModel.gf.0
