Genome Extraction from Shotgun Metagenome Sequence Data¶

This Tutorial will guide the user through the process of obtaining high-quality genomes and phylogenetic placement from a metagenome assembly.¶


Overview¶

KBase has powerful tools for metabolic modeling and comparative phylogenomics of microbial genomes that can be used for developing mechanistic understanding of functional interactions between species in microbial ecosystems. Essential to this process is obtaining high-quality genomes to annotate, either via cultivation or genome extraction from metagenome assembly. KBase has a suite of microbiome analysis Apps meant to be used in concert. After assembly and binning, high-quality bins are annotated and can then be used in Comparative Phylogenomics analyses (see Narrative here) and Metabolic Reconstruction and Community Interaction Modeling (see Narrative here).

Below we present the processing of two related Compost Enrichment Metagenomes (37A & 37B) [Ionic Liquids Impact the Bioenergy Feedstock-Degrading Microbiome and Transcription of Enzymes Relevant to Polysaccharide Hydrolysis] from the Joint BioEnergy Institute (JBEI).

Authors: Dylan Chivian & Mikayla Clark¶

Main Lessons from this Narrative¶

Learn how to perform Quality Control of read libraries
Learn how to measure taxonomic population structure of environmental shotgun reads
Learn how to assemble metagenomes
Learn how to compare assembly quality
Learn how to bin metagenomic scaffolds into putative lineages (metagenome-assembled genomes, or MAGs)
Learn how to assess MAG quality and extract to individual genome assembly objects
Learn how to annotate genes to obtain KBase genomes which can be used by KBase analysis Apps
Learn how to place MAGs into a reference species tree

A Word on Timing¶

While KBase boasts faster processing and run time on many apps over competitors, please keep in mind that large data sets do take time to analyze. Queue times for metagenomic data sets can appear lengthy during periods of high traffic on the servers. Once through the queue, we hope you enjoy our faster run times of hours and days for complex algorithms over the months it can take using other avenues.

As an example, the table below displays the queue time, run time, and average run time for a selection of apps used in this tutorial.

Note: Queue and run times are from within this Narrative while the average run time is calculated from all jobs run across KBase with that particular app.

App	Queue Time	Run Time	Average Run Time
Kaiju	1s	1h 37m	1h 3m
Assemble with metaSPAdes	11h 8m	7h 13m	9h
Assemble Reads with MEGAHIT	1s	6h 2m	2h 58m
Assemble with IDBA-UD	3h 19m	7h 41m	4h 41m
MaxBin2 Contig Binning	2s	4h 19m	1h 13m

Description of Apps¶

FastQC allows users to check the quality of raw sequence data generated by high-throughput sequencing pipelines. The result legend differentiates among normal (green tick), slightly abnormal (orange exclamation), and very unusual (red cross) results.
Trimmomatic¹ performs a variety of useful trimming tasks for paired- or single-end Illumina reads, improving the overall quality of the data.
Kaiju³,⁵ generates fast and sensitive taxonomic classification for metagenomic reads by comparing sequences to databases of known microbial proteins. It also generates an interactive metagenomic visualization chart.
metaSPAdes⁴ assembles metagenomic reads using the SPAdes assembler.
MEGAHIT² assembles metagenomic reads using the MEGAHIT assembler.
IDBA-UD⁷ assembles paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
Compare Assembled Contig Distributions allows the user to view distributions of contig characteristics for different assembly runs.
MaxBin2 Contig Binning⁹,¹⁰ uses nucleotide composition information, source strain abundance, and phylogenetic marker genes to perform binning through an Expectation-Maximization algorithm.
Assess Genome Quality with CheckM⁶ provides a set of tools for assessing the quality of genomes or metagenomes. It also generates robust estimates of genome completeness and contamination.
Extract Bins as Assemblies from BinnedContigs extracts bins from a BinnedContig dataset as Assembly objects.
Annotate Microbial Assembly annotates a bacterial or archaeal assembly using the RAST (Rapid Annotations using Subsystems Technology) pipeline.
Build GenomeSet allows the user to group Genomes into a GenomeSet.
Insert Set of Genomes into Species Tree⁸ constructs a phylogenetic tree combining the GenomeSet provided by the user with a set of closely related genomes from the KBase list of species.
View Tree displays a SpeciesTree or GeneTree as an image and allows users to download images and NEWICK representations. (Note: This app is in Beta, so the Apps panel must be switched from Released to Beta before it can be added to the Narrative.)

1. Read Hygiene¶

We begin by importing the sets of paired-end reads in FASTQ format for metagenomes 37A and 37B. The import App creates a PairedEndLibrary object that we can then run through FastQC and Trimmomatic to determine and improve the quality of the reads, respectively.

Note: Running FastQC a second time after the reads had been run through Trimmomatic showed marked improvement in their quality. For this reason, the trimmed reads will be used for further analysis.
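As a rough illustration of what fixed-position trimming does to a read, the sketch below removes bases from the start of a read and then truncates it to a maximum length. The values 5 and 140 are inferred from the trimmed object names listed at the end of this tutorial (`headcrop5_crop140`); that reading of the names, the order of the two steps, and the function name `headcrop_then_crop` are all assumptions, and this is not Trimmomatic's own code.

```python
def headcrop_then_crop(sequence, qualities, headcrop=5, crop=140):
    """Drop `headcrop` bases from the 5' end, then keep at most `crop` bases.

    `sequence` and `qualities` are the base and quality strings of one
    FASTQ record. Defaults are assumed from the object names, not confirmed.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality strings must be the same length")
    # Remove the first `headcrop` positions from both strings.
    sequence = sequence[headcrop:]
    qualities = qualities[headcrop:]
    # Truncate what remains to at most `crop` positions.
    return sequence[:crop], qualities[:crop]


# A made-up 150 bp read with uniform quality characters.
read = "ACGTA" * 30
quals = "I" * 150
trimmed_seq, trimmed_qual = headcrop_then_crop(read, quals)
print(len(trimmed_seq), len(trimmed_qual))
```

Reads shorter than the crop length are left at whatever length remains after the head crop.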

2. Classify Taxonomy¶

Our downstream analysis will generate Species Trees of the organisms present in the compost samples. Classifying the taxonomy with Kaiju is important because it predicts the microbial composition based on protein similarities rather than genome assembly and annotation. This prediction gives us an independent reference against which the Species Trees can be compared.

3. Assemble¶

Now that we have cleaned the reads, we can move on to the next step: assembling the reads into contiguous fragments (contigs), thus creating the scaffolding of the genomes present in the samples. KBase offers several metagenomic assemblers and a tool for comparing their output (similar to QUAST). We will run three assembly Apps below.

Note: metaSPAdes only accepts a single library as input, so the App Merge Multiple ReadsLib to One ReadsLib was used to combine the reads from 37A and 37B. Because the combined 37AB reads produced the largest contig and the most contigs over 100,000 bp, it will be used as input for the MEGAHIT and IDBA-UD assemblers.

Note: We will run the MEGAHIT assembler twice using different parameters but the same input data. The first run will use "meta-large" as its preset. This is a setting catered towards large and complex assemblies. The second run will use the "meta-sensitive" preset. This parameter generates a more sensitive assembly but runs slower.

4. Compare Contigs¶

Now that we have six sets of contigs generated from our reads using various assemblers, we can examine them side-by-side to determine the one of highest quality. Using the best assembly will produce more accurate results in downstream analyses.

The table below summarizes the most important values (high N50, low L50, fewest contigs) generated by running Compare Assembled Contig Distributions. The best values in each category have been underlined and italicized.

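N50: the length of the shortest contig in the smallest set of longest contigs that together contain at least 50% of the total assembly length.
L50: the smallest number of contigs whose summed length is at least 50% of the total assembly length (that is, the number of contigs in the set that defines N50).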
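The two definitions above can be sketched in a few lines of Python. This is a minimal illustration on made-up contig lengths, not the code used by Compare Assembled Contig Distributions; the function name `n50_l50` is hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Walk the contigs from longest to shortest, accumulating length
    # until at least half of the total assembly length is covered.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig in the covering set (N50);
            # `count` is how many contigs that set contains (L50).
            return length, count


# Toy assembly: total length 300 bp, so half is 150 bp.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 80 2
```

In the toy example, the two longest contigs (100 + 80 = 180 bp) are the first to cover half of the 300 bp total, so N50 is 80 and L50 is 2. A higher N50 and a lower L50 both indicate that more of the assembly sits in long contigs.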

Assembly	Number of Contigs	Longest Contig (bp)	N50 (bp)	L50	Contigs > 10⁶ bp	Sum Length (bp) of Contigs > 10⁶ bp
37AB_metaSPAdes	27508	1395711	23793	1871	2	2791420
37AB_MEGAHIT_metalarge	32090	1501073	14663	3350	1	1501073
37AB_MEGAHIT_metasensitive	31055	1501101	16426	3008	1	1501101
37AB_IDBA-UD	23276	421547	15901	2717	0	0

Note: The Assembly from running metaSPAdes on the combined library generated the highest quality contigs (37AB_metaSPAdes.contigs). Therefore, it will be used for further analysis.

5. Bin Contigs¶

Having assembled the contigs, the next step is to cluster them into bins, each of which corresponds to a putative population genome. To accomplish this, we will use MaxBin2 Contig Binning.

6. Bin Quality Assessment¶

Quality control is a necessary step at every level of analysis to ensure the highest quality outcome and to avoid error propagation.

Note: From the graphic output below, we see that of the 65 total bins, 28 are both ≥90% complete and ≤2.5% contaminated. These bins (1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 22, 23, 26, 30, 32, 33, 35, 56, 59, 62, and 64) will be used for further analysis.
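The selection rule in the note above (at least 90% complete and at most 2.5% contaminated) can be expressed as a simple filter. The sketch below uses made-up bin IDs and values purely for illustration; they are not CheckM results from this Narrative, and the names `BinQuality` and `select_high_quality` are hypothetical.

```python
from typing import NamedTuple


class BinQuality(NamedTuple):
    bin_id: int
    completeness: float   # percent
    contamination: float  # percent


def select_high_quality(bins, min_completeness=90.0, max_contamination=2.5):
    """Return IDs of bins meeting both thresholds (inclusive)."""
    return [
        b.bin_id
        for b in bins
        if b.completeness >= min_completeness
        and b.contamination <= max_contamination
    ]


# Hypothetical values; real numbers come from the CheckM report.
example_bins = [
    BinQuality(101, 98.2, 0.9),   # passes both thresholds
    BinQuality(102, 93.0, 4.1),   # too contaminated
    BinQuality(103, 71.5, 0.3),   # too incomplete
    BinQuality(104, 90.0, 2.5),   # exactly on both thresholds, passes
]
print(select_high_quality(example_bins))  # [101, 104]
```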

7. Extract Individual Assemblies¶

To use the desired 28 high quality bins in downstream Apps, we need them to be in the form of Assembly objects. This is achieved by running Extract Bins as Assemblies from BinnedContigs.

Note: The default is to extract ALL bins. We must specify the ones that we want.
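When checking that the extraction produced the expected objects, it can help to generate the expected names from the list of selected bins. The sketch below assumes the naming pattern seen in the Created Object table at the end of this tutorial (`Bin.001.fasta_assembly`); the helper name `assembly_name` is hypothetical, and the App itself may name objects differently in other Narratives.

```python
# The 28 high-quality bins selected in the Bin Quality Assessment step.
selected_bins = [
    1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    19, 22, 23, 26, 30, 32, 33, 35, 56, 59, 62, 64,
]


def assembly_name(bin_number):
    """Build the expected Assembly object name for a bin (assumed pattern)."""
    return f"Bin.{bin_number:03d}.fasta_assembly"


expected_names = [assembly_name(n) for n in selected_bins]
print(len(expected_names))
print(expected_names[0], expected_names[-1])
```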

8. Annotate Genomes¶

Since we now have the high quality bins in Assembly object form (and collected into an AssemblySet object), we will use Annotate Multiple Microbial Assemblies to turn them into annotated Genomes using the RAST (Rapid Annotations using Subsystems Technology) pipeline.

Note: If you wish to just do a limited number of annotations, you can run them separately with the Annotate Microbial Assembly App.

Once the high quality bins have all been annotated, we can combine them into a single GenomeSet. The resulting GenomeSet object will be used as input for the next step.

Note: Even though the GenomeSet is labelled Bins001-065, it only consists of the 28 high quality bins.

9. Find Relatives¶

Our new GenomeSet can be used as input for Insert Set of Genomes Into Species Tree, which will give us an initial phylogenetic placement of the bins.

Note: Uncheck 'Copy public genomes to your workspace' because we are not ready to determine which genomes from RefSeq we want to include in downstream comparisons yet.

The current implementation of Insert Set of Genomes Into Species Tree has a tendency to overemphasize proximal genomes at the expense of phylogenetic diversity. Future versions will remedy this shortcoming. In the meantime, we work around it manually to reduce the effect of excessive genome attractors. We will split the bins into five clades based on the initial tree, which we will call A, B, C, D, and E. We will use Build GenomeSet to group the bins into clades.

Clade A: bins 26, 56, 5, and 35
Clade B: bins 22, 30, 8, 9, 15, and 64
Clade C: bins 7, 14, 59, 19, 10, and 13
Clade D: bins 16, 33, 17, 62, 23, 4, 12, 2, 11, and 32
Clade E: bins 1 and 3
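Because the clades are assigned by hand, a quick consistency check is worthwhile: every high-quality bin should appear in exactly one clade. The sketch below encodes the clade lists and the 28 selected bins from this tutorial; the variable names are hypothetical, and this check is not part of any KBase App.

```python
# Clade membership as listed above.
clades = {
    "A": [26, 56, 5, 35],
    "B": [22, 30, 8, 9, 15, 64],
    "C": [7, 14, 59, 19, 10, 13],
    "D": [16, 33, 17, 62, 23, 4, 12, 2, 11, 32],
    "E": [1, 3],
}

# The 28 high-quality bins from the Bin Quality Assessment step.
high_quality_bins = {
    1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    19, 22, 23, 26, 30, 32, 33, 35, 56, 59, 62, 64,
}

assigned = [bin_id for members in clades.values() for bin_id in members]

# Bins listed in more than one clade, bins left out, and bins that
# were never selected as high quality should all be empty.
duplicates = {b for b in assigned if assigned.count(b) > 1}
missing = high_quality_bins - set(assigned)
unexpected = set(assigned) - high_quality_bins

print(len(assigned), duplicates, missing, unexpected)
```

For the lists in this tutorial, all 28 bins are assigned once and the three sets are empty.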

To get more accurate phylogenetic trees, we will rerun Insert Set of Genomes Into Species Tree for each of the five clades.

Note: Again, it will be necessary to uncheck 'Copy public genomes to your workspace'. View Tree (beta) was run to enable users to download the image of the SpeciesTree and the NEWICK representations.

10. Place Genomes into Phylogenetic Context with Phylum Exemplars¶

Run the Build Microbial SpeciesTree App to include Phylum Exemplars in the Species Tree.

Note: Build Microbial SpeciesTree is currently a beta App. To access beta Apps, click the "R" in the upper right corner of the App pane to switch it to "B".

Summary and Future Directions¶

This Narrative Tutorial covers how to generate annotated genomes and species predictions from raw metagenomic reads. Taxonomic abundance can be generated based on protein similarity from the raw reads using Kaiju or from annotated genome similarity to reference genomes through the creation of species trees.

Genome extraction and species prediction are just the beginning of how metagenomic samples can be analyzed within KBase. Annotated genomes can be used for metabolic modeling, comparative phylogenomics, functional profiling, and more.

Reference Literature¶

1. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170 http://www.ncbi.nlm.nih.gov/pubmed/24695404
2. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31: 1674–1676. doi:10.1093/bioinformatics/btv033 http://www.ncbi.nlm.nih.gov/pubmed/25609793
3. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7: 11257. doi:10.1038/ncomms11257 http://www.ncbi.nlm.nih.gov/pubmed/27071849
4. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Research. 2017;27: 824–834. doi:10.1101/gr.213959.116 https://www.ncbi.nlm.nih.gov/pubmed/28298430
5. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385 http://www.ncbi.nlm.nih.gov/pubmed/21961884
6. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25: 1043–1055. doi:10.1101/gr.186072.114 http://genome.cshlp.org/content/25/7/1043.long
7. Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420–1428. doi:10.1093/bioinformatics/bts174 https://www.ncbi.nlm.nih.gov/pubmed/22495754
8. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. Poon AFY, editor. PLoS ONE. 2010;5: e9490. doi:10.1371/journal.pone.0009490 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/
9. Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32: 605–607. doi:10.1093/bioinformatics/btv638 https://www.ncbi.nlm.nih.gov/pubmed/26515820
10. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26 https://microbiomejournal.biomedcentral.com/articles/10.1186/2049-2618-2-26
11. Wu YW, Higgins B, Yu C, Reddy AP, Ceballos S, Joh LD, Simmons BA, Singer SW, VanderGheynst JS. Ionic Liquids Impact the Bioenergy Feedstock-Degrading Microbiome and Transcription of Enzymes Relevant to Polysaccharide Hydrolysis. mSystems. 2016;1(6): e00120-16. doi:10.1128/mSystems.00120-16 https://www.ncbi.nlm.nih.gov/pubmed/27981239

Created Object Name	Type	Description
37A_Trimm_headcrop5_crop140.PElib_paired	PairedEndLibrary	Trimmed Reads
37A_Trimm_headcrop5_crop140.PElib_unpaired_fwd	SingleEndLibrary	Trimmed Unpaired Forward Reads
37A_Trimm_headcrop5_crop140.PElib_unpaired_rev	SingleEndLibrary	Trimmed Unpaired Reverse Reads

Created Object Name	Type	Description
37B_Trimm_headcrop5_crop140.PELib_paired	PairedEndLibrary	Trimmed Reads
37B_Trimm_headcrop5_crop140.PELib_unpaired_fwd	SingleEndLibrary	Trimmed Unpaired Forward Reads
37B_Trimm_headcrop5_crop140.PELib_unpaired_rev	SingleEndLibrary	Trimmed Unpaired Reverse Reads

Created Object Name	Type	Description
extracted_bins.AssemblySet	AssemblySet	Assembly set of extracted assemblies
Bin.001.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.002.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.003.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.004.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.005.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.007.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.008.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.009.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.010.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.011.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.012.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.013.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.014.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.015.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.016.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.017.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.019.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.022.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.023.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.026.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.030.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.032.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.033.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.035.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.056.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.059.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.062.fasta_assembly	Assembly	Assembly object of extracted contigs
Bin.064.fasta_assembly	Assembly	Assembly object of extracted contigs