Generated June 25, 2020

Welcome to the Viral Annotation Pipeline

This pipeline processes a tiny metagenome, identifies viruses using VirSorter, and classifies them with vConTACT2. Along the way there are several intermediary steps that convert files from one format to another. Some of these could be skipped for this particular dataset, but we do them explicitly here so that the workflow generalizes more easily to other datasets.

Table of Contents

  1. Importing Data
  2. Identifying viruses using VirSorter
  3. A bit of KBase maintenance
  4. Converting to assemblies
  5. Using vConTACT2
  6. Conclusion

Importing data

In the next two steps, we'll import two files: a small sample of an already assembled metagenome and a Cyanophage genome. KBase works on data objects generated from files rather than on the files themselves. This increases speed and interoperability between Apps, at the cost of an additional step when bringing data onto the system. Files can be uploaded through a simple drag-and-drop interface or through Globus. A full data upload and download guide can be found at http://kbase.us/data-upload-download-guide/.

Import a FASTA file from your staging area into your Narrative as an Assembly data object
This app completed without errors in 1m 2s.
Objects
Created Object Name Type Description
MetagenomeAssembly Assembly Imported Assembly
Links
Import a FASTA file from your staging area into your Narrative as an Assembly data object
This app completed without errors in 59s.
Objects
Created Object Name Type Description
Cyanophage_genome Assembly Imported Assembly
Links

Identify viruses using VirSorter

After importing our data, we need to identify which sequences are potentially viral. For this, we'll use VirSorter. The default parameters are fine. Although the default reference database is RefSeq DB, you can (should?) use Virome DB as long as you trust viral data from virome datasets. One final note: if the metagenomic dataset is primarily viral (e.g., purified viral isolates, 0.22 µm filtered bulk samples), then select "Enable virome decontamination."

Identifies viral sequences from viral and microbial metagenomes
This app completed without errors in 36m 13s.
Objects
Created Object Name Type Description
VirSorter-Category-1 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-2 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-3 Assembly KBase Assembly object from VIRSorter
VirSorter_binnedContigs BinnedContigs BinnedContigs from VIRSorter
Summary
Here are the results from your VIRSorter run. Above, you'll find a report with all the identified (putative) viral genomes, and below, links to the report as well as files generated.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/59912
  • VIRSorter_predicted_viral_fna.tar.gz - FASTA-formatted nucleotide sequences of VIRSorter predicted viruses
  • VIRSorter_predicted_viral_gb.tar.gz - Genbank-formatted sequences of VIRSorter predicted viruses

A bit of KBase maintenance...

Once VirSorter finishes, there will be 1 KBase Assembly object for each VirSorter category (up to 6), as well as a single VirSorter BinnedContigs object that includes a bin for each category. For those unfamiliar with VirSorter and its outputs, there are 6 "categories" corresponding to confidence levels and whether or not a prophage was detected. Categories 1-3 are predicted lytic (or rather, >80% of the contig is viral), and categories 4-6 are prophage. Category 1 is the highest-confidence viral category, and category 4 is the highest-confidence prophage category. For most automated pipelines, it is recommended to select categories 1-2 and 4-5.
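The category scheme described above can be sketched as a small lookup. This is only an illustration: the function name and the returned fields are made up for this example and are not part of VirSorter or KBase. The text only states that categories 1 and 4 are the highest-confidence ones in their groups; the ranks given to the remaining categories assume that the same 1-to-3 ordering repeats within each group.

```python
def describe_category(category: int) -> dict:
    """Summarize a VirSorter category number (hypothetical helper)."""
    if category not in range(1, 7):
        raise ValueError(f"VirSorter categories run from 1 to 6, got {category}")
    return {
        "category": category,
        # Categories 4-6 are the prophage categories.
        "prophage": category >= 4,
        # Assumption: rank 1 is the most confident and rank 3 the least,
        # repeating for categories 1-3 and 4-6.
        "confidence_rank": (category - 1) % 3 + 1,
    }


if __name__ == "__main__":
    for cat in range(1, 7):
        print(describe_category(cat))
```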

The next step is to classify these viral sequences using vConTACT2. For this, we'll need to jump through a few KBase hoops to get everything in the right format. The first step is to select only the bins we want to continue processing. Notice we have a VirSorter category 3 bin. We don't want that, because those are low-confidence predictions. Below, we'll remove category 3 and create a new BinnedContigs object with categories 1 and 2 combined.
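Conceptually, the bin editing described here boils down to dropping one bin and concatenating two others. The sketch below shows that idea with plain Python dictionaries. It is not how the KBase App is implemented, and the contig names and the bin names for categories 1 and 2 are placeholders invented for the example.

```python
from typing import Dict, Iterable, List


def merge_and_drop(
    bins: Dict[str, List[str]],
    merge: Iterable[str],
    drop: Iterable[str],
    merged_name: str,
) -> Dict[str, List[str]]:
    """Return a new bin mapping with `merge` bins combined and `drop` bins removed."""
    merge, drop = list(merge), list(drop)
    unknown = [name for name in merge + drop if name not in bins]
    if unknown:
        raise KeyError(f"Unknown bin(s): {unknown}")
    # Bins that are neither merged nor dropped pass through unchanged.
    result = {
        name: list(contigs)
        for name, contigs in bins.items()
        if name not in merge and name not in drop
    }
    # Concatenate the contigs of the merged bins, in the order given.
    result[merged_name] = [contig for name in merge for contig in bins[name]]
    return result


if __name__ == "__main__":
    # Placeholder bins: names and contigs are illustrative only.
    bins = {
        "VirSorter.001.fasta": ["contig_a", "contig_b"],
        "VirSorter.002.fasta": ["contig_c"],
        "VirSorter.003.fasta": ["contig_d"],
    }
    print(merge_and_drop(
        bins,
        merge=["VirSorter.001.fasta", "VirSorter.002.fasta"],
        drop=["VirSorter.003.fasta"],
        merged_name="VirSorter_binnedContigs12",
    ))
```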

Add or remove specific bins by name in BinnedContigs data
This app completed without errors in 31s.
Summary
Job Finished
Generated BinnedContigs: VirSorter_bins12 [59912/18/1]
--------------------------
Summary:
Binned contigs: 10
Total size of bins: 2
Bin IDs: VirSorter.003.fasta VirSorter_binnedContigs12

Continuing the conversion...

After the bins are merged, we need to extract them as Assemblies. The next step is predicting genes with Prodigal (which Prokka runs internally), and that requires either 1) Genome objects or 2) Assembly objects. The "Extract Bins as Assemblies from BinnedContigs" App, as the name implies, separates the bins and converts them to Assembly objects that can be operated on individually.

We will then annotate the extracted bins using Prokka.

Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 1m 8s.
Objects
Created Object Name Type Description
VirSorter_binnedContigs12_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished
Generated Assembly Reference: 59912/20/1
Annotate Assembly and Re-annotate Genomes with Prokka annotation pipeline.
This app completed without errors in 1m 14s.
Objects
Created Object Name Type Description
VirSorter_cat12 Genome Annotated Genome
Summary
Annotated Genome saved to: bbolduc:narrative_1586908887820/VirSorter_cat12
Number of genes predicted: 430
Number of protein coding genes: 428
Number of genes with non-hypothetical function: 29
Number of genes with EC-number: 20
Number of genes with Seed Subsystem Ontology: 16
Average protein length: 203 aa.
Output from Annotate Assembly and Re-annotate Genomes with Prokka(v1.12)
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/59912

With genes called, we can proceed to vConTACT2

Now that we've handled the minor details, we can take those called genes and use them in vConTACT2.

So what's happening in the background? The Assembly objects (i.e., the predicted viral genomes) have had their genes predicted. This "annotation/prediction" information is stored for each of the genomes in the "Genome" object. (Yes, it's a bit of a misnomer: the "Genome" object contains genomeS.) vConTACT2 will extract each viral genome and its associated gene predictions, and build the Gene2Genome table that underpins the whole analysis. For non-KBase users, building this table by hand could be a challenge unless you let vConTACT2 handle everything.
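For readers outside KBase, a gene-to-genome table is essentially one row per predicted protein, pointing back at the contig (genome) it came from. The sketch below builds such a table with Python's standard library. The column names (protein_id, contig_id, keywords) are an assumption about the layout vConTACT2 expects, so check them against the vConTACT2 documentation before relying on them. The protein and contig identifiers are invented for the example.

```python
import csv
import io
from typing import Dict, List, Tuple

# Assumed column layout; verify against the vConTACT2 documentation.
FIELDNAMES = ["protein_id", "contig_id", "keywords"]


def gene_to_genome_rows(
    genes_by_contig: Dict[str, List[Tuple[str, str]]]
) -> List[Dict[str, str]]:
    """Flatten {contig: [(protein_id, annotation), ...]} into one row per protein."""
    rows = []
    for contig_id, genes in genes_by_contig.items():
        for protein_id, annotation in genes:
            rows.append({
                "protein_id": protein_id,
                "contig_id": contig_id,
                "keywords": annotation,
            })
    return rows


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented identifiers, for illustration only.
    example = {
        "contig_1": [("contig_1_gene_1", "hypothetical protein"),
                     ("contig_1_gene_2", "terminase large subunit")],
        "contig_2": [("contig_2_gene_1", "hypothetical protein")],
    }
    print(rows_to_csv(gene_to_genome_rows(example)))
```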

There are a lot of options for vConTACT2. As a developer, there's a balance between giving enough options to allow for granular control of how the tool operates, and not over-burdening the user with options most are unlikely to change. In KBase, all the default options have been selected. There's no need to change anything - except if you want to use the most recent version of NCBI's Viral RefSeq. Often, users prefer to use the "older" version as that's what was used in the publication, so they're looking for consistency. If you'd like to use the most recent, then there might be very minor differences.

Viral cluster automatic cluster taxonomy
This app completed without errors in 2h 46m 3s.
Summary
Basic message to show in the report
Links
Output from vConTACT2 0.9.15
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/59912

You're done! Congratulations! Onward!

After you've run vConTACT2, you'll get a table with ALL of the genomes in the analysis. The table can be a bit unwieldy as it contains a lot of rows and columns. The easiest way to manage this (in KBase) is to use the filter function to find YOUR viral genomes. In our example, all of our genome names contain "VIRSorter". Filtering on that word gives us 8 rows. That's great. We put in 8 genomes, and vConTACT2 gave us 8 rows.

Now check the "VC" column. Four (4) of our genomes were clustered - you can double-check "VC Status" to see if they were Clustered. Once you've identified the cluster, check to see if they co-cluster with any known RefSeq genomes. VC_227_0 clusters with Synechococcus phage S-CBS1 and S-CBS3. That's also good, as the dataset used included that genome. VC_228_0 doesn't cluster with anything else, but knowing its 2 genomes cluster together gives you confidence that they're at least related at the genus level. VC_200_0 clusters with Cyanophage and Prochlorococcus viruses. Also good news, as there's also a Cyanophage in the sample metagenome!

The other 4 genomes don't cluster with anything, but one was an Outlier. Outliers are "connected" to clusters (in the network) but don't have sufficient confidence to place them within that cluster.
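If you download the table and prefer to do this filtering outside KBase, the same steps can be sketched in a few lines of Python. The example assumes the table is a CSV with "Genome", "VC" and "VC Status" columns (only the last two are named in this Narrative; "Genome" is an assumption). The rows below are toy data made up for the sketch, not the results of this run.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Toy table, invented for illustration; real output has many more columns.
SAMPLE = """
Genome,VC,VC Status
VIRSorter_contig_1,VC_1_0,Clustered
VIRSorter_contig_2,VC_2_0,Clustered
VIRSorter_contig_3,VC_2_0,Clustered
VIRSorter_contig_4,,Outlier
Reference phage A,VC_1_0,Clustered
Reference phage B,VC_3_0,Clustered
"""


def load_rows(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def my_genomes(rows: List[Dict[str, str]], keyword: str) -> List[Dict[str, str]]:
    """Keep rows whose genome name contains the keyword (case-insensitive)."""
    return [row for row in rows if keyword.lower() in row["Genome"].lower()]


def cluster_partners(rows: List[Dict[str, str]], keyword: str) -> Dict[str, List[str]]:
    """For each of our clustered genomes, list the other members of its VC."""
    members = defaultdict(list)
    for row in rows:
        if row["VC"]:
            members[row["VC"]].append(row["Genome"])
    partners = {}
    for row in my_genomes(rows, keyword):
        if row["VC"]:
            partners[row["Genome"]] = [
                genome for genome in members[row["VC"]] if genome != row["Genome"]
            ]
    return partners


if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    print(len(my_genomes(rows, "VIRSorter")), "of our genomes found")
    for genome, others in cluster_partners(rows, "VIRSorter").items():
        print(genome, "co-clusters with", others)
```

Genomes without a VC (such as the outlier in the toy data) are left out of the partner listing on purpose, since there is no cluster membership to report for them.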

More to follow!

Apps

  1. Annotate Assembly and Re-annotate Genomes with Prokka(v1.12)
    • Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30: 2068-2069. doi:10.1093/bioinformatics/btu153
  2. Extract Bins as Assemblies from BinnedContigs - v1.0.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  3. Import FASTA File as Assembly from Staging Area
    no citations
  4. Modify Bins in BinnedContigs - v1.0.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  5. vConTACT2 0.9.15
    • Bin Jang, H., Bolduc, B., Zablocki, O., Kuhn, J. H., Roux, S., Adriaenssens, E. M., Sullivan, M. B. (2019). Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nature Biotechnology, 37(6), 632-639. https://doi.org/10.1038/s41587-019-0100-8
  6. VirSorter 1.0.5
    • Roux S, Enault F, Hurwitz BL, Sullivan MB. (2015). VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985.