This pipeline processes a tiny metagenome, identifies viruses using VirSorter, and classifies them with vConTACT2. Along the way there are several intermediary steps to convert files from one format to another. Some of these can be skipped for this particular dataset, but we're doing them explicitly here so that the workflow generalizes more easily to other datasets.
In the next two steps, we'll be importing 2 files: a small sample of an already assembled metagenome and a Cyanophage genome. KBase needs to generate an object from the data in each file, rather than working on the file itself. This increases speed and interoperability between Apps at the cost of an additional step when bringing the data onto the system. Files can be uploaded through a simple drag and drop interface, or through Globus. A full data upload and download guide can be found at http://kbase.us/data-upload-download-guide/.
After importing our metagenomes, we'll need to identify which sequences are potentially viral. For this, we'll use VirSorter. Default parameters are fine. However, though the default Reference database is RefSeq DB, as long as you trust viral data from virome datasets, you can (should?) use Virome DB. One final note: if the metagenomic dataset is primarily viral (e.g., purified viral isolates, 0.22 µm filtered bulk samples) then select "Enable virome decontamination."
Once VirSorter finishes, there will be 1 KBase Assembly object for each VirSorter category (up to 6), as well as a single VirSorter BinnedContigs object that includes a bin for each category. For those unfamiliar with VirSorter and its outputs, there are 6 "categories" corresponding to confidence levels and whether or not a prophage was detected. Categories 1-3 are predicted lytic (or rather, >80% of the contig is viral), and categories 4-6 are prophage. Category 1 is the highest confidence viral category, and category 4 is the highest confidence prophage category. For most automated pipelines, it is recommended to select categories 1-2 and 4-5.
The next step is to classify these viral sequences using vConTACT2. For this, we'll need to jump through a few KBase hoops to get everything in the right format. The first step is to select only the bins we want to continue processing. Notice we have a VirSorter category 3 bin. We don't want that, because those are low-confidence predictions. Below, we'll remove category 3 and create a new BinnedContigs object with categories 1 and 2 combined.
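For readers who want to see the logic of this bin selection outside KBase, here is a minimal Python sketch. It is only an illustration and not what the KBase App does internally: the bin names, the contig IDs, and the `merge_bins` helper are all hypothetical.

```python
# Hypothetical stand-in for a VirSorter BinnedContigs object:
# bin name -> list of contig IDs (all names here are made up).
binned_contigs = {
    "VIRSorter_category_1": ["contig_01", "contig_02"],
    "VIRSorter_category_2": ["contig_03"],
    "VIRSorter_category_3": ["contig_04", "contig_05"],
}


def merge_bins(bins, keep):
    """Return one merged, de-duplicated contig list from the bins named in `keep`."""
    merged = []
    for name in keep:
        for contig in bins.get(name, []):
            if contig not in merged:
                merged.append(contig)
    return merged


# Keep categories 1 and 2; the low-confidence category 3 bin is dropped.
selected = merge_bins(
    binned_contigs, keep=["VIRSorter_category_1", "VIRSorter_category_2"]
)
print(selected)
```

Passing the bins to keep as an explicit list makes it easy to include other categories later without touching the helper.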
After the bins are merged, we need to extract them as Assemblies. The next step is predicting genes with Prodigal (run here through Prokka), and that requires either 1) Genome objects or 2) Assembly objects. The "Extract Bins as Assemblies from BinnedContigs" App, as the name implies, separates the bins and converts them to Assembly objects that can be operated on individually.
We will then annotate the extracted bins using Prokka.
Now that we've handled the minor details, we can take those called genes and use them in vConTACT2.
So what's happening in the background? The Assembly objects (i.e., the predicted viral genomes) have had their genes predicted. This "annotation/prediction" information is stored for each of the genomes in the "Genome" object. (Yes, it's a bit of a misnomer: the "Genome" object contains genomeS.) vConTACT2 will extract each viral genome and its associated gene predictions, and build the Gene2Genome table that underpins the whole analysis. For non-KBase users, this could be a challenge unless you let vConTACT2 handle everything.
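To make the Gene2Genome idea concrete, here is a minimal Python sketch of building such a table by hand. It assumes a three-column CSV layout (`protein_id`, `contig_id`, `keywords`), which is our understanding of the mapping file vConTACT2 works from; verify this against the vConTACT2 documentation before relying on it. The genome names, protein names, and the `build_gene2genome` helper are hypothetical.

```python
import csv
import io

# Hypothetical gene predictions: genome (contig) ID -> predicted protein IDs.
predicted_genes = {
    "viral_genome_A": ["viral_genome_A_1", "viral_genome_A_2"],
    "viral_genome_B": ["viral_genome_B_1"],
}


def build_gene2genome(genes_by_genome):
    """Return CSV text with one row per protein, linking it to its genome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed header; check the vConTACT2 documentation.
    writer.writerow(["protein_id", "contig_id", "keywords"])
    for genome_id, protein_ids in genes_by_genome.items():
        for protein_id in protein_ids:
            # "None" is a placeholder for proteins without annotation keywords.
            writer.writerow([protein_id, genome_id, "None"])
    return buffer.getvalue()


gene2genome_csv = build_gene2genome(predicted_genes)
print(gene2genome_csv)
```

The point of the table is simply that every protein carries a pointer back to the genome it came from, so shared proteins can later be turned into links between genomes.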
There are a lot of options for vConTACT2. As a developer, there's a balance between giving enough options to allow for granular control of how the tool operates, and not over-burdening the user with options most are unlikely to change. In KBase, all the default options have been selected. There's no need to change anything - except if you want to use the most recent version of NCBI's Viral RefSeq. Often, users prefer to use the "older" version as that's what was used in the publication, so they're looking for consistency. If you'd like to use the most recent, then there might be very minor differences.
After you've run vConTACT2, you'll get a table with ALL of the genomes in the analysis. The table can be a bit unwieldy, as it contains a lot of rows and columns. The easiest way to manage this (in KBase) is to use the filter function to find YOUR viral genomes. In our example, all of our genome names contain "VIRSorter". Filtering on that word gives us 8 rows. That's great: we put in 8 genomes, and vConTACT2 gave us 8 rows. Now check the "VC" column. Four (4) of our genomes were clustered; you can double-check "VC Status" to confirm that they are marked Clustered. Once you've identified a cluster, check whether its members co-cluster with any known RefSeq genomes. VC_227_0 clusters with Synechococcus phages S-CBS1 and S-CBS3. That's also good, as the dataset used included that genome. VC_228_0 doesn't cluster with any reference genomes, but knowing that its 2 genomes cluster together gives you confidence that they're at least related at the genus level. VC_200_0 clusters with Cyanophage and Prochlorococcus viruses. Also good news, as there's a Cyanophage in the sample metagenome! The other 4 genomes don't cluster with anything, but one was an Outlier. Outliers are "connected" to clusters (in the network) but don't have sufficient confidence to be placed within that cluster.
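If you export the table, the same filtering can be done in a script. Here is a minimal Python sketch, assuming the export is a CSV with "Genome", "VC", and "VC Status" columns. The "Genome" column name is our assumption, the `filter_genomes` helper is hypothetical, and the rows are made-up placeholders rather than the results described above.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for an exported vConTACT2 table.
table_csv = """Genome,VC,VC Status
VIRSorter_example_1,VC_1_0,Clustered
VIRSorter_example_2,VC_1_0,Clustered
VIRSorter_example_3,,Outlier
Reference_phage_X,VC_1_0,Clustered
"""


def filter_genomes(csv_text, keyword):
    """Return the rows whose Genome name contains `keyword`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keyword in row["Genome"]]


# Step 1: find our own genomes and tally their cluster status.
ours = filter_genomes(table_csv, "VIRSorter")
status_counts = Counter(row["VC Status"] for row in ours)

# Step 2: list reference genomes that share a cluster with any of ours.
our_clusters = {row["VC"] for row in ours if row["VC"]}
co_clustered = [
    row["Genome"]
    for row in csv.DictReader(io.StringIO(table_csv))
    if row["VC"] in our_clusters and "VIRSorter" not in row["Genome"]
]
print(len(ours), dict(status_counts), co_clustered)
```

The two steps mirror the manual workflow: first filter on the shared keyword to confirm every input genome is present, then look up which other genomes carry the same "VC" value.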
More to follow!