This pipeline processes a single metagenomic dataset (downloaded from NCBI SRA) from the Global Ocean Viromes dataset, identifies the viral sequences using VirSorter, and subsequently classifies them with vConTACT2. Along the way there are several intermediary steps to convert files from one KBase object to another - and though some of these can be skipped for this particular dataset - we're explicitly doing them here as to be more easily generalized to other datasets.
Tutorial Objectives
Table of contents:
The "most abundant biological entities on the planet"1,2, with 1031 virus-like particles3
3% (0-18%) of any microbial genome is really virus4,5
On the topic of composition, 8% of the human genome is viral6,7
Move 1029 genes per day, globally 8,9
Lyse between 20-40% of ocean microbes daily10
They steal metabolic genes and can encode key metabolic components (like photosynthesis!)11,12,13
Can infect other viruses (virophages)14
Viruses - not microbes - encode many of the toxins we think of as bacterial (bordtella, cholera, shiga, etc) 15
Fewer than 1% are culturable16,17
Viruses are often hidden in datasets (both viral and microbial) this guide will help you find them!
Good luck!
The reads in this dataset were generated from the Global Ocean Virome. and deposited as ERR594369. This is also known as Tara Oceans Expedition Station 36 surface ("SRF"), a coastal area in the Indian Ocean (more specifically, Northwest Arabian Sea), taken from ~5 m depth.
We first need to upload these reads into KBase. Files can be uploaded through a simple drag and drop interface (to the "left" of the Narrative, in the data window, or through Globus. A full guide on data upload and download can be found at http://kbase.us/data-upload-download-guide/. For this dataset, we'll be importing our reads directly from SRA into KBase. Alternatively, you could go to the link (more below), download the file(s) from SRA directly to your computer, and then upload them into KBase.
Let's stay simple and let KBase do all the work: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR594369
Below is a screenshot of the SRA page where the link is located on the page.
Following import, we always want to check the quality of the data going into an analysis. Unless you have supreme confidence in your viral isolation, extraction, sequencing preparation, and sequencing facility skills, it's always a good idea to know what quality is [eventually] going into an assembly. To quote a populat CS phrase, "Garbage In, Garbage Out." Essentially, this means that if you put poor quality data into your analysis, you're going to get poor quality results.
The FastQC of these enriched reads is already pretty good. You could get away with proceeding directly to assembly, but we'll trim reads to remove low-quality regions at the read ends and any residual adapters.
We'll trim using a popular read trimming tool, Trimmomatic . Defaults are okay, unless your read data has unique adapter or sequencing conditions.
With clean reads, we'll now assemble these reads into contigs using another popular assembler, MetaSPAdes . Of the assemblers currently available in the KBase ecosystem, I'd argue MetaSPAdes performs slightly better than MEGAHIT and better than IDBA-UD. For an excellent review of viral benchmarks regarding assemblers, see this PeerJ article .
We'll be using VirSorter to identify viral genomes. One caveat is that VirSorter does not work as well for small, linear genomes. As a general rule of thumb, we'll filter out any contigs that are less than 5-kb in length OR any non-circular genomes less than 1.5-kb. Since we don't (yet) have a way to filter on circularity, we'll play it safe and remove any contigs below 5-kb.
At this point, we're ready to identify what sequences are potentially viral. For this, we'll use VirSorter.
Default parameters are fine. However, though the default Reference database is RefSeq DB, as long as you trust viral data from virome datasets, you can (should?) use Virome DB. One final note: if the metagenomic database is primarily viral (e.g., purified viral isolates, 0.22 µm filtered bulk samples) then select "Enable virome decontamination."
Once VirSorter finishes, there will be 1 KBase Assembly object for each VirSorter category (up to 6!), as well as a VirSorter binnedContigs object that includes a bin for each category (just 1 object). For those unfamiliar with VirSorter and its outputs, there are 6 "categories" corresponding to confidence levels and whether or not a prophage was detected. Categories 1-3 are predicted lytic (or rather, >80% of the contig is predictted as viral), and categories 4-6 are prophage. Category 1 is the highest confidence viral category, and category 4 is the highest confidence prophage category. For most automated pipelines, it is recommended to select categories 1-2, and 3-4.
Eventually, we'll need to classify these viral sequences using vConTACT2 . For this, we need to go through a few KBase hoops to get everything in the right format. The first step is to only select the bins we want to continue processing.
In "Modify Bins", under Configure, select the two bins to merge. The "New Bin Name" and "Output BinnedContigs Name" can be the same.
Afterward, we need to extract those Bins as a KBase assembly object because Prokka requires Genome or Assembly objects to work.
vConTACT2 requires one fundamental input - a file that maps predicted genes to their genomes (more below, called the Gene2Genome file). And for that, genes need to be predicted. We go with Prokka here because it does a decent job at predicting genes - and while it's not the best when it comes to viral genomes, it's good enough for our purposes.
Essentially, what is happening is Prokka will call and annotate genes and save that information in the Genome object generated. vConTACT2 will read that information associated with each gene and genome, and create the mapping file.
Prokka's options should be adjusted to viral Kingdom.
Now that we've handled the minor details, we can take those called genes and use them in vConTACT2 .
vConTACT2 works by using a gene-sharing network to associate viral genomes. The more genes that are shared between two genomes, the higher the probability of those two genomes being phylogenetically related.
So what's happening in the background? vConTACT2 will extract each viral genome and its associated gene predictions, and build the Gene2Genome table that underpins the whole analysis. Thankfully, vConTACT2+KBase generates this file in the background! For non-KBase users, this could be a challenge unless you let vConTACT2 handle everything.
There are a lot of options for vConTACT2. As a developer, there's a balance between giving enough options to allow for granular control of how the tool operates, and not over-burdening the user with options most are unlikely to change. In KBase, all the default options have been selected. There's no need to change anything - except if you want to use the most recent version of NCBI's Viral RefSeq. Often, users prefer to use the "older" version as that's what was used in the publication, so they're looking for consistency. If you'd like to use the most recent, then there might be very minor differences.
After you've run vConTACT2, you'll get a table with ALL of the genomes in the analysis. The table can be a bit unwieldly as it contains a lot of rows and columns.
The easiest way to manage this (in KBase) is to use the filter function to find YOUR viral genomes. For example, all of our genomes contain NODE
- it's a byproduct of the SPAdes assembler. By adding NODE
in the appropriate filtering row under the column "Genome" will filter out all the reference genomes.
There are 895 viral genomes remaining. This is great, why? It matches the number of genomes that were annotated by Prokka. Now, how many of our viral genomes are clustered? Add Clustered
to the "VC Status" column. There are 495 Clustered (or Clustered/Singletons) genomes. Not bad, not great - but this is actual data - not pretty "mock" data.
Now let's find out if our viral genomes are associated with any references. This is not fast through the table in KBase, but it is doable, unless you want to download the csv and do some data wrangling. What I do is sort the table by "VC" and scroll through, keeping track of the "Size" and count of the VC. If the "Size" of the VC is greater than the counts of the VC, make note of that VC.
After sorting by VC...
And finding some interesting clusters!
One of the first examples we encounter is VC_332_0
. It has a VC size of 3, but only 1 member is seen. Remember - we still have the `NODE` filter on! Remove that NODE filter, revealing Cyanophage PSS2 and Synechococcus phage S-CBS2.
This is excellent, as the Cyanophage and Synechococcus phage are incredibly common in the ocean. A literature search revealed that these sequences are indeed found at the station our SRA reads are derived!
Additional searching reveals at least 7 other VCs that include both reference data and environmental sequences:
And taking a quick peek into one of the above VCs...
A VC is a first-pass classification of viral genomes. This frequently represents a group of genomes within the same genus. An example is VC_115
.
A VC Subcluster is a second-pass classification. It uses a pre-calculated distance calculation to refine the VCs. These are high-confidence, genus-level groupings. It is common for there to be no change between VC and VC Subcluster. An example is VC_115_0
. The final value ("0") represents the subcluster within the original VC. If all members of a VC Subcluster have VC_xx_0
, then there was no change. However, if there were further refinements, then the VC Subcluster would be: VC_115_0
, VC_115_1
, VC_115_2
...
Clustered: Genomes "successfully" placed into a genus-level group alongside at least one other genome.
Singleton: Genome was not found to be related to any other genome in the dataset. Most likely reason? Very weak or no overlap with any other genes found on any other genome. How to fix? Add more related genomes.
Clustered/Singleton: Genome was initially clustered, but distance-based optimization identified its placement in the cluster as not genus-level. However, no other genomes were found to be within the same "subcluster" as this genome, and resulted in the genome being stranded, without another member. For these genomes, you can look at its "VC" to see distantly related members. So a viral genome that is Clustered/Singleton (VC_221_1 or VC_221_2) is related to other VC_221 members, but vConTACT2 does not have confidence that these genomes are related at the genus level.
As an aside, VC_221 contains 6 genomes. Four (4) are in VC_221_0, with the other two in VC_221_1 and VC_221_2. Both those genomes are related to the four members of 221_0, but not at the genus level.
Overlap (VC_NN/VC_XX): Genomes identified as sharing significant portions of its gene content with multiple VCs. In other words, vConTACT2 cannot confidently assign it to one OR the other VC. This is incredibly common for viral groups that undergo extensive recombination.
Outlier: Genomes clustered by ClusterONE (a tool used internally by vConTACT2) but were not strongly connected to the other VC members. It is common for these genomes to share a single gene or two to the other members of its closest VC, however the other members in that VC are likely be be sharing 20, 30 or 50+ genes. It is not only unlikely that that particular genome is related at the genus level, but could perhaps be a spurious shared gene and is unlikely to be related at anything lower than family or order.
In summary, we've processed a viral metagenome from public reads available on SRA, identified contigs from the assembled sequence data as putative viruses using VirSorter, and classified them in approximately genus-level clusters with vConTACT2. This analysis revealed 8 VCs where environmental sequence data was found associated with references, and we can have confidence that those sequences at related to those references at the genus level.
Additionally, we've seen a few larger clusters with no associations to reference sequences that could be further investigated using KBase tools. For example - align those genomes in those VCs, identify shared features, extract those features, and make a discovery about certain proteins found throughout your dataset. Or, go one step further and pull in JGI data and then compare against a variety of JGI datasets for global significance - all using existing KBase apps!
For further reading:
Known Issues
Update History (MM-DD-YYYY)
11-03-2020
10-26-2020
Was this Narrative helpful? Please provide feedback and let us know:
If you have a question about one of our apps, need to report a bug or have another system-related query, please join our Help Board and post a ticket. Learn about how to do this here: http://kbase.us/help-board.