Introduction

This pipeline processes a single metagenomic dataset (downloaded from NCBI SRA) from the Global Ocean Viromes dataset, identifies the viral sequences using VirSorter, and subsequently classifies them with vConTACT2. Along the way there are several intermediary steps to convert files from one KBase object to another - and though some of these can be skipped for this particular dataset - we're explicitly doing them here as to be more easily generalized to other datasets.

iVirus 2.0 on KBase — KBase Processing pipeline using iVirus 2.0 apps

Tutorial Objectives

Assemble viral metagenome using raw reads
Identify viral sequences from assembly data
Classify predicted viral sequences

Table of contents:

Scientific background... and why viruses are important!
Importing read data from the SRA
Assessing read quality with FastQC
Quality control of reads with Trimmomatic
Post-QC read quality assesssment
Assembling the data with metaSPAdes
Cleaning up contigs
Identifying dsDNA phage using VirSorter
A bit of KBase maintenance
Annotating viral genomes with Prokka
Classifying viral genomes using vConTACT2
Post-vConTACT2 analysis
vConTACT2 Quick Reference Guide
Conclusions
Version History
Feedback and Help

Background - why study viruses?!

Viruses are important for more than a few reasons!

The "most abundant biological entities on the planet"^1,2, with 10³¹ virus-like particles³

3% (0-18%) of any microbial genome is really virus^4,5

On the topic of composition, 8% of the human genome is viral^6,7

Move 10²⁹ genes per day, globally ^8,9

Lyse between 20-40% of ocean microbes daily¹⁰

They steal metabolic genes and can encode key metabolic components (like photosynthesis!)^11,12,13

Can infect other viruses (virophages)¹⁴

Viruses - not microbes - encode many of the toxins we think of as bacterial (bordtella, cholera, shiga, etc) ¹⁵

Fewer than 1% are culturable^16,17

Viruses are often hidden in datasets (both viral and microbial) this guide will help you find them!

Good luck!

Importing read data from the SRA

The reads in this dataset were generated from the Global Ocean Virome. and deposited as ERR594369. This is also known as Tara Oceans Expedition Station 36 surface ("SRF"), a coastal area in the Indian Ocean (more specifically, Northwest Arabian Sea), taken from ~5 m depth.

Image of Tara Oceans Expedition Station 36 — Figure of Station 36 location. Figure heavily modified from Roux et al (2016) Nature

We first need to upload these reads into KBase. Files can be uploaded through a simple drag and drop interface (to the "left" of the Narrative, in the data window, or through Globus. A full guide on data upload and download can be found at http://kbase.us/data-upload-download-guide/. For this dataset, we'll be importing our reads directly from SRA into KBase. Alternatively, you could go to the link (more below), download the file(s) from SRA directly to your computer, and then upload them into KBase.

Let's stay simple and let KBase do all the work: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR594369

Below is a screenshot of the SRA page where the link is located on the page.

Importing read data from the SRA — Figure of link location to copy-and-paste into "Import SRA File as Reads" app in KBase

Assessing read quality

Following import, we always want to check the quality of the data going into an analysis. Unless you have supreme confidence in your viral isolation, extraction, sequencing preparation, and sequencing facility skills, it's always a good idea to know what quality is [eventually] going into an assembly. To quote a populat CS phrase, "Garbage In, Garbage Out." Essentially, this means that if you put poor quality data into your analysis, you're going to get poor quality results.

Quality control of reads

The FastQC of these enriched reads is already pretty good. You could get away with proceeding directly to assembly, but we'll trim reads to remove low-quality regions at the read ends and any residual adapters.

We'll trim using a popular read trimming tool, Trimmomatic . Defaults are okay, unless your read data has unique adapter or sequencing conditions.

Post-QC read assessment

Following read trimming, we'll check to ensure that 1) all adapters are removed (if they even existed in the 1st place) and 2) that the reads are of sufficient quality for assembly.

Assembly with MetaSPAdes

With clean reads, we'll now assemble these reads into contigs using another popular assembler, MetaSPAdes . Of the assemblers currently available in the KBase ecosystem, I'd argue MetaSPAdes performs slightly better than MEGAHIT and better than IDBA-UD. For an excellent review of viral benchmarks regarding assemblers, see this PeerJ article .

Cleaning up the contigs¶

We'll be using VirSorter to identify viral genomes. One caveat is that VirSorter does not work as well for small, linear genomes. As a general rule of thumb, we'll filter out any contigs that are less than 5-kb in length OR any non-circular genomes less than 1.5-kb. Since we don't (yet) have a way to filter on circularity, we'll play it safe and remove any contigs below 5-kb.

Identifying viral genomes using VirSorter

At this point, we're ready to identify what sequences are potentially viral. For this, we'll use VirSorter.

Image of VirSorter categories and representative examples — VirSorter. Figure heavily modified from Roux et al (2015) PeerJ

Default parameters are fine. However, though the default Reference database is RefSeq DB, as long as you trust viral data from virome datasets, you can (should?) use Virome DB. One final note: if the metagenomic database is primarily viral (e.g., purified viral isolates, 0.22 µm filtered bulk samples) then select "Enable virome decontamination."

A bit of KBase maintenance...

Once VirSorter finishes, there will be 1 KBase Assembly object for each VirSorter category (up to 6!), as well as a VirSorter binnedContigs object that includes a bin for each category (just 1 object). For those unfamiliar with VirSorter and its outputs, there are 6 "categories" corresponding to confidence levels and whether or not a prophage was detected. Categories 1-3 are predicted lytic (or rather, >80% of the contig is predictted as viral), and categories 4-6 are prophage. Category 1 is the highest confidence viral category, and category 4 is the highest confidence prophage category. For most automated pipelines, it is recommended to select categories 1-2, and 3-4.

Eventually, we'll need to classify these viral sequences using vConTACT2 . For this, we need to go through a few KBase hoops to get everything in the right format. The first step is to only select the bins we want to continue processing.

In "Modify Bins", under Configure, select the two bins to merge. The "New Bin Name" and "Output BinnedContigs Name" can be the same.

Afterward, we need to extract those Bins as a KBase assembly object because Prokka requires Genome or Assembly objects to work.

Annotate viral genomes with Prokka

vConTACT2 requires one fundamental input - a file that maps predicted genes to their genomes (more below, called the Gene2Genome file). And for that, genes need to be predicted. We go with Prokka here because it does a decent job at predicting genes - and while it's not the best when it comes to viral genomes, it's good enough for our purposes.

Essentially, what is happening is Prokka will call and annotate genes and save that information in the Genome object generated. vConTACT2 will read that information associated with each gene and genome, and create the mapping file.

Prokka's options should be adjusted to viral Kingdom.

Classify annotated viral genomes with vConTACT2

Now that we've handled the minor details, we can take those called genes and use them in vConTACT2 .

vConTACT2 works by using a gene-sharing network to associate viral genomes. The more genes that are shared between two genomes, the higher the probability of those two genomes being phylogenetically related.

Image of vContact2's gene-sharing network — vConTACT2 virus classification. Figure heavily modified from Jang and Bolduc et al (2019) Nat. Biotech

So what's happening in the background? vConTACT2 will extract each viral genome and its associated gene predictions, and build the Gene2Genome table that underpins the whole analysis. Thankfully, vConTACT2+KBase generates this file in the background! For non-KBase users, this could be a challenge unless you let vConTACT2 handle everything.

There are a lot of options for vConTACT2. As a developer, there's a balance between giving enough options to allow for granular control of how the tool operates, and not over-burdening the user with options most are unlikely to change. In KBase, all the default options have been selected. There's no need to change anything - except if you want to use the most recent version of NCBI's Viral RefSeq. Often, users prefer to use the "older" version as that's what was used in the publication, so they're looking for consistency. If you'd like to use the most recent, then there might be very minor differences.

A closer look at the vConTACT2 results

After you've run vConTACT2, you'll get a table with ALL of the genomes in the analysis. The table can be a bit unwieldly as it contains a lot of rows and columns.

The easiest way to manage this (in KBase) is to use the filter function to find YOUR viral genomes. For example, all of our genomes contain NODE - it's a byproduct of the SPAdes assembler. By adding NODE in the appropriate filtering row under the column "Genome" will filter out all the reference genomes.

Image of NODE filtering on column — Filtering vConTACT2 results table using "NODE"

There are 895 viral genomes remaining. This is great, why? It matches the number of genomes that were annotated by Prokka. Now, how many of our viral genomes are clustered? Add Clustered to the "VC Status" column. There are 495 Clustered (or Clustered/Singletons) genomes. Not bad, not great - but this is actual data - not pretty "mock" data.

Image of NODE and VC Status filtering on column — Filtering vConTACT2 results table using "NODE" and "Clustered"

Now let's find out if our viral genomes are associated with any references. This is not fast through the table in KBase, but it is doable, unless you want to download the csv and do some data wrangling. What I do is sort the table by "VC" and scroll through, keeping track of the "Size" and count of the VC. If the "Size" of the VC is greater than the counts of the VC, make note of that VC.

After sorting by VC...

Image of NODE and VC Status filtering on column, then sorted — Sorting vConTACT2 results table after using "NODE" and "Clustered"

And finding some interesting clusters!

One of the first examples we encounter is VC_332_0. It has a VC size of 3, but only 1 member is seen. Remember - we still have the `NODE` filter on! Remove that NODE filter, revealing Cyanophage PSS2 and Synechococcus phage S-CBS2.

Image of VC 332 — Revealing VC 332 without "NODE" filter on, revealing its members

This is excellent, as the Cyanophage and Synechococcus phage are incredibly common in the ocean. A literature search revealed that these sequences are indeed found at the station our SRA reads are derived!

Additional searching reveals at least 7 other VCs that include both reference data and environmental sequences:

VC_286
VC_331
VC_332
VC_333
VC_339
VC_461
VC_497
VC_506

And taking a quick peek into one of the above VCs...

Image of VC 497 — VC 497 and its Pelagibacter phage member

A quick reference guide to vConTACT2

What is a "VC"?

A Viral Cluster is a unit of classification. vConTACT2 uses two terms - with subtle differences - in order to classify sequences.

A VC is a first-pass classification of viral genomes. This frequently represents a group of genomes within the same genus. An example is VC_115.

A VC Subcluster is a second-pass classification. It uses a pre-calculated distance calculation to refine the VCs. These are high-confidence, genus-level groupings. It is common for there to be no change between VC and VC Subcluster. An example is VC_115_0. The final value ("0") represents the subcluster within the original VC. If all members of a VC Subcluster have VC_xx_0, then there was no change. However, if there were further refinements, then the VC Subcluster would be: VC_115_0, VC_115_1, VC_115_2...

VC Status

How vConTACT2 describes how it classified a sequence

Clustered: Genomes "successfully" placed into a genus-level group alongside at least one other genome.

Singleton: Genome was not found to be related to any other genome in the dataset. Most likely reason? Very weak or no overlap with any other genes found on any other genome. How to fix? Add more related genomes.

Clustered/Singleton: Genome was initially clustered, but distance-based optimization identified its placement in the cluster as not genus-level. However, no other genomes were found to be within the same "subcluster" as this genome, and resulted in the genome being stranded, without another member. For these genomes, you can look at its "VC" to see distantly related members. So a viral genome that is Clustered/Singleton (VC_221_1 or VC_221_2) is related to other VC_221 members, but vConTACT2 does not have confidence that these genomes are related at the genus level.

As an aside, VC_221 contains 6 genomes. Four (4) are in VC_221_0, with the other two in VC_221_1 and VC_221_2. Both those genomes are related to the four members of 221_0, but not at the genus level.

Overlap (VC_NN/VC_XX): Genomes identified as sharing significant portions of its gene content with multiple VCs. In other words, vConTACT2 cannot confidently assign it to one OR the other VC. This is incredibly common for viral groups that undergo extensive recombination.

Outlier: Genomes clustered by ClusterONE (a tool used internally by vConTACT2) but were not strongly connected to the other VC members. It is common for these genomes to share a single gene or two to the other members of its closest VC, however the other members in that VC are likely be be sharing 20, 30 or 50+ genes. It is not only unlikely that that particular genome is related at the genus level, but could perhaps be a spurious shared gene and is unlikely to be related at anything lower than family or order.

Conclusions

In summary, we've processed a viral metagenome from public reads available on SRA, identified contigs from the assembled sequence data as putative viruses using VirSorter, and classified them in approximately genus-level clusters with vConTACT2. This analysis revealed 8 VCs where environmental sequence data was found associated with references, and we can have confidence that those sequences at related to those references at the genus level.

Additionally, we've seen a few larger clusters with no associations to reference sequences that could be further investigated using KBase tools. For example - align those genomes in those VCs, identify shared features, extract those features, and make a discovery about certain proteins found throughout your dataset. Or, go one step further and pull in JGI data and then compare against a variety of JGI datasets for global significance - all using existing KBase apps!

Thanks for following along and using these apps for your research!

For further reading:

Roux S, Enault F, Hurwitz BL, Sullivan MB. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3: e985. doi:10.7717/peerj.985
The original VirSorter paper

Bolduc B, Jang HB, Doulcier G, You Z-QZ, Roux S, Sullivan MB. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ. 2017;5: e3243. doi:10.7717/peerj.3243
The initial vConTACT paper describing the theory and background

Jang HB, Bolduc B, Zablocki O, Kuhn JH, Roux S, Adriaenssens EM, et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019;37: 632–639. doi:10.1038/s41587-019-0100-8
The significantly improved version of vConTACT that's faster, more accurate, and capable of handling larger datasets

Known Issues & Update History

Known Issues

VirSorter may generate a "Bad KBase object" error. Can be safely ignored
VirSorter DOES NOT LIKE to use Diamond unless adding trusted genomes
vConTACT2 can only work on a single annotated Genome and Assembly object

Update History (MM-DD-YYYY)

11-03-2020

Revised introduction
Added navigation links to TOC and quick home buttons throughout
Adjusted some formatting to adhere better to KBase styling guidelines
Significantly expanded vConTACT2 results section
Added vConTACT2 quick reference guide
Expanded conclusion section

10-26-2020

Initial draft of tutorial built

Feedback & Helpdesk

Was this Narrative helpful? Please provide feedback and let us know:

If you have a question about one of our apps, need to report a bug or have another system-related query, please join our Help Board and post a ticket. Learn about how to do this here: http://kbase.us/help-board.

Created Object Name	Type	Description
Reads.t_paired	PairedEndLibrary	Trimmed Reads
Reads.t_unpaired_fwd	SingleEndLibrary	Trimmed Unpaired Forward Reads
Reads.t_unpaired_rev	SingleEndLibrary	Trimmed Unpaired Reverse Reads

Created Object Name	Type	Description
VirSorter-Category-2	Assembly	KBase Assembly object from VIRSorter
VirSorter-Category-5	Assembly	KBase Assembly object from VIRSorter
VirSorter-Category-6	Assembly	KBase Assembly object from VIRSorter
VirSorter-Category-1	Assembly	KBase Assembly object from VIRSorter
VirSorter-Category-3	Assembly	KBase Assembly object from VIRSorter
VirSorter-Category-4	Assembly	KBase Assembly object from VIRSorter
VirSorter_binnedContigs	BinnedContigs	BinnedContigs from VIRSorter