Generated November 3, 2020

# Introduction

This pipeline processes a single metagenomic dataset (downloaded from NCBI SRA) from the Global Ocean Viromes dataset, identifies the viral sequences using VirSorter, and subsequently classifies them with vConTACT2. Along the way there are several intermediary steps to convert files from one KBase object to another - and though some of these can be skipped for this particular dataset - we're explicitly doing them here as to be more easily generalized to other datasets.

Tutorial Objectives

• Assemble viral metagenome using raw reads
• Identify viral sequences from assembly data
• Classify predicted viral sequences

## Background - why study viruses?!

### Viruses are important for more than a few reasons!

The "most abundant biological entities on the planet"1,2, with 1031 virus-like particles3

3% (0-18%) of any microbial genome is really virus4,5

On the topic of composition, 8% of the human genome is viral6,7

Move 1029 genes per day, globally 8,9

Lyse between 20-40% of ocean microbes daily10

They steal metabolic genes and can encode key metabolic components (like photosynthesis!)11,12,13

Can infect other viruses (virophages)14

Viruses - not microbes - encode many of the toxins we think of as bacterial (bordtella, cholera, shiga, etc) 15

Fewer than 1% are culturable16,17

Viruses are often hidden in datasets (both viral and microbial) this guide will help you find them!

Good luck!

## Importing read data from the SRA

The reads in this dataset were generated from the Global Ocean Virome. and deposited as ERR594369. This is also known as Tara Oceans Expedition Station 36 surface ("SRF"), a coastal area in the Indian Ocean (more specifically, Northwest Arabian Sea), taken from ~5 m depth.

Let's stay simple and let KBase do all the work: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR594369

Below is a screenshot of the SRA page where the link is located on the page.

Import an SRA file from a web URL into your Narrative as a Reads data object.
This app completed without errors in 2h 15m 4s.
Objects
Created Object Name Type Description
Report

Following import, we always want to check the quality of the data going into an analysis. Unless you have supreme confidence in your viral isolation, extraction, sequencing preparation, and sequencing facility skills, it's always a good idea to know what quality is [eventually] going into an assembly. To quote a populat CS phrase, "Garbage In, Garbage Out." Essentially, this means that if you put poor quality data into your analysis, you're going to get poor quality results.

A quality control application for high throughput sequence data.
This app completed without errors in 31m 2s.
Report
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/75811
• ERR594369_75811_2_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
• ERR594369_75811_2_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report

The FastQC of these enriched reads is already pretty good. You could get away with proceeding directly to assembly, but we'll trim reads to remove low-quality regions at the read ends and any residual adapters.

We'll trim using a popular read trimming tool, Trimmomatic . Defaults are okay, unless your read data has unique adapter or sequencing conditions.

Trim paired- or single-end Illumina reads with Trimmomatic.
This app completed without errors in 2h 8m 60s.
Objects
Created Object Name Type Description
Report

Following read trimming, we'll check to ensure that 1) all adapters are removed (if they even existed in the 1st place) and 2) that the reads are of sufficient quality for assembly.

A quality control application for high throughput sequence data.
This app completed without errors in 1h 53m 36s.
Report
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/75811
• Reads.t_paired_75811_5_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
• Reads.t_paired_75811_5_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report

With clean reads, we'll now assemble these reads into contigs using another popular assembler, MetaSPAdes . Of the assemblers currently available in the KBase ecosystem, I'd argue MetaSPAdes performs slightly better than MEGAHIT and better than IDBA-UD. For an excellent review of viral benchmarks regarding assemblers, see this PeerJ article .

This app completed without errors in 17h 14m 47s.
Objects
Created Object Name Type Description
Report
Summary
Assembly saved to: bbolduc:narrative_1603837393062/SPAdes.contigs Assembled into 18226 contigs. Avg Length: 5849.286678371557 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 17860 -- 2000.0 to 31009.3 bp 249 -- 31009.3 to 60018.6 bp 67 -- 60018.6 to 89027.9 bp 20 -- 89027.9 to 118037.2 bp 13 -- 118037.2 to 147046.5 bp 7 -- 147046.5 to 176055.8 bp 3 -- 176055.8 to 205065.1 bp 0 -- 205065.1 to 234074.4 bp 4 -- 234074.4 to 263083.69999999995 bp 3 -- 263083.69999999995 to 292093.0 bp

## Cleaning up the contigs¶

We'll be using VirSorter to identify viral genomes. One caveat is that VirSorter does not work as well for small, linear genomes. As a general rule of thumb, we'll filter out any contigs that are less than 5-kb in length OR any non-circular genomes less than 1.5-kb. Since we don't (yet) have a way to filter on circularity, we'll play it safe and remove any contigs below 5-kb.

Allows users to extract the longer contigs from Assembly objects.
This app completed without errors in 2m 11s.
Objects
Created Object Name Type Description
Summary
ORIGINAL Contig count: 18226 in Assembly SPAdes.contigs FILTERED Contig count: 5019 in Assembly SPAdes.contigs.5k

## Identifying viral genomes using VirSorter

At this point, we're ready to identify what sequences are potentially viral. For this, we'll use VirSorter.

Default parameters are fine. However, though the default Reference database is RefSeq DB, as long as you trust viral data from virome datasets, you can (should?) use Virome DB. One final note: if the metagenomic database is primarily viral (e.g., purified viral isolates, 0.22 µm filtered bulk samples) then select "Enable virome decontamination."

Identifies viral sequences from viral and microbial metagenomes
This app completed without errors in 3h 44m 14s.
Objects
Created Object Name Type Description
VirSorter-Category-2 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-5 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-6 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-1 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-3 Assembly KBase Assembly object from VIRSorter
VirSorter-Category-4 Assembly KBase Assembly object from VIRSorter
VirSorter_binnedContigs BinnedContigs BinnedContigs from VIRSorter
Report
Summary
Here are the results from your VIRSorter run. Above, you'll find a report with all the identified (putative) viral genomes, and below, links to the report as well as files generated.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/75811
• VIRSorter_predicted_viral_fna.tar.gz - FASTA-formatted nucleotide sequences of VIRSorter predicted viruses
• VIRSorter_predicted_viral_gb.tar.gz - Genbank-formatted sequences of VIRSorter predicted viruses

## A bit of KBase maintenance...

Once VirSorter finishes, there will be 1 KBase Assembly object for each VirSorter category (up to 6!), as well as a VirSorter binnedContigs object that includes a bin for each category (just 1 object). For those unfamiliar with VirSorter and its outputs, there are 6 "categories" corresponding to confidence levels and whether or not a prophage was detected. Categories 1-3 are predicted lytic (or rather, >80% of the contig is predictted as viral), and categories 4-6 are prophage. Category 1 is the highest confidence viral category, and category 4 is the highest confidence prophage category. For most automated pipelines, it is recommended to select categories 1-2, and 3-4.

Eventually, we'll need to classify these viral sequences using vConTACT2 . For this, we need to go through a few KBase hoops to get everything in the right format. The first step is to only select the bins we want to continue processing.

In "Modify Bins", under Configure, select the two bins to merge. The "New Bin Name" and "Output BinnedContigs Name" can be the same.

Afterward, we need to extract those Bins as a KBase assembly object because Prokka requires Genome or Assembly objects to work.

Add or remove specific bins by name in BinnedContigs data
This app completed without errors in 57s.
Summary
Job Finished Generated BinnedContigs: VirSorter_bins1245 [75811/56/2] -------------------------- Summary: Binned contigs: 895 Total size of bins: 1 Bin IDs: VirSorter_cat1245
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 1m 37s.
Objects
Created Object Name Type Description
VirSorter_cat1245_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 75811/58/1

## Annotate viral genomes with Prokka

vConTACT2 requires one fundamental input - a file that maps predicted genes to their genomes (more below, called the Gene2Genome file). And for that, genes need to be predicted. We go with Prokka here because it does a decent job at predicting genes - and while it's not the best when it comes to viral genomes, it's good enough for our purposes.

Essentially, what is happening is Prokka will call and annotate genes and save that information in the Genome object generated. vConTACT2 will read that information associated with each gene and genome, and create the mapping file.

Prokka's options should be adjusted to viral Kingdom.

Annotate Assembly and Re-annotate Genomes with Prokka annotation pipeline.
This app completed without errors in 4m 43s.
Objects
Created Object Name Type Description
Virsorter_cat1245 Genome Annotated Genome
Summary
Annotated Genome saved to: bbolduc:narrative_1603837393062/Virsorter_cat1245 Number of genes predicted: 17696 Number of protein coding genes: 17465 Number of genes with non-hypothetical function: 2274 Number of genes with EC-number: 834 Number of genes with Seed Subsystem Ontology: 761 Average protein length: 254 aa.
Output from Annotate Assembly and Re-annotate Genomes with Prokka(v1.12)
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/75811

## Classify annotated viral genomes with vConTACT2

Now that we've handled the minor details, we can take those called genes and use them in vConTACT2 .

vConTACT2 works by using a gene-sharing network to associate viral genomes. The more genes that are shared between two genomes, the higher the probability of those two genomes being phylogenetically related.

So what's happening in the background? vConTACT2 will extract each viral genome and its associated gene predictions, and build the Gene2Genome table that underpins the whole analysis. Thankfully, vConTACT2+KBase generates this file in the background! For non-KBase users, this could be a challenge unless you let vConTACT2 handle everything.

There are a lot of options for vConTACT2. As a developer, there's a balance between giving enough options to allow for granular control of how the tool operates, and not over-burdening the user with options most are unlikely to change. In KBase, all the default options have been selected. There's no need to change anything - except if you want to use the most recent version of NCBI's Viral RefSeq. Often, users prefer to use the "older" version as that's what was used in the publication, so they're looking for consistency. If you'd like to use the most recent, then there might be very minor differences.

Viral cluster automatic cluster taxonomy
This app completed without errors in 1h 26m 12s.
Report
Summary
Basic message to show in the report
Output from vConTACT2 0.9.19
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/75811

## A closer look at the vConTACT2 results

After you've run vConTACT2, you'll get a table with ALL of the genomes in the analysis. The table can be a bit unwieldly as it contains a lot of rows and columns.

The easiest way to manage this (in KBase) is to use the filter function to find YOUR viral genomes. For example, all of our genomes contain NODE - it's a byproduct of the SPAdes assembler. By adding NODE in the appropriate filtering row under the column "Genome" will filter out all the reference genomes.

There are 895 viral genomes remaining. This is great, why? It matches the number of genomes that were annotated by Prokka. Now, how many of our viral genomes are clustered? Add Clustered to the "VC Status" column. There are 495 Clustered (or Clustered/Singletons) genomes. Not bad, not great - but this is actual data - not pretty "mock" data.

Now let's find out if our viral genomes are associated with any references. This is not fast through the table in KBase, but it is doable, unless you want to download the csv and do some data wrangling. What I do is sort the table by "VC" and scroll through, keeping track of the "Size" and count of the VC. If the "Size" of the VC is greater than the counts of the VC, make note of that VC.

After sorting by VC...

And finding some interesting clusters!

One of the first examples we encounter is VC_332_0. It has a VC size of 3, but only 1 member is seen. Remember - we still have the NODE filter on! Remove that NODE filter, revealing Cyanophage PSS2 and Synechococcus phage S-CBS2.

This is excellent, as the Cyanophage and Synechococcus phage are incredibly common in the ocean. A literature search revealed that these sequences are indeed found at the station our SRA reads are derived!

Additional searching reveals at least 7 other VCs that include both reference data and environmental sequences:

• VC_286
• VC_331
• VC_332
• VC_333
• VC_339
• VC_461
• VC_497
• VC_506

And taking a quick peek into one of the above VCs...

## A quick reference guide to vConTACT2

### What is a "VC"?

A Viral Cluster is a unit of classification. vConTACT2 uses two terms - with subtle differences - in order to classify sequences.

A VC is a first-pass classification of viral genomes. This frequently represents a group of genomes within the same genus. An example is VC_115.

A VC Subcluster is a second-pass classification. It uses a pre-calculated distance calculation to refine the VCs. These are high-confidence, genus-level groupings. It is common for there to be no change between VC and VC Subcluster. An example is VC_115_0. The final value ("0") represents the subcluster within the original VC. If all members of a VC Subcluster have VC_xx_0, then there was no change. However, if there were further refinements, then the VC Subcluster would be: VC_115_0, VC_115_1, VC_115_2...

### VC Status

How vConTACT2 describes how it classified a sequence

Clustered: Genomes "successfully" placed into a genus-level group alongside at least one other genome.

Singleton: Genome was not found to be related to any other genome in the dataset. Most likely reason? Very weak or no overlap with any other genes found on any other genome. How to fix? Add more related genomes.

Clustered/Singleton: Genome was initially clustered, but distance-based optimization identified its placement in the cluster as not genus-level. However, no other genomes were found to be within the same "subcluster" as this genome, and resulted in the genome being stranded, without another member. For these genomes, you can look at its "VC" to see distantly related members. So a viral genome that is Clustered/Singleton (VC_221_1 or VC_221_2) is related to other VC_221 members, but vConTACT2 does not have confidence that these genomes are related at the genus level.

As an aside, VC_221 contains 6 genomes. Four (4) are in VC_221_0, with the other two in VC_221_1 and VC_221_2. Both those genomes are related to the four members of 221_0, but not at the genus level.

Overlap (VC_NN/VC_XX): Genomes identified as sharing significant portions of its gene content with multiple VCs. In other words, vConTACT2 cannot confidently assign it to one OR the other VC. This is incredibly common for viral groups that undergo extensive recombination.

Outlier: Genomes clustered by ClusterONE (a tool used internally by vConTACT2) but were not strongly connected to the other VC members. It is common for these genomes to share a single gene or two to the other members of its closest VC, however the other members in that VC are likely be be sharing 20, 30 or 50+ genes. It is not only unlikely that that particular genome is related at the genus level, but could perhaps be a spurious shared gene and is unlikely to be related at anything lower than family or order.

## Conclusions

In summary, we've processed a viral metagenome from public reads available on SRA, identified contigs from the assembled sequence data as putative viruses using VirSorter, and classified them in approximately genus-level clusters with vConTACT2. This analysis revealed 8 VCs where environmental sequence data was found associated with references, and we can have confidence that those sequences at related to those references at the genus level.

Additionally, we've seen a few larger clusters with no associations to reference sequences that could be further investigated using KBase tools. For example - align those genomes in those VCs, identify shared features, extract those features, and make a discovery about certain proteins found throughout your dataset. Or, go one step further and pull in JGI data and then compare against a variety of JGI datasets for global significance - all using existing KBase apps!

### Thanks for following along and using these apps for your research!

• Roux S, Enault F, Hurwitz BL, Sullivan MB. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3: e985. doi:10.7717/peerj.985
• The original VirSorter paper
• Bolduc B, Jang HB, Doulcier G, You Z-QZ, Roux S, Sullivan MB. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ. 2017;5: e3243. doi:10.7717/peerj.3243
• The initial vConTACT paper describing the theory and background

• Jang HB, Bolduc B, Zablocki O, Kuhn JH, Roux S, Adriaenssens EM, et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019;37: 632–639. doi:10.1038/s41587-019-0100-8
• The significantly improved version of vConTACT that's faster, more accurate, and capable of handling larger datasets

## Known Issues & Update History

Known Issues

• VirSorter may generate a "Bad KBase object" error. Can be safely ignored
• VirSorter DOES NOT LIKE to use Diamond unless adding trusted genomes
• vConTACT2 can only work on a single annotated Genome and Assembly object

Update History (MM-DD-YYYY)

11-03-2020

• Revised introduction
• Significantly expanded vConTACT2 results section
• Added vConTACT2 quick reference guide
• Expanded conclusion section

10-26-2020

• Initial draft of tutorial built

## Feedback & Helpdesk

If you have a question about one of our apps, need to report a bug or have another system-related query, please join our Help Board and post a ticket. Learn about how to do this here: http://kbase.us/help-board.

### Released Apps

1. Annotate Assembly and Re-annotate Genomes with Prokka(v1.12)
• Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30: 2068 2069. doi:10.1093/bioinformatics/btu153
• Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017; 27:824 834. doi: 10.1101/gr.213959.116
3. Assess Read Quality with FastQC - v0.11.5
• FastQC source: Bioinformatics Group at the Babraham Institute, UK.
4. Extract Bins as Assemblies from BinnedContigs - v1.0.2
• Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
5. Filter Assembled Contigs by Length - v1.1.2
• Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
6. Import SRA File as Reads From Web - v1.0.7
• Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
7. Modify Bins in BinnedContigs - v1.0.2
• Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
8. Trim Reads with Trimmomatic - v0.36
• Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114 2120. doi:10.1093/bioinformatics/btu170
9. VirSorter 1.0.5
• Roux S, Enault F, Hurwitz BL, Sullivan MB. (2015). VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985.

### Apps in Beta

1. vConTACT2 0.9.19
• Bin Jang, H., Bolduc, B., Zablocki, O., Kuhn, J. H., Roux, S., Adriaenssens, E. M., Sullivan, M. B. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nature Biotechnology. 2019;37(6): 632 639. https://doi.org/10.1038/s41587-019-0100-8