Generated December 3, 2020

Build and Visualize a Pangenome

Student

Authors: Ellen Dow and Carlos Goller

Topics in Biology Course Applications for KBase

Synopsis: This module introduces students to the concept of pangenomes. A pangenome is useful for studying sets of genomes to learn about "core" and "accessory" genes (Rouli et al. 2015). Tools to build and visualize a pangenome are needed to begin identifying core components and accessory elements. Here is a useful review.

Audience

  • Undergraduate Students
  • Graduate Students

Learning Goals

At the end of this module, you should be able to:

  • Explain annotation (which is also covered in the Genome and Metagenome Modules).
  • Define the concept of pangenome.
  • Explain why visualizing a pangenome is useful and interpret representative examples of pangenomes.
  • List the main objects and main steps in the process of building a pangenome.

Graduate level

  • Identify quality control steps in the process of building pangenomes.
  • Evaluate limitations of pangenome representations.

Biological Topics and Concepts

  • taxonomy
  • pangenome
  • variance in assembly and annotation

Activity Description

This Narrative is an introduction to the workflow of building and visualizing a pangenome. Participants will build a pangenome from a series of available Staphylococcus aureus genomes with the goal of visualizing core and accessory elements.

Pangenome Workflow

  1. Build and Visualize a Pangenome

Optional Adventures

  1. Comparing Features
  2. Phylogenomics

Version

v1.0 (7 Oct 2020): Student
v0.9 (23 Sept 2020): Fall 2020 Semester
v0.1 (21 Aug 2020): Drafting

So, what makes a pangenome?

Overview

A pangenome represents all genes found within a collection of related organisms, grouped by how similar sequences are to one another, also referred to as sequence homology. One primary purpose of creating a pangenome is to distinguish which genes are orthologs - vertically inherited genes - and which genes arose from duplication events.

To learn more about how organisms are related, or even how specific genes came about, we can use pangenomes to examine similarities and differences across a collection of genomes. Part of the theory behind pangenomes is the existence of a core genome, which is consistent across all strains or species, and a flexible (non-core) genome, where variation exists.
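The core and flexible genome split can be illustrated with plain set operations. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family IDs. The strain names and family IDs are hypothetical toy data, not output from any KBase App.

```python
def split_core_accessory(gene_families_by_genome):
    """Return (core, accessory) sets of gene-family IDs.

    core      = families present in every genome
    accessory = families present in at least one genome, but not all
    """
    family_sets = [set(f) for f in gene_families_by_genome.values()]
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)          # every family seen anywhere
    core = set.intersection(*family_sets)    # families shared by all genomes
    return core, pan - core


# Hypothetical toy data: three strains and the families each one carries.
example = {
    "strain_A": {"famA", "famB", "famC"},
    "strain_B": {"famA", "famB", "famD"},
    "strain_C": {"famA", "famB"},
}

core, accessory = split_core_accessory(example)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In this toy example, famA and famB form the core genome because all three strains carry them, while famC and famD are accessory.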

Evolution

There are several theories on how organisms evolved and how we can see this through phenotypes and pangenomes.


Importing genomes

We'll first need to pull together a set of genomes. In this case, we are continuing our exploration of Staphylococcus aureus from the Genome Modules and will import both the MRSA and MSSA strains.

Public Data

We are continuing to work with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data was generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project - a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.

Importing Data

From Genome Module Part 4

1) In the upper left hand panel, under "DATA", click the red "+" button - this is "Add Data".

2) A list of all the data contained in your narratives should pop up. Select the genome object (Staphylococcus aureus), and click the blue "< copy" button that appears to import it into this narrative.

3) Click on the name of the object to add it to the Narrative below. Look to make sure that you're using the right data!

4) Repeat with the other genome.

If starting from scratch, import these two genomes from Public Data:

  • Staphylococcus aureus MRSA strain GCF_000187165.1
  • Staphylococcus aureus MSSA strain GCF_000684475.1

Alternatively, import an already created GenomeSet and skip the step that creates the GenomeSet.

Create a set of Staphylococcus genomes

To build a pangenome, we will gather genomes related to Staphylococcus aureus. Group the Methicillin-resistant Staphylococcus aureus (MRSA) and Methicillin-sensitive Staphylococcus aureus (MSSA) strain genomes together with the Batch Create GenomeSet App. Set the Output GenomeSet Name to describe the output object.

Next, we will find closely related genomes with the Insert Set of Genomes into SpeciesTree App, which uses FastTree to quickly infer relationships among similar sequences. As in the Phylogenetics Module, we are gathering similar sequences and creating an object that contains them so that we can start building our pangenome. Choose the GenomeSet as the input object and set the neighbor public genome count to between 30 and 35. Give the outputs short but descriptive names.


Putting together a pangenome

Now that we have a SpeciesTree, let's take a look at where our genomes fall within the tree and what species are present.

Double click on the Tree object from the output in the Data Panel to view the tree.

Questions to answer:

Q1) How many genomes of Staphylococcus aureus are present?

Q2) Are genera other than Staphylococcus present? If so, which genera?

Remove non-Staphylococcus genera

If any genera other than Staphylococcus appear in the output GenomeSet, we need to remove those genomes from the data set. Toggle to Beta Apps by clicking on the R in the APPS panel. Search for the Remove Genomes from GenomeSet App and open it. Select the GenomeSet, remove individual genomes from it by name, and click Run.
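When the tree contains many genomes, it can help to tally genera before deciding what to remove. The sketch below assumes you have typed or copied the scientific names from the tree into a Python list. The list shown is a hypothetical example, and the first word of each name is treated as the genus.

```python
from collections import Counter


def genus_of(scientific_name):
    """Return the first word of a scientific name, treated here as the genus."""
    parts = scientific_name.split()
    return parts[0] if parts else ""


def tally_genera(names):
    """Count how many genomes belong to each genus."""
    return Counter(genus_of(n) for n in names)


def off_target(names, keep_genus="Staphylococcus"):
    """List the genomes whose genus differs from the one we want to keep."""
    return [n for n in names if genus_of(n) != keep_genus]


# Hypothetical names copied by hand from a SpeciesTree view.
names = [
    "Staphylococcus aureus MRSA177",
    "Staphylococcus aureus subsp. aureus",
    "Staphylococcus epidermidis",
    "Macrococcus caseolyticus",
]

genus_counts = tally_genera(names)
to_remove = off_target(names)
print(genus_counts)
print("remove:", to_remove)
```

The names in `to_remove` are the ones you would then remove by name in the Remove Genomes from GenomeSet App.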


Annotating your genome set

Genomes from RefSeq do not have NCBI annotations. To keep using KBase tools, we must perform gene functional annotation of all of the Genome objects using RAST. Run Annotate Multiple Microbial Genomes with RASTtk to do this for the whole GenomeSet.

Use the Annotate Multiple Microbial Genomes App, which uses the RAST pipeline to annotate each genome. This will result in the output of a "genome" object. In KBase, a genome is defined as an object describing the genes and other genetic elements encoded within an organism, not just the raw sequence of the genome which came from our assembly. Required options in the annotation tool are indicated with a red line - be sure to specify the correct assembly and the scientific name of the organism, and give the output an informative "Genome object" name at the bottom.

Note: This might take a little while depending on the queue, but should be less than 1 hour.


Building a pangenome

There are two methods for building a pangenome in KBase. Both Apps run sequence homology calculations to generate a Pangenome object that can be used for further analysis. Build Pangenome with OrthoMCL uses a Markov Cluster-based algorithm to group predicted orthologs and paralogs, while Compute Pangenome is a rapid analysis that groups genes based on k-mers. One method is much faster than the other, but it might not offer the same resolution.
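To get a feel for what "grouping based on k-mers" means, the sketch below compares two protein sequences by the fraction of k-mers they share (Jaccard similarity). This is an illustration of the general idea only, not the algorithm used inside the Compute Pangenome App. The sequences and the choice of k = 3 are hypothetical.

```python
def kmers(sequence, k=3):
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Shared k-mers divided by all distinct k-mers across both sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0  # both sequences are shorter than k
    return len(a & b) / len(a | b)


# Hypothetical short peptides that differ only in the last residue.
similarity = jaccard("MKTAYIAK", "MKTAYIAR")
print(round(similarity, 3))
```

Comparing k-mer sets avoids a full alignment, which is why approaches of this kind tend to be fast. The trade-off is that short shared words are a coarser signal of homology than an alignment score.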

Questions to answer:

Q3) Which method would you use and why?

Q4) When would it be useful to use the method that you did not choose?

Build with OrthoMCL

Open the Build Pangenome with OrthoMCL App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

Compute Pangenome

Open the Compute Pangenome App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

While these are running, move to the next step.


Repeat for a second, smaller pangenome

This time, run OrthoMCL again with only the sequences that are within the Staphylococcus aureus clade.

This set will be smaller than the prior version. Use Build GenomeSet to define the list of genomes to include. Here, we'll just select the annotated Staphylococcus aureus genomes. Be sure to set the "Output Objects" name to specify this.

Run the Build Pangenome with OrthoMCL App with the Staphylococcus aureus GenomeSet as the input. This will cluster genes from the strains into groups.

Then run the Compute Pangenome App. Choose the same Staphylococcus aureus GenomeSet as the input object and give a descriptive name for the output pangenome.


Check out your pangenomes

Double click on the output object to see an Overview and take a quick look at the Genome Comparison and Families.

Compare the outputs between the Build Pangenome with OrthoMCL App and Compute Pangenome App.

Questions to answer:

Q5) How do the outputs vary between the two pangenome methods? Are they the same? Why or why not?

Q6) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?

Q7) What is the number of translated genes?

Q8) What is the number of genes in homolog families?

Q9) What is the number of genes in singleton families?

Q10) What does the OrthoMCL App do? Why?
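The counts asked about in Q7 to Q9 can be understood with a small example. The sketch below assumes that a "singleton family" contains exactly one gene and a "homolog family" contains two or more. That definition, the family IDs, and the gene IDs are assumptions made for illustration, so compare the result against what the Pangenome viewer reports rather than treating it as the viewer's own calculation.

```python
def summarize_families(families):
    """Summarize gene counts for a mapping of family ID -> list of gene IDs."""
    homolog = {fid: genes for fid, genes in families.items() if len(genes) > 1}
    singleton = {fid: genes for fid, genes in families.items() if len(genes) == 1}
    return {
        "total_genes": sum(len(genes) for genes in families.values()),
        "homolog_families": len(homolog),
        "genes_in_homolog_families": sum(len(genes) for genes in homolog.values()),
        "singleton_families": len(singleton),
        # Each singleton family holds one gene, so the two counts match.
        "genes_in_singleton_families": len(singleton),
    }


# Hypothetical clustering result: three families across three strains.
families = {
    "fam1": ["strainA_g1", "strainB_g1", "strainC_g1"],
    "fam2": ["strainA_g2", "strainB_g2"],
    "fam3": ["strainC_g3"],
}

summary = summarize_families(families)
for key, value in summary.items():
    print(key, value)
```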

Comprehension questions

CQ1) Why might creating a pangenome be helpful in analyzing your data?

CQ2) If focusing on a single species, how would you know that you have all possible genomes?

Visualization

Another way to analyze pangenomes is visually. It can be helpful to look at the same analysis and data in a few different ways to better understand it. We took a look at the raw outputs and compared numbers; now let's create a visual interpretation of the pangenome.

Pangenome CirclePlot

What does a CirclePlot tell us?

Ideal

Clostridia Example from Dylan Chivian