Build and Visualize a Pangenome¶

Student

Authors: Ellen Dow and Carlos Goller

Topics in Biology Course Applications for KBase¶

Synopsis: This module introduces students to the concept of pangenomes. A pangenome is useful for studying sets of genomes to learn about "core" and "accessory" genes (Rouli et al. 2015). Tools to build and visualize a pangenome are needed to begin identifying core components and accessory elements. Here is a useful review.

Audience¶

Undergraduate Students
Graduate Students

Learning Goals¶

At the end of this module, you should be able to:

Explain annotation (which is also covered in Genome and Metagenome Modules)
Define the concept of pangenome.
Explain why visualizing a pangenome is useful and interpret representative examples of pangenomes.
List the main objects and main steps in the process of building a pangenome.

Graduate level

Identify quality control steps in the process of building pangenomes
Evaluate limitations of pangenome representations.

Biological Topics and Concepts¶

taxonomy
pangenome
variance in assembly and annotation

Activity Description¶

This Narrative is an introduction to the workflow of building and visualizing a pangenome. Participants will build a pangenome from a series of available Staphylococcus aureus genomes with the goal of visualizing core and accessory elements.

Pangenome WorkFlow

Build and Visualize a Pangenome

Optional Adventures

Version¶

v1.0 (7 Oct 2020): Student
v0.9 (23 Sept 2020): Fall 2020 Semester
v0.1 (21 Aug 2020): Drafting

So, what makes a pangenome?¶

Overview¶

A pangenome represents all genes found within a collection of related organisms, grouped by how similar their sequences are to one another, also referred to as sequence homology. One primary purpose of creating a pangenome is to distinguish which genes are orthologs (vertically inherited genes) and which genes arose from duplication events (paralogs).

To learn more about how organisms are related or even how specific genes came about, we can use pangenomes to examine similarities and differences across a collection of genomes. Part of the theory behind pangenomes is the existence of a core genome, which is consistent across all strains or species, and a flexible (non-core) genome, where variation exists.
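The core versus flexible split can be illustrated with a minimal Python sketch. The family names, genome labels, and the function `split_core_accessory` are hypothetical toy values chosen for illustration, not output from KBase; the sketch assumes a family is "core" only when every genome in the set carries at least one member.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    families: Dict[str, Set[str]], genomes: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Partition gene families into core (present in every genome) and accessory."""
    core = {fam for fam, members in families.items() if members >= genomes}
    accessory = set(families) - core
    return core, accessory


# Hypothetical toy data: family name -> genomes carrying at least one member.
genomes = {"MRSA", "MSSA", "strain_3"}
families = {
    "dnaA": {"MRSA", "MSSA", "strain_3"},
    "gyrB": {"MRSA", "MSSA", "strain_3"},
    "mecA": {"MRSA"},
    "phage_int": {"MSSA", "strain_3"},
}

core, accessory = split_core_accessory(families, genomes)
print("core families:", sorted(core))
print("accessory families:", sorted(accessory))
```

In this toy set, the two families found in all three genomes form the core, and the remaining families make up the flexible genome.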

Evolution¶

There are several theories on how organisms evolved and how we can see this through phenotypes and pangenomes.

Resources:

Why prokaryotes have pangenomes
- also has a good schematic of a venn diagram
What is speciation?
Biological Species Are Universal across Life’s Domains
- good definitions of core genomes
The Landscape of Realized Homologous Recombination in Pathogenic Bacteria
- background on pathogenic strains
The Origins of Genome Complexity

Importing genomes¶

We'll first need to pull together a set of genomes. In this case, we are continuing our exploration of Staphylococcus aureus from the Genome Modules and will import both the MRSA and MSSA strains.

Public Data¶

We are continuing to work with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data was generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project, a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.

Importing Data¶

From Genome Module Part 4

1) In the upper left hand panel, under "DATA", click the red "+" button - this is "Add Data".

2) A list of all the data contained in your narratives should pop up. Select the genome object ( Staphylococcus aureus ), and click the blue "< copy" button that appears to import it into this narrative.

3) Click on the name of the object to add it to the Narrative below. Look to make sure that you're using the right data!

4) Repeat with the other genome.

If starting from scratch, import these two genomes from Public Data:

Staphylococcus aureus MRSA strain GCF_000187165.1
Staphylococcus aureus MSSA strain GCF_000684475.1

ALTERNATIVELY, import an already created GenomeSet and skip the step to create genome set.

Create a set of Staphylococcus genomes¶

To build a pangenome, we will gather genomes related to Staphylococcus aureus. Group the Methicillin-resistant Staphylococcus aureus (MRSA) and Methicillin-sensitive Staphylococcus aureus (MSSA) strain genomes together with the Batch Create GenomeSet App. Set the Output GenomeSet Name to describe the output object.

Next, we will find closely related genomes with the Insert Set of Genomes into SpeciesTree App, which uses FastTree to quickly estimate relationships among similar sequences. As in the Phylogenetics Module, we are gathering similar sequences together into a single object to start building our pangenome. Choose the GenomeSet as the input object and set the neighbor public genome count to between 30 and 35. Give the outputs short but descriptive names.

Putting together a pangenome¶

Now that we have a SpeciesTree, let's take a look at where our genomes fall within the tree and what species are present.

Double click on the Tree object from the output in the Data Panel to view the tree.

Questions to answer:¶

Q1) How many genomes of Staphylococcus aureus are present?

Q2) Are genera other than Staphylococcus present? If so, which genera?

Remove non-Staphylococcus genera¶

If there are any genera other than Staphylococcus in the output GenomeSet, we will need to remove these genomes from the data set. Toggle to Beta Apps by clicking on the R in the APPS panel. Search for the Remove Genomes from GenomeSet App and open it. Select the GenomeSet and remove individual genomes by name from the GenomeSet. Click Run.
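Before removing genomes in the App, it can help to list which names fall outside the genus. The following is a minimal sketch in Python, not part of the KBase workflow: the function `split_by_genus` and the example names are hypothetical, and it assumes the genus is the first word of each scientific name.

```python
from typing import List, Tuple


def split_by_genus(
    scientific_names: List[str], genus: str = "Staphylococcus"
) -> Tuple[List[str], List[str]]:
    """Separate names whose first word matches the genus from all others."""
    kept, removed = [], []
    for name in scientific_names:
        words = name.split()
        if words and words[0] == genus:
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed


# Hypothetical names as they might appear in a SpeciesTree output.
names = [
    "Staphylococcus aureus MRSA177",
    "Staphylococcus epidermidis",
    "Macrococcus caseolyticus",
]

kept, removed = split_by_genus(names)
print("keep:", kept)
print("remove from GenomeSet:", removed)
```

The names in the second list are the ones to remove by hand in the Remove Genomes from GenomeSet App.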

Annotating your genome set¶

Genomes from RefSeq do not have NCBI annotations. To keep using KBase tools, we must perform gene functional annotation of all of the Genome objects using RAST. Run Annotate Multiple Microbial Genomes with RASTtk to do this for the whole GenomeSet.

Use the Annotate Multiple Microbial Genomes App, which uses the RAST pipeline to annotate each genome. This will result in the output of a "genome" object. In KBase, a genome is defined as an object describing the genes and other genetic elements encoded within an organism, not just the raw sequence of the genome which came from our assembly. Required options in the annotation tool are indicated with a red line. Be sure to specify the correct assembly and the scientific name of the organism, and give the output an informative "Genome object" name at the bottom.

Note: This might take a little while depending on the queue, but should be less than 1 hour.

Building a pangenome¶

There are two ways to build a pangenome in KBase. Both Apps run sequence homology calculations to generate a Pangenome object to use for further analysis. OrthoMCL uses a Markov Cluster-based algorithm to group predicted orthologs and paralogs, while Compute Pangenome is a rapid analysis that groups genes based on k-mers. While one method is much faster, it might not have the same resolution.
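To give a feel for what "grouping based on k-mers" means, here is a minimal Python sketch that scores two sequences by the fraction of k-mers they share (Jaccard similarity). This is an illustration only, under the assumption that shared k-mers stand in for sequence similarity; it is not the algorithm used by the Compute Pangenome App, and the function names and example sequences are hypothetical.

```python
from typing import Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all distinct k-mers in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: nothing to compare
    return len(ka & kb) / len(ka | kb)


# Hypothetical short protein fragments.
identical = jaccard("MKTAYIAK", "MKTAYIAK")
partial = jaccard("ABCD", "BCDE")
unrelated = jaccard("AAAA", "CCCC")
print(identical, partial, unrelated)
```

A k-mer comparison like this avoids a full alignment, which is one reason k-mer methods tend to be fast, at the possible cost of resolution for more divergent sequences.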

Questions to answer:¶

Q3) Which method would you use and why?

Q4) When would it be useful to use the method that you did not choose?

Build with OrthoMCL¶

Open the Build Pangenome with OrthoMCL App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

Compute Pangenome¶

Open the Compute Pangenome App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

While these are running, move to the next step.

Repeat for a second, smaller pangenome¶

This time, run OrthoMCL again with only the sequences that are within the Staphylococcus aureus clade.

This set will be smaller than the prior version. Use Build GenomeSet to define the list of genomes to include. Here, we'll just select the annotated Staphylococcus aureus genomes. Be sure to set the "Output Objects" name to specify this.

Run the Build Pangenome with OrthoMCL App with the Staphylococcus aureus GenomeSet as the input. This will cluster genes from the strains into groups.

Then run the Compute Pangenome App. Choose the annotated GenomeSet as the input object and give a descriptive name for the output pangenome.

Check out your pangenomes¶

Double click on the output object to get an Overview and a quick glance at the Genome Comparison and Families.

Compare the outputs between the Build Pangenome with OrthoMCL App and Compute Pangenome App.

Questions to answer:¶

Q5) How do the outputs vary between the two pangenome methods? Are they the same? Why or why not?

Q6) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?

Q7) What is the number of translated genes?

Q8) What is the number of genes in homolog families?

Q9) What is the number of genes in singleton families?

Q10) What does the OrthoMCL App do? Why?
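The quantities asked about in Q7 to Q9 can be illustrated on toy data. This minimal Python sketch assumes that a "homolog family" is a cluster with more than one gene and a "singleton family" is a cluster with exactly one gene; the gene IDs, family names, and the function `summarize_families` are hypothetical.

```python
from typing import Dict, List


def summarize_families(families: Dict[str, List[str]]) -> Dict[str, int]:
    """Count genes and families, split into homolog (>1 gene) and singleton (1 gene)."""
    homolog = [genes for genes in families.values() if len(genes) > 1]
    singleton = [genes for genes in families.values() if len(genes) == 1]
    return {
        "total_genes": sum(len(genes) for genes in families.values()),
        "genes_in_homolog_families": sum(len(genes) for genes in homolog),
        "genes_in_singleton_families": len(singleton),
        "homolog_families": len(homolog),
        "singleton_families": len(singleton),
    }


# Hypothetical clustering result: family name -> gene IDs in that family.
toy_families = {
    "fam1": ["MRSA_0001", "MSSA_0001", "MSSA_0457"],
    "fam2": ["MRSA_0002", "MSSA_0002"],
    "fam3": ["MRSA_0900"],
}

summary = summarize_families(toy_families)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Comparing these counts between the two methods, or between the larger and smaller genome sets, is one way to put numbers on the differences you see in the Overview tab.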

Comprehension questions¶

CQ1) Why might creating a pangenome be helpful in analyzing your data?

CQ2) If focusing on a single species, how would you know that you have all possible genomes?

Visualization¶

Another way to analyze pangenomes is visually. It can be helpful to look at the same analysis and data in a few different ways to better understand it. We took a look at the raw outputs and compared numbers; now let's create a visual interpretation of the pangenome.

Pangenome CirclePlot¶

What does a CirclePlot tell us?

Ideal¶

Clostridia Example from Dylan Chivian

[Figure: Pangenome_fromDylan.png]

Reality¶

What does your own Pangenome look like?

Another way to look at it¶

Create a circle plot of your pangenome¶

To run this App successfully, the base genome needs to have the same permanent ID as what is listed within the pangenome. If permanent IDs of input objects do not match, the App will not run successfully.

1) Open the Pangenome Circle Plot App.

2) Select the Pangenome object and Base genome to visualize the Pangenome. Ensure the base genome is within the Narrative and has a matching permanent object ID. Select the "DO save feature sets" parameter.

3) Click Run.
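The ID requirement above can be pictured as a simple membership check. The sketch below is a hedged Python illustration, not the App's actual logic: it assumes object references are strings of the form workspace/object/version, that the Pangenome records the references of the genomes it was built from, and that an exact string match is required. The function name and the reference values are hypothetical.

```python
from typing import List


def base_genome_matches(base_ref: str, pangenome_genome_refs: List[str]) -> bool:
    """Return True if the base genome reference is one of the pangenome's genomes."""
    return base_ref.strip() in {ref.strip() for ref in pangenome_genome_refs}


# Hypothetical references in an assumed "workspace/object/version" form.
pangenome_refs = ["12345/10/1", "12345/11/1", "12345/12/2"]

ok = base_genome_matches("12345/11/1", pangenome_refs)
mismatch = base_genome_matches("67890/3/1", pangenome_refs)
print("matching base genome:", ok)
print("copied genome with a different ID:", mismatch)
```

A genome that was copied or re-imported into the Narrative can carry a different reference than the one stored in the Pangenome, which is the situation the second check represents.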

Analyzing your pangenomes¶

Questions to answer:¶

Q13) How do the outputs vary between the two pangenome methods? Are they the same?

Q14) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?

Optional analysis

Add another Pangenome Circle Plot App cell and run with a different base genome to compare the two circle plots.¶

Questions to answer:¶

Q15) How does the circle plot change when using a different base genome?

Q16) Why do you think the plots are different?


Choose Your Next Adventure

Released Apps

Apps in Beta