Authors: Ellen Dow and Carlos Goller
Synopsis: This module introduces students to the concept of pangenomes. A pangenome is useful for studying sets of genomes to learn about "core" and "accessory" genes (Rouli et al. 2015). Tools to build and visualize a pangenome are needed to begin identifying core components and accessory elements. Here is a useful review.
At the end of this module, you should be able to:
This Narrative is an introduction to the workflow of building and visualizing a pangenome. Participants will build a pangenome from a series of available Staphylococcus aureus genomes with the goal of visualizing core and accessory elements.
v1.0 (7 Oct 2020): Student
v0.9 (23 Sept 2020): Fall 2020 Semester
v0.1 (21 Aug 2020): Drafting
A pangenome represents all genes found within a collection of related organisms, grouped by how similar their sequences are to one another, also referred to as sequence homology. One primary purpose of creating a pangenome is to distinguish which genes are orthologs (vertically inherited genes) and which are paralogs (genes that arose from duplication events).
To learn more about how organisms are related, or even how specific genes came about, we can use pangenomes to examine similarities and differences across a collection of genomes. Part of the theory behind pangenomes is that a genome collection splits into two parts: a core genome, which is consistent across all strains or species, and a flexible (non-core) genome, where variation exists.
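The split between a core and a flexible genome can be illustrated with a few lines of Python. This is a minimal sketch, not part of the KBase workflow: the strain names and gene family IDs are hypothetical toy data, and `split_pangenome` is an illustrative helper. It treats each genome as a set of gene families, takes the union as the pangenome, and takes the intersection as the core.

```python
# Hypothetical toy data: gene family IDs observed in each genome.
genomes = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam3", "fam5"},
}


def split_pangenome(family_sets):
    """Return (pangenome, core, accessory) as sets of family IDs."""
    sets = list(family_sets.values())
    pan = set().union(*sets)         # every family seen in any genome
    core = set.intersection(*sets)   # families present in every genome
    accessory = pan - core           # families missing from at least one genome
    return pan, core, accessory


pan, core, accessory = split_pangenome(genomes)
print("pangenome size:", len(pan))
print("core families:", sorted(core))
print("accessory families:", sorted(accessory))
```

Adding more genomes to the dictionary can only shrink the core or leave it unchanged, while the pangenome can only grow or stay the same.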
There are several theories on how organisms evolved and how we can see this through phenotypes and pangenomes.
Resources to hyperlink:
We'll first need to pull together a set of genomes. In this case, we are continuing our exploration of Staphylococcus aureus from the Genome Modules and will import both the MRSA and MSSA strains.
We are continuing to work with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data were generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project - a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.
From Genome Module Part 4
1) In the upper left hand panel, under "DATA", click the red "+" button - this is "Add Data".
2) A list of all the data contained in your narratives should pop up. Select the genome object (Staphylococcus aureus), and click the blue "< copy" button that appears to import it into this narrative.
3) Click on the name of the object to add it to the Narrative below. Look to make sure that you're using the right data!
4) Repeat with the other genome.
If starting from scratch, import these two genomes from Public Data:
ALTERNATIVELY, import an already created GenomeSet and skip the step to create genome set.
To build a pangenome, we will gather genomes related to Staphylococcus aureus. Group the Methicillin-resistant Staphylococcus aureus (MRSA) and Methicillin-sensitive Staphylococcus aureus (MSSA) strain genomes together with the Batch Create GenomeSet App. Set the Output GenomeSet Name to describe the output object.
Next, we will find closely related genomes with the Insert Set of Genomes into SpeciesTree App, which uses FastTree to quickly estimate the relationships among similar sequences. As in the Phylogenetics Module, we are gathering similar sequences into a single object to start building our pangenome. Choose the GenomeSet as the input object and set the neighbor public genome count to between 30 and 35. Give each output a short but descriptive name.
Now that we have a SpeciesTree, let's take a look at where our genomes fall within the tree and what species are present.
Double click on the Tree object from the output in the Data Panel to view the tree.
Q1) How many genomes of Staphylococcus aureus are present?
Q2) Are genera other than Staphylococcus present? If so, which genera?
If there are any genera other than Staphylococcus in the output GenomeSet, we will need to remove these genomes from the data set. Toggle to Beta Apps by clicking on the R in the APPS panel. Search for the Remove Genomes from GenomeSet App and open it. Select the GenomeSet and remove individual genomes by name. Click run.
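When the tree contains many genomes, it can help to list which names fall outside the genus before removing them in the App. The sketch below is a hedged illustration only: it does not call any KBase API, `keep_genus` is a hypothetical helper, it assumes the genus is the first word of each scientific name, and the names other than MRSA177 are made-up placeholders.

```python
def keep_genus(names, genus="Staphylococcus"):
    """Split genome names into (kept, removed) by the first word of each name.

    Assumption: the genus is the first whitespace-separated token.
    """
    kept, removed = [], []
    for name in names:
        tokens = name.split()
        first = tokens[0] if tokens else ""
        if first == genus:
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed


# Hypothetical names copied by hand from a SpeciesTree output.
tree_names = [
    "Staphylococcus aureus MRSA177",
    "Staphylococcus epidermidis example_strain",
    "Macrococcus caseolyticus example_strain",
]

kept, removed = keep_genus(tree_names)
print("keep:", kept)
print("remove in the App:", removed)
```

The `removed` list is the set of names to enter in the Remove Genomes from GenomeSet App.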
Genomes from RefSeq do not have NCBI annotations. To keep using KBase tools, we must perform gene functional annotation of all of the Genome objects using RAST. Run Annotate Multiple Microbial Genomes with RASTtk to do this for the whole GenomeSet.
Use the Annotate Multiple Microbial Genomes App, which uses the RAST pipeline to annotate each genome. This will result in the output of a "genome" object. In KBase, a genome is defined as an object describing the genes and other genetic elements encoded within an organism, not just the raw sequence of the genome which came from our assembly. Required options in the annotation tool are indicated with a red line - be sure to specify the correct assembly, the scientific name of the organism, and give it an informative "Genome object" name at the bottom.
Note: This might take a little while depending on the queue, but should be less than 1 hour.
There are two methods for building a pangenome in KBase. Both Apps run sequence homology calculations to generate a Pangenome object that can be used for further analysis. OrthoMCL uses a Markov Cluster-based algorithm to group predicted orthologs and paralogs, while Compute Pangenome is a rapid analysis that groups genes based on k-mers. While one method is much faster, it might not have the same resolution.
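To get a feel for what "grouping based on k-mers" means, the sketch below compares two sequences by the overlap of their k-mer sets (Jaccard similarity). This is an illustration of the general idea only, not the algorithm that the Compute Pangenome App actually runs; the function names, the choice of k = 3, and the short example sequences are all assumptions made for the example.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: nothing to compare
    return len(ka & kb) / len(ka | kb)


# Hypothetical short protein fragments differing at one position.
a = "MKTAYIAK"
b = "MKTAYLAK"

print("identical:", jaccard(a, a))
print("one substitution:", round(jaccard(a, b), 3))
print("unrelated:", jaccard("AAAA", "CCCC"))
```

A single substitution changes every k-mer that overlaps it, which is one reason a k-mer comparison is fast but can be less sensitive to distant homologs than an alignment-based approach.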
Q3) Which method would you use and why?
Q4) When would it be useful to use the method that you did not choose?
Open the Build Pangenome with OrthoMCL App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.
Open the Compute Pangenome App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.
While these are running, move to the next step.
This time, run OrthoMCL again with only the sequences that are within the Staphylococcus aureus clade.
This set will be smaller than the prior version. Use Build GenomeSet to define the list of genomes to include. Here, we'll just select the annotated Staphylococcus aureus genomes. Be sure to set the "Output Objects" name to specify this.
Run the Build Pangenome with OrthoMCL App with the Staphylococcus aureus GenomeSet as the input. This will cluster genes from the strains into groups.
Then run the Compute Pangenome App. Choose the annotated GenomeSet as the input object and give a descriptive name for the output pangenome.
Double click on the output object to get an Overview and a quick glance at the Genome Comparison and Families.
Q5) How do the outputs vary between the two pangenome methods? Are they the same? Why or why not?
Q6) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?
Q7) What is the number of translated genes?
Q8) What is the number of genes in homolog families?
Q9) What is the number of genes in singleton families?
Q10) What does the OrthoMCL App do? Why?
CQ1) Why might creating a pangenome be helpful in analyzing your data?
CQ2) If focusing on a single species, how would you know that you have all possible genomes?
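The counts asked about in Q7 through Q9 are related to one another, and a small worked example may make the relationship clearer. The sketch below assumes a hypothetical pangenome stored as a dictionary from family ID to member genes (this is not the KBase Pangenome object format, and `summarize_families` is an illustrative helper). A family with a single member is a singleton family; a family with two or more members is a homolog family.

```python
# Hypothetical toy pangenome: family ID -> genes assigned to that family.
families = {
    "famA": ["g1_geneX", "g2_geneX", "g3_geneX"],
    "famB": ["g1_geneY", "g2_geneY"],
    "famC": ["g3_geneZ"],
}


def summarize_families(fams):
    """Count genes and families, split into homolog and singleton families."""
    sizes = [len(members) for members in fams.values()]
    singleton_genes = sum(1 for size in sizes if size == 1)
    total_genes = sum(sizes)
    return {
        "total_genes": total_genes,
        "genes_in_homolog_families": total_genes - singleton_genes,
        "genes_in_singleton_families": singleton_genes,
        "homolog_families": sum(1 for size in sizes if size > 1),
        "singleton_families": singleton_genes,
    }


summary = summarize_families(families)
for key, value in summary.items():
    print(key, value)
```

In this toy case the genes in homolog families plus the genes in singleton families add up to the total number of genes, which is a useful sanity check on the numbers reported in the Overview.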
Another way to analyze pangenomes is visually. It can be helpful to look at the same analysis and data in a few different ways to better understand them. We took a look at the raw outputs and compared numbers; now let's create a visual interpretation of the pangenome.
What does a CirclePlot tell us?
Clostridia Example from Dylan Chivian