Generated December 3, 2020

Build and Visualize a Pangenome

Student

Authors: Ellen Dow and Carlos Goller

Topics in Biology Course Applications for KBase

Synopsis: This module introduces students to the concept of pangenomes. A pangenome is useful in studying sets of genomes to learn about "core" and "accessory genes" (Rouli et al. 2015). Tools to build and visualize a pangenome are needed to begin to identify core components and accessory elements. Here is a useful review.

Audience

  • Undergraduate Students
  • Graduate Students

Learning Goals

At the end of this module, you should be able to:

  • Explain annotation (which is also covered in the Genome and Metagenome Modules).
  • Define the concept of pangenome.
  • Explain why visualizing a pangenome is useful and interpret representative examples of pangenomes.
  • List the main objects and main steps in the process of building a pangenome.

Graduate level

  • Identify quality control steps in the process of building pangenomes.
  • Evaluate limitations of pangenome representations.

Biological Topics and Concepts

  • taxonomy
  • pangenome
  • variance in assembly and annotation

Activity Description

This Narrative is an introduction to the workflow of building and visualizing a pangenome. Participants will build a pangenome from a series of available Staphylococcus aureus genomes with the goal of visualizing core and accessory elements.

Pangenome Workflow

  1. Build and Visualize a Pangenome

Optional Adventures

  1. Comparing Features
  2. Phylogenomics

Version

v1.0 (7 Oct 2020): Student
v0.9 (23 Sept 2020): Fall 2020 Semester
v0.1 (21 Aug 2020): Drafting

So, what makes a pangenome?

Overview

A pangenome represents all genes found within a collection of related organisms, grouped by how similar their sequences are to one another, which is used to infer sequence homology. One primary purpose of creating a pangenome is to distinguish which genes are orthologs (vertically inherited genes) and which genes are paralogs (genes that arose from duplication events).

To learn more about how organisms are related or even how specific genes came about, we can use pangenomes to examine similarities and differences across a collection of genomes. Part of the theory behind pangenomes is the existence of a core genome, the set of genes shared by all strains or species in the collection, and a flexible (non-core) genome, where variation exists.
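
As a minimal sketch of the core versus non-core split described above, the Python below classifies gene families by how many genomes contain them. The genome names and family IDs are made up for illustration and are not KBase output; real pangenome tools usually apply more nuanced thresholds.

```python
from collections import Counter

# Hypothetical presence data: genome name -> set of gene family IDs it contains.
genomes = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam3", "fam5"},
}


def split_core_accessory(genome_families):
    """Return (core, accessory) sets of family IDs.

    A family is treated as core only if every genome contains it;
    everything else is treated as accessory (non-core).
    """
    # Count in how many genomes each family appears.
    counts = Counter(fam for fams in genome_families.values() for fam in fams)
    n_genomes = len(genome_families)
    core = {fam for fam, count in counts.items() if count == n_genomes}
    accessory = set(counts) - core
    return core, accessory


core, accessory = split_core_accessory(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Adding a fourth genome that lacks a family currently in the core would move that family into the accessory set, which is one reason the core genome tends to shrink as more genomes are included.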

Evolution

There are several theories on how organisms evolved and how we can see this through phenotypes and pangenomes.

Importing genomes

We'll first need to pull together a set of genomes. In this case, we are continuing our exploration of Staphylococcus aureus from the Genome Modules and will import both the MRSA and MSSA strains.

Public Data

We are continuing to work with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data were generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project - a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.

Importing Data

From Genome Module Part 4

1) In the upper left hand panel, under "DATA", click the red "+" button - this is "Add Data".

2) A list of all the data contained in your narratives should pop up. Select the genome object ( Staphylococcus aureus ), and click the blue "< copy" button that appears to import it into this narrative.

3) Click on the name of the object to add it to the Narrative below. Look to make sure that you're using the right data!

4) Repeat with the other genome.

If starting from scratch, import these two genomes from Public Data:

  • Staphylococcus aureus MRSA strain GCF_000187165.1
  • Staphylococcus aureus MSSA strain GCF_000684475.1

Alternatively, import an already created GenomeSet and skip the step to create a GenomeSet.

Create a set of Staphylococcus genomes

To build a pangenome, we will gather genomes related to Staphylococcus aureus. Group the Methicillin-resistant Staphylococcus aureus (MRSA) and Methicillin-sensitive Staphylococcus aureus (MSSA) strain genomes together with the Batch Create GenomeSet App. Set the Output GenomeSet Name to describe the output object.

Next, we will find closely related genomes with the Insert Set of Genomes into SpeciesTree App, which uses FastTree to quickly estimate relationships among similar sequences. As in the Phylogenetics Module, we are gathering similar sequences into a single object to start building our pangenome. Choose the GenomeSet as the input object and set the neighbor public genome count to between 30 and 35. Give the outputs short but descriptive names.

Putting together a pangenome

Now that we have a SpeciesTree, let's take a look at where our genomes fall within the tree and what species are present.

Double click on the Tree object from the output in the Data Panel to view the tree.

Questions to answer:

Q1) How many genomes of Staphylococcus aureus are present?

Q2) Are genera other than Staphylococcus present? If so, which genera?

Remove non-Staphylococcus genera

If the output GenomeSet contains any genera other than Staphylococcus, we will need to remove those genomes from the data set. Toggle to Beta Apps by clicking on the R in the APPS panel. Search for the Remove Genomes from GenomeSet App and open it. Select the GenomeSet and remove individual genomes by name. Click Run.
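
When the tree contains many genomes, it can help to list which names belong to another genus before removing them in the App. The sketch below is only an illustration: the names are invented placeholders, and it assumes the genus is the first word of each scientific name. It does not call any KBase API.

```python
# Hypothetical scientific names, standing in for the genomes listed in a GenomeSet.
genome_names = [
    "Staphylococcus aureus example strain 1",
    "Staphylococcus epidermidis example strain 2",
    "Macrococcus sp. example strain 3",
]


def split_by_genus(names, genus="Staphylococcus"):
    """Return (keep, remove) lists based on the first word of each name."""
    keep, remove = [], []
    for name in names:
        parts = name.split()
        # Assumption: the genus is the first whitespace-separated word.
        name_genus = parts[0] if parts else ""
        if name_genus == genus:
            keep.append(name)
        else:
            remove.append(name)
    return keep, remove


keep, remove = split_by_genus(genome_names)
print("keep:", keep)
print("remove:", remove)
```

The names in the remove list are the ones you would then select by name in the Remove Genomes from GenomeSet App.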

Annotating your genome set

Genomes from RefSeq do not have NCBI annotations. To keep using KBase tools, we must perform gene functional annotation of all of the Genome objects using RAST. Run Annotate Multiple Microbial Genomes with RASTtk to do this for the whole GenomeSet.

Use the Annotate Multiple Microbial Genomes App, which uses the RAST pipeline to annotate each genome. This will result in the output of a "genome" object. In KBase, a genome is defined as an object describing the genes and other genetic elements encoded within an organism, not just the raw sequence of the genome which came from our assembly. Required options in the annotation tool are indicated with a red line - be sure to specify the correct assembly, the scientific name of the organism, and give it an informative "Genome object" name at the bottom.

Note: This might take a little while depending on the queue, but should be less than 1 hour.

Building a pangenome

There are two ways to build a pangenome in KBase. Both Apps run sequence homology calculations to generate a Pangenome object for further analysis. Build Pangenome with OrthoMCL uses a Markov Cluster-based algorithm to group predicted orthologs and paralogs, while Compute Pangenome is a rapid analysis that groups genes based on k-mers. The k-mer method is much faster, but it might not have the same resolution.
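
To make the idea of grouping "based on k-mers" more concrete, here is a small Python sketch that compares two sequences by the k-mers they share (Jaccard similarity). This is an illustration of the general idea only, not the algorithm used by the Compute Pangenome App or by OrthoMCL; the sequences, the value of k, and the function names are all assumptions made for the example.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Shared k-mers divided by all distinct k-mers across both sequences."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    union = kmers_a | kmers_b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(kmers_a & kmers_b) / len(union)


# Made-up protein fragments that differ at a single position.
protein_1 = "MKTAYIAK"
protein_2 = "MKTAYLAK"

print(jaccard(protein_1, protein_1))
print(jaccard(protein_1, protein_2))
```

Notice that a single substitution changes every k-mer that overlaps it, so k-mer similarity drops quickly as sequences diverge. This is one way to think about why a fast k-mer approach and an alignment-based clustering approach may not group genes identically.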

Questions to answer:

Q3) Which method would you use and why?

Q4) When would it be useful to use the method that you did not choose?

Build with OrthoMCL

Open the Build Pangenome with OrthoMCL App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

Compute Pangenome

Open the Compute Pangenome App. Choose the annotated GenomeSet as the Input object and give a descriptive name for the output pangenome.

While these are running, move to the next step.

Repeat for a second, smaller pangenome

This time, run OrthoMCL again with only the sequences that are within the Staphylococcus aureus clade.

This set will be smaller than the prior version. Use Build GenomeSet to define the list of genomes to include. Here, we'll just select the annotated Staphylococcus aureus genomes. Be sure to set the "Output Objects" name to specify this.

Run the Build Pangenome with OrthoMCL App with the Staphylococcus aureus GenomeSet as the input. This will cluster genes from the strains into groups.

Then run the Compute Pangenome App. Choose the annotated GenomeSet as the input object and give a descriptive name for the output pangenome.

Check out your pangenomes

Double click on the output object to get an Overview and a quick glance at the Genome Comparison and Families.

Compare the outputs between the Build Pangenome with OrthoMCL App and Compute Pangenome App.

Questions to answer:

Q5) How do the outputs vary between the two pangenome methods? Are they the same? Why or why not?

Q6) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?

Q7) What is the number of translated genes?

Q8) What is the number of genes in homolog families?

Q9) What is the number of genes in singleton families?

Q10) What does the OrthoMCL App do? Why?
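
For Q7 to Q9 above, it may help to see how such counts relate to one another. The sketch below uses a made-up set of gene families and assumes that a "singleton family" contains exactly one gene and a "homolog family" contains more than one. The data structure and names are hypothetical and do not reflect the actual format of a KBase Pangenome object.

```python
# Hypothetical families: family ID -> list of (genome, gene ID) members.
families = {
    "fam1": [("strain_A", "A_0001"), ("strain_B", "B_0001"), ("strain_C", "C_0001")],
    "fam2": [("strain_A", "A_0002"), ("strain_A", "A_0003")],
    "fam3": [("strain_B", "B_0002")],
    "fam4": [("strain_C", "C_0002")],
}


def summarize(fams):
    """Count genes and families, split into homolog and singleton families."""
    # Assumption: a family with one member is a singleton; otherwise a homolog family.
    homolog = {fid: members for fid, members in fams.items() if len(members) > 1}
    singleton = {fid: members for fid, members in fams.items() if len(members) == 1}
    return {
        "total_genes": sum(len(members) for members in fams.values()),
        "homolog_families": len(homolog),
        "genes_in_homolog_families": sum(len(m) for m in homolog.values()),
        "singleton_families": len(singleton),
        "genes_in_singleton_families": len(singleton),
    }


summary = summarize(families)
for key, value in summary.items():
    print(key, value)
```

In this toy example, fam2 holds two genes from the same genome, which is the kind of grouping you would expect for genes that arose by duplication.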

Comprehension questions

CQ1) Why might creating a pangenome be helpful in analyzing your data?

CQ2) If focusing on a single species, how would you know that you have all possible genomes?

Visualization

Another way to analyze pangenomes is visually. It can be helpful to look at the same analysis and data in a few different ways to better understand it. We took a look at the raw outputs and compared numbers; now let's create a visual interpretation of the pangenome.

Pangenome CirclePlot

What does a CirclePlot tell us?

Ideal

Clostridia example from Dylan Chivian

[Figure: Pangenome_fromDylan.png]

Reality

What does your own Pangenome look like?

Another way to look at it

Create a circle plot of your pangenome

To run this App successfully, the base genome needs to have the same permanent ID as what is listed within the pangenome. If permanent IDs of input objects do not match, the App will not run successfully.

1) Open the Pangenome Circle Plot App.

2) Select the Pangenome object and Base genome to visualize the Pangenome. Ensure the base genome is within the Narrative and has a matching permanent object ID. Select the "DO save feature sets" parameter.

3) Click Run.

Analyzing your pangenomes

Questions to answer:

Q13) How do the outputs vary between the two pangenome methods? Are they the same?

Q14) What are the similarities between the pangenomes with different datasets using the same method of building a pangenome? What are the differences?

Optional analysis

Add another Pangenome Circle Plot App cell and run with a different base genome to compare the two circle plots.

Questions to answer:

Q15) How does the circle plot change when using a different base genome?

Q16) Why do you think the plots are different?

Choose Your Next Adventure

  1. Comparing Features
  2. Phylogenomics

Released Apps

  1. Annotate Multiple Microbial Genomes with RAST - v1.073
    • [1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75
    • [2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206-D214. doi:10.1093/nar/gkt1226
    • [3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365
    • [4] Kent WJ. BLAT: The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656-664. doi:10.1101/gr.229202
    • [5] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389
    • [6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955-964.
    • [7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793-803. doi:10.1007/s00792-012-0482-8
    • [8] Meyer F, Overbeek R, Rodriguez A. FIGfams: yet another set of protein families. Nucleic Acids Res. 2009;37: 6643-6654. doi:10.1093/nar/gkp698
    • [9] van Belkum A, Sluijter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176-1179.
    • [10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120
    • [11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119
    • [12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673-679. doi:10.1093/bioinformatics/btm009
    • [13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406
  2. Batch Create Genome Set - v1.2.0
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163
  3. Build GenomeSet - v1.0.1
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163
  4. Build Pangenome with OrthoMCL - v2.0
    • Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003;13: 2178-2189. doi:10.1101/gr.1224503
  5. Compute Pangenome
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163
  6. Insert Set of Genomes Into SpeciesTree - v2.2.0
    • Price MN, Dehal PS, Arkin AP. FastTree 2: Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One. 2010;5. doi:10.1371/journal.pone.0009490
  7. Pangenome Circle Plot - v1.2.0
    • Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003;13: 2178-2189. doi:10.1101/gr.1224503

Apps in Beta

  1. Remove Genomes from GenomeSet - v1.5.0
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163