This tutorial will show you how to use the Compare Genomes from Pangenome app to create a pangenome in the KBase Narrative Interface and then navigate the results.
In this tutorial, you will:
The Compare Genomes from Pangenome app conducts a detailed comparison of genomes on the basis of protein sequence similarity and function. It begins by creating a pangenome for a set of closely related organisms. The pangenome is defined as the set of conserved and variable genes found within a set of related genomes. In many cases, a pangenome analysis is desirable for understanding which genes were gained and lost between strains or for inferring which genes may confer a phenotype to a given strain. For more information, please see the details page for this app.
Step 1. Compute Pangenome
Compute pangenome of a set of individual genomes.
Step 2. Compare Genomes from Pangenome
Compare isofunctional and homologous gene families for all genomes in the pangenome.
This app takes one or more “Genome” objects as input. In KBase, a “Genome” or “Genome typed object” is a special object type that contains the feature calls and annotation data for a genome. You can load genome data into KBase for analysis in a number of ways.
Alternatively, this app can take a Genome Set object as input, which can be created by the Insert Genomes into Species Tree app.
This tutorial will take you through the steps for running the Compare Genomes from Pangenome app using example data from KBase’s reference data collection. Once you’re ready to upload your own data, see the Data Upload and Download Guide for instructions on uploading contigs from a GenBank formatted file and for importing a GenBank genome from FTP.
This app generates two data objects: a Pangenome object and a Genome Comparison object. The Pangenome object records how many times each protein family occurs across the set of genomes (note that a family can have a single member). The Genome Comparison object is a mapping of functions and families.
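To make the idea of per-family occurrence counts concrete, here is a minimal Python sketch. The dictionary layout, family names, genome names, and gene names are all hypothetical and are not the actual schema of the KBase Pangenome object.

```python
from collections import Counter

# Hypothetical toy data: protein family -> list of (genome, gene) members.
# This is NOT the real KBase Pangenome schema, only an illustration.
families = {
    "fam1": [("genomeA", "geneA1"), ("genomeB", "geneB1")],
    "fam2": [("genomeA", "geneA2"), ("genomeA", "geneA3"), ("genomeB", "geneB2")],
    "fam3": [("genomeB", "geneB3")],  # a family with a single member
}


def family_occurrences(families):
    """Return {family: Counter({genome: number of member genes})}."""
    return {
        fam: Counter(genome for genome, _gene in members)
        for fam, members in families.items()
    }


occurrences = family_occurrences(families)
for fam, counts in occurrences.items():
    print(fam, dict(counts))
```

Here fam3 illustrates a family with only one member, and fam2 shows a family that occurs twice in one genome and once in another.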
This app is designed to work on closely related genomes. We currently recommend using this tool to compare organisms whose 16S genes are at least 90% identical. Development of tools that will work on more distantly related taxa is under way.
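As a rough illustration of the 90% threshold, the sketch below computes percent identity for two sequences that have already been aligned. It assumes equal-length aligned strings with "-" as the gap character and divides matches by the number of alignment columns. Other definitions of percent identity use a different denominator, and KBase may calculate the value differently. The sequences are short made-up examples, not real 16S genes.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences have a gap.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Made-up aligned fragments that differ at a single position.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAC"
identity = percent_identity(seq_a, seq_b)
print(identity, identity >= 90.0)
```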
The Compare Genomes from Pangenome app works by generating a measure of protein similarity using a rapid k-mer-based approach that bins proteins based on the number of signature amino acid 8-mers that they have in common. Unlike BLAST-based tools, the k-mer-based evaluation works quickly for large sets of genomes. For more information on how the algorithm works, please refer to the details page for the Compute Pangenome app.
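The general idea of comparing proteins by shared 8-mers can be sketched in a few lines of Python. This is only an illustration, not the app's implementation. It counts every 8-mer that two sequences share, whereas the app relies on a set of signature 8-mers, and the sequences and function names below are made up for the example.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(seq_a, seq_b, k=8):
    """Count the distinct k-mers that two protein sequences have in common."""
    return len(kmers(seq_a, k) & kmers(seq_b, k))


# Hypothetical toy sequences, invented for illustration.
p1 = "MKTAYIAKQRQISFVKSHFSRQ"
p2 = "MKTAYIAKQRQISFVKAHFSRQ"  # one substitution relative to p1
p3 = "MSDNELLKQWERTYIPASDFGH"  # unrelated sequence

print(shared_kmer_count(p1, p2))
print(shared_kmer_count(p1, p3))
```

Proteins that share many 8-mers, such as p1 and p2, would be candidates for the same family, while p1 and p3 share none. Because the comparison uses set lookups rather than pairwise alignment, this style of comparison stays fast as the number of genomes grows.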
Note: This tutorial assumes that you have already created a new Narrative. For instructions on how to accomplish this and other tasks such as finding or uploading data to your Narrative, please refer to the Narrative Interface User Guide.
Step 1. Add data that you want to analyze
The first step in running this app is to copy or upload the needed input data. For the point and click instructions, we will copy two annotated genomes into our Narrative from the KBase reference collection.
First, click the Add Data (or the “+”) button in the Data Panel on the left of your screen. (If you don’t see this button, make sure you have the Analyze tab selected.) The Data Browser will slide out, with tabs that show several data sources.
Choose the Public tab to see a list of public KBase reference data. Genomes are displayed by default, but the data types dropdown menu allows you to search for other types of data as well.
With Genomes selected, search for “Escherichia coli str. K-12 substr. MG1655”. In the list of results, you may notice two entries with this name. For this example, we will choose the genome with only one contig. Add the genome to your Narrative by mousing over it and then clicking the Add button that appears to its left. Next, search for “Escherichia coli W3110” and add this genome as well. (Again, we will choose the genome with one contig.)
Try this later
Once you are ready to analyze your own data, you may want to use genomes that you have annotated using the Annotate Microbial Contigs app.
Exit the Data Browser by clicking either the Close button at the bottom right of the browser window or the arrow at the top of the Data Panel. (Note that you also can close the Data Browser by clicking anywhere in the main Narrative panel in the center.)
Notice that your Data Panel now shows the annotated genomes that you added to your Narrative. You can find out more about these datasets by mousing over them and clicking the “…” that appears. An expanded view of the objects will open with options for exploring the data, downloading it, and more. Please see the Explore Data section of the Narrative Interface User Guide for more information.
Step 2. Add and run the app
Now that you have your input data, you can add the Compare Genomes from Pangenome app to your Narrative. Take a closer look at the Apps Panel directly below your data. Notice that when you click on one of the Genome objects in your Data Panel, the list of apps is filtered so that only those that take a genome as input or generate one as output are displayed.
You can search for apps using the search box at the top of the Apps Panel or just scroll until you find the one you want. Locate the Compare Genomes from Pangenome app and click on its name or icon to add it as a new cell in the main Narrative panel.
To run the app on the sample genomes that you copied, you must first fill in the fields in each step in the app cell. The detailed parameters for each app are described in the individual app details pages. For this app, the two steps are:
The first step in this app, Compute Pangenome, takes either a genome set ID or one or more genomes as input. Since we are using the two genomes we added, we can leave the Genome Set ID blank. (Notice that this field doesn’t have a red arrow next to it, meaning that it’s not a required field.)
In the second field, Genome(s), select the E. coli K-12 genome from the pulldown list. Next, add the W3110 genome by clicking on the +add another Genome(s) button. Note that the import order of the genomes will affect how the output is displayed once the app has run.
Now we need to enter a Pangenome ID, which will become the name of our Pangenome object. Here, we will use “Test_Pangenome” as the ID. Once entered, the name is automatically filled in for the first field in Step 2. Finally, enter “Test_Comparison” as the name for the Genome Comparison object.
Notice that as you fill in the required parameter fields, the red arrows next to those fields change to green checkmarks. Once all required fields have a green checkmark, the app is ready to run.
Click Run to launch the job that computes the pangenome (the first step of the app). Once the job is initiated, a blue box will appear around Step 1, signifying that it is running. Also, a message at the bottom of the app cell will indicate the job was submitted. Depending on the queue size, the app should take only a few minutes to run. You can check the status of your job by clicking the Jobs tab at the top of the left sidebar.
Step 3. Look at the output
When the app completes, two new data objects—the Pangenome and the Genome Comparison—will appear in your Data Panel. In addition, two output cells will appear in the main Narrative panel. The first output cell summarizes the data in the Pangenome object, and the second summarizes the data in the Genome Comparison object.
The pangenome output has three tabs for reviewing the data: Overview, Shared homolog families, and Protein families.
If you examine the output of the Pangenome object, you can get a feel for the protein families that are common to the genomes and those that represent singletons. Under the Protein families tab, you can sort the table columns by clicking on their headings. To sort multiple columns simultaneously, shift-click on the column headings.
Browsing through the Genome Comparison output can provide information on the correspondence between functions and families. This is a useful first step for editing family membership and finding protein families that contain members with differing annotations. Such an approach is often useful for finding missing reactions in modeling.
Step 4. Download the results
You can download the Pangenome object in several formats: TSV (tab-separated values), Excel, or JSON. Open the expanded view of the object in the Data Panel, then click on the Export/Download data icon to see the download options.
Note to PC users: If downloading to Excel, the data will be placed into a zipped folder whose name (or path) can be long, depending on the data object’s name and type. If the folder path becomes too long, Windows may not be able to open it. Try copying or moving the file to a folder or directory that has a shorter path if you encounter problems.
Currently, the Genome Comparison object can be downloaded only in JSON format.
Typically, a biologist would want to create a pangenome to narrow down the candidate genes in a given strain that may be causing a particular phenotype. We will demonstrate such a use case by creating and comparing a pangenome for three strains of Escherichia coli: E. coli K-12, E. coli W3110, and E. coli O157:H7. E. coli K-12 and W3110 are common, benign laboratory strains, but O157:H7 is special because, although nearly identical to the others, its genome encodes additional proteins that cause bloody diarrhea. We will see if the Compare Genomes from Pangenome app can help us find these genes.
For this use case, we will find the three E. coli genomes in KBase’s public data collection. Access the Data Browser by clicking on the Add Data (or “+”) button in the Data Panel. Click the Public tab in the browser slideout and make sure Genomes is selected as the data type. Search for the following strains and add them to your Narrative. (Note: If you are still working in the Narrative that you created by following the point and click instructions above, then you will need to add only the O157:H7 genome.)
Add and run the app
Whether you are working in a new Narrative or the one created earlier in this tutorial, you will need to add a new Compare Genomes from Pangenome app cell to your Narrative. Find this app in the Apps Panel, add it, and fill in the fields.
In this case, we will leave the first field, Genome Set, blank. In the second field, Genome(s), add each of the three E. coli genomes. Since the input order of the genomes affects how the output is displayed, add the genomes in this order: O157:H7, W3110, and K-12.
Next, enter “E_coli_pangenome” in the Pangenome ID field. Again, this output object from Step 1 of the app will automatically populate the first field of Step 2.
We will call our Genome Comparison “E_coli_comparison” and then click Run.
The job should complete in about 3 minutes and produce two output datasets: a Pangenome object and a Genome Comparison object. Output cells for each appear in the main Narrative panel and contain information about the objects.
The pangenome output has three tabs: Overview, Shared homolog families, and Protein families. For our purposes, there are several noteworthy results. First, under the Overview tab, notice that the pathogenic O157:H7 has far more singleton proteins (1001) than either of the two benign strains. Presumably, the genes that confer pathogenicity are among these singletons.
Next, under the Shared homolog families tab, notice that O157:H7 has the smallest number of homolog families (3766) compared to the benign strains, which are more closely related (3923 and 3915, respectively). The data under the Protein families tab are organized based on the first genome that was entered. In our case, this was O157:H7.
The genome comparison output cell contains similar information. It lists the genomes and their number of protein families, along with gene functions common among the three strains. Note the lower number of shared families and functions in the first row corresponding to O157:H7 (G0) vs. G1 and G2.
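The per-genome counts discussed here can be pictured with a small Python sketch. It assumes that a singleton is a family with exactly one member protein and that a shared family has members in at least two genomes. The app's own definitions may differ in detail, and the family and gene names are hypothetical toy data rather than real output.

```python
from collections import defaultdict

# Hypothetical family membership: family -> list of (genome, gene).
families = {
    "f1": [("O157", "g1"), ("W3110", "g2"), ("K12", "g3")],
    "f2": [("W3110", "g4"), ("K12", "g5")],
    "f3": [("O157", "g6")],
    "f4": [("O157", "g7")],
    "f5": [("K12", "g8")],
}


def summarize(families):
    """Count singleton families and shared families for each genome."""
    singletons = defaultdict(int)
    shared = defaultdict(int)
    for members in families.values():
        genomes = {genome for genome, _gene in members}
        if len(members) == 1:
            # One member protein only: a singleton for that genome.
            singletons[members[0][0]] += 1
        elif len(genomes) > 1:
            # Members in two or more genomes: shared by each of them.
            for genome in genomes:
                shared[genome] += 1
        # Families with several members in one genome count as neither.
    return dict(singletons), dict(shared)


singletons, shared = summarize(families)
print("singletons:", singletons)
print("shared families:", shared)
```

In this toy data, the O157 genome has the most singletons and the fewest shared families, which mirrors the pattern described for the real strains.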
The Functions and Families tabs contain the data organized by function (gene annotation) and family membership, respectively. Note that a family can contain proteins with more than one function if the proteins meet the k-mer-based similarity criteria.
Normally, a researcher would want to closely examine the protein families that were generated, particularly if there were many genomes involved in the comparison. However, for the purposes of this brief tutorial, we will bypass a thorough examination.
Return to the output cell for the Pangenome object and click on the Protein families tab. Enter “toxin” in the search bar and sort by ID (the protein ID) so that the entries beginning with “kb|g.3398” are displayed first; kb|g.3398 is the KBase identifier for the O157:H7 genome.
You can click on a gene ID in an output table to see information about the gene, including which genome it belongs to. This information will open in a new tab within the table. When you have finished with the new tab and click to close it, a popup window will ask you to confirm the deletion. Don’t worry: this only deletes the new tab, not the gene ID, which remains in your output table.
Finally, if you browse through the pages of sorted output, you will see important toxin genes (including “Enterotoxin”) that are found only in O157:H7 and contribute to its pathogenicity.
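If you export the Pangenome table as TSV (one of the download formats described in Step 4), the same search and sort can be approximated offline. The sketch below is only an illustration. The column names ("function", "id"), the gene IDs, and the rows are made up, and a real export may be organized differently. Only the kb|g.3398 genome identifier comes from this tutorial.

```python
import csv
import io

# Made-up stand-in for a TSV export; column names and rows are hypothetical.
tsv_text = (
    "function\tid\n"
    "Enterotoxin\tkb|g.3398.peg.10\n"
    "Toxin-antitoxin system protein\tkb|g.0000.peg.20\n"
    "DNA polymerase I\tkb|g.3398.peg.30\n"
)


def find_rows(tsv_text, keyword, id_prefix):
    """Return rows whose function mentions keyword and whose ID starts with id_prefix."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [
        row for row in reader
        if keyword.lower() in row["function"].lower()
        and row["id"].startswith(id_prefix)
    ]
    # Sort by ID, mirroring the sort applied in the output table.
    return sorted(hits, key=lambda row: row["id"])


hits = find_rows(tsv_text, "toxin", "kb|g.3398")
for row in hits:
    print(row["id"], row["function"])
```

To run this against a real download, replace io.StringIO(tsv_text) with an open file handle and adjust the column names to match the exported header.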