Insert Genomes into Species Tree App

Not yet updated for Release 3.0

The instructions in this document were written for Release 2.0. The December 2016 release (3.0) looks a bit different, though the overall operation is similar. This Narrative tutorial will be updated soon.

Description of tutorial

This tutorial describes how to create a phylogenetic tree of closely related genomes in the KBase Narrative Interface using the Insert Genomes into Species Tree app and then navigate and curate the resulting species tree and genome set.

In this tutorial, you will:

  • Find a genome to insert into a species tree using the Narrative Interface Data Browser.
  • Add the genome to your Narrative.
  • Find and add the Insert Genomes into Species Tree app into your Narrative.
  • Fill in the required parameters and run the app to generate a species tree and genome set.
  • Browse the species tree.
  • Curate the resulting genome set by removing or adding genomes.
  • Apply this app to a biological use case that involves investigating the possible evolutionary history of a group of bacterial pathogens that lack a cell wall.

Description of the app

The Insert Genomes into Species Tree app enables a user to determine evolutionary relationships between organisms based on the differences in their genomic sequences by creating both a species tree and a genome set of closely related organisms. A set of reference alignments based on 49 highly conserved Clusters of Orthologous Groups (COG) families is used to find the corresponding set of sequences in the selected genome. These sequences are then inserted into the reference alignments, the closest neighbors are extracted and concatenated, and a tree is built from them using FastTree2 (an approximate maximum likelihood method). Note that if the genome you insert is already part of the reference set of alignments used to build the tree, it will appear twice in both the tree and the genome set generated by this app.
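To make the "insert, pick neighbors, concatenate" idea more concrete, here is a minimal Python sketch on toy data. It is not the app's implementation: the genome names, the tiny alignments, and the use of a simple p-distance (fraction of differing ungapped columns) to rank neighbors are all assumptions made for illustration, and the tree-building step with FastTree2 is not shown.

```python
from typing import Dict, List

# Hypothetical toy input: one alignment per COG family,
# each mapping a genome ID to its aligned sequence.
Alignment = Dict[str, str]


def concatenate(alignments: Dict[str, Alignment], genomes: List[str]) -> Dict[str, str]:
    """Join each genome's aligned sequences across families (sorted by family ID).

    A genome missing from a family is padded with gaps so columns stay aligned.
    """
    parts: Dict[str, List[str]] = {g: [] for g in genomes}
    for family in sorted(alignments):
        aln = alignments[family]
        width = len(next(iter(aln.values())))
        for g in genomes:
            parts[g].append(aln.get(g, "-" * width))
    return {g: "".join(p) for g, p in parts.items()}


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions among columns where neither sequence has a gap."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 1.0


def closest_neighbors(query: str, seqs: Dict[str, str], k: int) -> List[str]:
    """Return the k genomes with the smallest p-distance to the query (ties by name)."""
    others = [g for g in seqs if g != query]
    return sorted(others, key=lambda g: (p_distance(seqs[query], seqs[g]), g))[:k]


# Toy alignments for two of the COG families listed in this tutorial.
toy = {
    "COG0048": {"query": "MKTA", "refA": "MKTA", "refB": "MRTS", "refC": "LRSS"},
    "COG0049": {"query": "GLV-", "refA": "GLVK", "refB": "GIVK"},
}
genomes = ["query", "refA", "refB", "refC"]
concat = concatenate(toy, genomes)
neighbors = closest_neighbors("query", concat, k=2)
print(neighbors)
```

In this sketch, the value of k plays the role of the "number of neighboring genomes" parameter you set when running the app.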

The 49 COG domains used by this app are:

  • COG0012, COG0012, Predicted GTPase, probable translation factor [Translation, ribosomal structure and biogenesis]
  • COG0013, AlaS, Alanyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]
  • COG0016, PheS, Phenylalanyl-tRNA synthetase alpha subunit [Translation, ribosomal structure and biogenesis]
  • COG0018, ArgS, Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]
  • COG0030, KsgA, Dimethyladenosine transferase (rRNA methylation) [Translation, ribosomal structure and biogenesis]
  • COG0041, PurE, Phosphoribosylcarboxyaminoimidazole (NCAIR) mutase [Nucleotide transport and metabolism]
  • COG0046, PurL, Phosphoribosylformylglycinamidine (FGAM) synthase, synthetase domain [Nucleotide transport and metabolism]
  • COG0048, RpsL, Ribosomal protein S12 [Translation, ribosomal structure and biogenesis]
  • COG0049, RpsG, Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]
  • COG0051, RpsJ, Ribosomal protein S10 [Translation, ribosomal structure and biogenesis]
  • COG0052, RpsB, Ribosomal protein S2 [Translation, ribosomal structure and biogenesis]
  • COG0072, PheT, Phenylalanyl-tRNA synthetase beta subunit [Translation, ribosomal structure and biogenesis]
  • COG0080, RplK, Ribosomal protein L11 [Translation, ribosomal structure and biogenesis]
  • COG0081, RplA, Ribosomal protein L1 [Translation, ribosomal structure and biogenesis]
  • COG0082, AroC, Chorismate synthase [Amino acid transport and metabolism]
  • COG0086, RpoC, DNA-directed RNA polymerase, beta’ subunit/160 kD subunit [Transcription]
  • COG0087, RplC, Ribosomal protein L3 [Translation, ribosomal structure and biogenesis]
  • COG0088, RplD, Ribosomal protein L4 [Translation, ribosomal structure and biogenesis]
  • COG0089, RplW, Ribosomal protein L23 [Translation, ribosomal structure and biogenesis]
  • COG0090, RplB, Ribosomal protein L2 [Translation, ribosomal structure and biogenesis]
  • COG0091, RplV, Ribosomal protein L22 [Translation, ribosomal structure and biogenesis]
  • COG0092, RpsC, Ribosomal protein S3 [Translation, ribosomal structure and biogenesis]
  • COG0093, RplN, Ribosomal protein L14 [Translation, ribosomal structure and biogenesis]
  • COG0094, RplE, Ribosomal protein L5 [Translation, ribosomal structure and biogenesis]
  • COG0096, RpsH, Ribosomal protein S8 [Translation, ribosomal structure and biogenesis]
  • COG0097, RplF, Ribosomal protein L6P/L9E [Translation, ribosomal structure and biogenesis]
  • COG0098, RpsE, Ribosomal protein S5 [Translation, ribosomal structure and biogenesis]
  • COG0099, RpsM, Ribosomal protein S13 [Translation, ribosomal structure and biogenesis]
  • COG0100, RpsK, Ribosomal protein S11 [Translation, ribosomal structure and biogenesis]
  • COG0102, RplM, Ribosomal protein L13 [Translation, ribosomal structure and biogenesis]
  • COG0103, RpsI, Ribosomal protein S9 [Translation, ribosomal structure and biogenesis]
  • COG0105, Ndk, Nucleoside diphosphate kinase [Nucleotide transport and metabolism]
  • COG0126, Pgk, 3-phosphoglycerate kinase [Carbohydrate transport and metabolism]
  • COG0127, COG0127, Xanthosine triphosphate pyrophosphatase [Nucleotide transport and metabolism]
  • COG0130, TruB, Pseudouridine synthase [Translation, ribosomal structure and biogenesis]
  • COG0150, PurM, Phosphoribosylaminoimidazole (AIR) synthetase [Nucleotide transport and metabolism]
  • COG0151, PurD, Phosphoribosylamine-glycine ligase [Nucleotide transport and metabolism]
  • COG0164, RnhB, Ribonuclease HII [DNA replication, recombination, and repair]
  • COG0172, SerS, Seryl-tRNA synthetase [Translation, ribosomal structure and biogenesis]
  • COG0185, RpsS, Ribosomal protein S19 [Translation, ribosomal structure and biogenesis]
  • COG0186, RpsQ, Ribosomal protein S17 [Translation, ribosomal structure and biogenesis]
  • COG0215, CysS, Cysteinyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]
  • COG0244, RplJ, Ribosomal protein L10 [Translation, ribosomal structure and biogenesis]
  • COG0256, RplR, Ribosomal protein L18 [Translation, ribosomal structure and biogenesis]
  • COG0343, Tgt, Queuine/archaeosine tRNA-ribosyltransferase [Translation, ribosomal structure and biogenesis]
  • COG0504, PyrG, CTP synthase (UTP-ammonia lyase) [Nucleotide transport and metabolism]
  • COG0519, GuaA, GMP synthase, PP-ATPase domain/subunit [Nucleotide transport and metabolism]
  • COG0532, InfB, Translation initiation factor 2 (IF-2; GTPase) [Translation, ribosomal structure and biogenesis]
  • COG0533, QRI7, Metal-dependent proteases with possible chaperone activity [Posttranslational modification, protein turnover, chaperones]

For more information, please see the details page for this app.

Description of the input

This app takes one or more “Genomes” as input. In KBase, a “Genome” or “Genome typed object” is a special object type that contains the feature calls and annotation data for a genome. You can load genome data into KBase for analysis in a number of ways:

  1. Upload your own data in GenBank format from your computer.
  2. Import a GenBank file directly from NCBI using FTP within the Narrative Interface.
  3. Search for and add to your Narrative a genome already available among reference data integrated into KBase from external repositories.
  4. Use example data from the Data Browser slideout panel.
  5. Use a Genome you’ve already used in another Narrative or that another user has shared with you.
  6. Use a Genome object produced by other apps in your Narrative (such as Annotate Microbial Contigs).

This tutorial will take you through the steps for running the Insert Genomes into Species Tree app using example data from KBase’s reference data collection.

Once you’re ready to upload your own data, see the Genome Data Upload and Download Guide for instructions on uploading a genome from GenBank.

Description of the output

The output of this app is a species tree of related organisms, along with a genome set containing the genomes in that tree.

Point and click instructions for using this app

Note: This tutorial assumes that you have already created a new Narrative. For instructions on how to accomplish this and other tasks such as finding or uploading data to your Narrative, please refer to the Narrative Interface User Guide.

Step 1. Add data that you want to analyze

Before we run the app, we need to copy or upload the needed input data. For the point and click instructions, we will start by copying an annotated genome into our Narrative from the KBase reference data collection.

First, click the Add Data (or “+”) button in the Data Panel on the left of your screen. (If you don’t see this button, make sure you have the Analyze tab selected.) The Data Browser will slide out, with tabs that show several data sources. Choose the Public tab to see a list of publicly available KBase reference data. Genomes are displayed by default, but the data types dropdown menu allows you to search for other types of data as well.

With Genomes selected, search for “Escherichia coli str. K-12 substr. MG1655.” Add the genome to your Narrative by mousing over it and then clicking the Add button that appears to its left. (Here, we will use the MG1655 substrain with 4520 genes, but feel free to use another substrain or genome if you choose.)

Exit the Data Browser by clicking either the Close button at the bottom right of the browser window or the arrow at the top of the Data Panel. (Note that you also can close the Data Browser by clicking anywhere in the main Narrative panel in the center.)

Notice that your Data Panel now displays the annotated genome you added.

You can find out more about this genome by clicking the “…” that appears when mousing over the object in the Data Panel or by dragging it into the main Narrative panel to create a Genome Viewer cell. Please see the Explore Data section of the Narrative Interface User Guide for more information.

Step 2. Add and run the app

Now that you have your input data, you can add the Insert Genomes into Species Tree app to your Narrative. Take a closer look at the Apps Panel directly below your data.

You can search for apps using the search box at the top of the Apps Panel or just scroll until you find the one you want. Locate the Insert Genomes into Species Tree app and click on its name or icon to add it as a new cell in the main Narrative panel.

To run the app on the sample genome you copied, you must first fill out the fields in each step in the app cell. In the first field (Genome), select the newly added E. coli genome. Next, we must specify the number of neighboring genomes for the tree, choosing 10 in this example for simplicity. Now provide a name for the tree that will be generated. Here, we will use “E_coli_tree.”

Finally, click the green Run button at the top right of the app cell to launch the analysis job.

This app typically takes about 3 to 20 minutes to run, depending on how many other jobs are queued or running.

Be sure to save your Narrative frequently, using the Save button at the top right of the screen.

Step 3. Look at the output

When the job finishes, you will see an output cell below the app. Also notice your Data Panel. It now contains the Tree object.

Examine the output cell containing the species tree of the E. coli K-12 genome and its 10 closest relatives. If you click on an internal parent node, all of its children will collapse and the node will turn green. (Note that this may reconfigure the tree topology.) To display the children again, click the green node. Clicking on a terminal node will bring up (in a new browser tab) a page of information about the corresponding genome. (Note that Data Landing pages, such as this one, are still in development.)
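The collapse and expand behavior described above can be pictured with a small Python sketch. It parses a simplified Newick string (names only, no branch lengths or quoted labels) and lists which tips would be drawn when an internal node is collapsed. The tree string, the `Node` class, and the function names are hypothetical and are not part of KBase or its tree viewer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str = ""
    children: List["Node"] = field(default_factory=list)
    collapsed: bool = False


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string (names only, no branch lengths or quoting)."""
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # skip "("
            node.children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                node.children.append(parse_node())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        node.name = text[start:pos]
        return node

    return parse_node()


def visible_tips(node: Node) -> List[str]:
    """Names shown at the tips; a collapsed internal node hides its descendants."""
    if not node.children:
        return [node.name]
    if node.collapsed:
        return [f"[{node.name or 'collapsed'}]"]
    tips: List[str] = []
    for child in node.children:
        tips.extend(visible_tips(child))
    return tips


# Hypothetical four-genome tree with labeled internal nodes.
tree = parse_newick("((genomeA,genomeB)clade1,(genomeC,genomeD)clade2)root;")
expanded = visible_tips(tree)

tree.children[0].collapsed = True   # like clicking the internal node "clade1"
collapsed = visible_tips(tree)

tree.children[0].collapsed = False  # like clicking the collapsed node again
restored = visible_tips(tree)
print(expanded, collapsed, restored)
```

Collapsing only changes what is displayed; in this sketch the underlying tree structure is left untouched, which is why expanding the node restores the original tips.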

If you click on the green Change layout button in the top right corner of the cell (you may need to scroll to see it), the tree layout will switch to a circular format.

Step 4. Download the results

Download options for the data generated by this app are still in development. We hope to make this capability available soon.

Biological use case

The Insert Genomes into Species Tree app allows users to rapidly create trees for genomes in KBase’s reference data collection without having to select the close relatives or generate the alignment and tree themselves. This capability enables users to easily and rapidly assess speciation events and to quickly select genomes for further analyses.

In this use case, we will generate a tree for the Mycoplasma, a group of obligate intracellular bacterial pathogens that lack a cell wall. According to the NCBI Taxonomy Browser, they belong to a phylum-level clade called the Tenericutes because they lack a cell wall (the term “Tenericute” was coined in the early 1980s to mean soft cuticle). We will assess the evolutionary history of the Mycoplasma using the Insert Genomes into Species Tree app.

First, create a new Narrative (or continue working with the one you already created). Using the Data Browser, add a Mycoplasma capricolum genome from KBase’s public reference data.


Next, add the Insert Genomes into Species Tree app to your Narrative, and select the Mycoplasma capricolum genome from the dropdown menu for the Genome field. We will use 100 neighbors and call the resulting tree “Mycoplasma_tree.” Name the genome set “Mycoplasma_set.”

Click Run to start the analysis, which may take up to 20 minutes.

Be sure to save your Narrative frequently, using the Save button at the top right of the screen.

When the job completes, you will see a very large tree of Mycoplasma strains and their associated relatives.


If you collapse some shorter branches, you will notice that the Mycoplasma and other wall-less organisms (the ones with “plasma” in their names) share a branch with Lactobacillus, a low G+C Gram-positive bacterium that has a cell wall. The tree thus suggests that the ancestor of the Mycoplasma was an organism with a cell wall, and that during their evolutionary history the Mycoplasma became host-associated and ultimately lost the cell wall.
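The reasoning above (find the wall-less organisms, then see which walled organisms sit on the same branch) can be sketched in a few lines of Python. The tree below is a hypothetical toy topology written as nested tuples that only mirrors the description in the text; it is not output from the app, and the genome labels are made up for illustration.

```python
from typing import List, Set, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def leaves(tree: Tree) -> List[str]:
    """List all leaf names under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    names: List[str] = []
    for child in tree:
        names.extend(leaves(child))
    return names


def sister_leaves(tree: Tree, targets: Set[str]) -> List[str]:
    """Leaves outside `targets` that share a branch with them.

    Walks down to the smallest clade holding every target, then reports
    the non-target leaves under that clade's parent.
    """
    parent = None
    node = tree
    while not isinstance(node, str):
        nxt = next((c for c in node if targets <= set(leaves(c))), None)
        if nxt is None:
            break
        parent, node = node, nxt
    clade = parent if parent is not None else node
    return [name for name in leaves(clade) if name not in targets]


# Hypothetical toy topology (an assumption, not a result from KBase).
toy_tree = (
    ((("Mycoplasma_A", "Mycoplasma_B"), "Ureaplasma_A"), "Lactobacillus_A"),
    ("Escherichia_A", "Salmonella_A"),
)

# Wall-less organisms are picked out by the "plasma" in their names.
wall_less = {name for name in leaves(toy_tree) if "plasma" in name}
sisters = sister_leaves(toy_tree, wall_less)
print(sisters)
```

On a real tree, the same question is answered visually by collapsing branches in the output cell, as described above.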



Further analysis and next steps

The Genome Set object generated by this app can be used as input for the Compare Genomes from Pangenome app. We encourage you to take a moment to familiarize yourself with the tutorial on building a pangenome. As an exercise, see if you can curate the genome set and build a pangenome in the Narrative Interface to identify genes involved in cell wall biosynthesis.