Compute Pangenome | KBase App

Allows users to compute a pangenome from a set of individual genomes.

The Compute Pangenome App allows users to group all proteins present in an input set of genomes into families based on sequence homology. A Pangenome in KBase lists all distinct protein families identified in a set of input genome sequences, as well as the IDs of the proteins present in each family. This grouping permits a powerful set of comparisons to be performed, such as evaluating the consistency of functional assignments for highly homologous proteins, or evaluating the extent to which each protein family has been conserved across the entire input set of genomes. Protein families present in all input genomes are said to be "core" protein families, while families missing from one or more genomes are peripheral (sometimes termed accessory) families. Proteins in peripheral families are likely linked to more phylogenetically recent adaptations by a species to its environment. Together, the protein families present in any of the input genomes constitute the pangenome of this set.
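As an illustration of the core versus peripheral distinction, the following minimal Python sketch splits families by whether they occur in every input genome. The function name classify_families and the input layout (a dict mapping a family ID to a list of (genome ID, protein ID) pairs) are assumptions made for this example; they are not the structure of the KBase Pangenome object.

```python
def classify_families(families, genome_ids):
    """Split protein families into core and peripheral (accessory) lists.

    families:   dict mapping family_id -> list of (genome_id, protein_id) pairs
                (a hypothetical layout, not the KBase Pangenome object schema)
    genome_ids: iterable of all genome IDs in the input set
    """
    all_genomes = set(genome_ids)
    core, peripheral = [], []
    for family_id, members in families.items():
        # Genomes that contribute at least one protein to this family.
        present_in = {genome_id for genome_id, _ in members}
        if all_genomes <= present_in:
            core.append(family_id)        # found in every input genome
        else:
            peripheral.append(family_id)  # missing from at least one genome
    return core, peripheral
```

A family with members in every genome lands in the first list; everything else, including singletons, lands in the second.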

Prior to using this tool, we strongly encourage users to build a phylogenetically closely related set of genomes using the Insert Genome into Species Tree App. Ideally, the set of genomes should share properties (e.g., belonging to a common taxonomic lineage, having similar core metabolisms) because the pangenome analysis will elucidate similarities and differences in the protein families across this set of genomes. It is important that all the genomes included in the input set be phylogenetically close (e.g. same family level: ~90% 16S nucleotide identity; ~80% whole-genome aligned fraction identity).

The current KBase algorithm to compute a protein pangenome is a k-mer-based method, which offers the advantage of greater speed at the cost of lower resolution of protein sequence differences. Specifically, we identify all the distinct k-mers of length 8 occurring in each protein of each genome. If a k-mer occurs in more than five proteins within a single genome, we remove the k-mer from consideration. Next, if we find that two proteins either within the same genome or across multiple genomes share at least five k-mers in common, we lump the proteins together into a single protein family, and we lump the k-mer databases for the proteins together into a single k-mer database for the protein family. We proceed in this manner, analyzing additional proteins and building up our list of distinct protein families and their associated k-mer databases on the fly. Once all genomes have been analyzed in this way, we rescore every protein family, breaking up families that are overly diverse and recomputing the number of k-mer matches between each protein and its assigned family. All data is then stored in the Pangenome object.
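The grouping steps described above can be sketched roughly as follows. This is a simplified illustration and not the KBase implementation: the constants are taken from the text, but the function names, the input layout, and the choice to join the first matching family are assumptions, and the final rescoring step that splits overly diverse families is omitted.

```python
from collections import Counter

K = 8                      # k-mer length given in the text
MAX_PROTEINS_PER_KMER = 5  # k-mers found in more proteins of one genome are dropped
MIN_SHARED_KMERS = 5       # shared k-mers required to join a family

def kmers(seq, k=K):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_families(genomes):
    """Greedy family construction (illustrative only).

    genomes: dict genome_id -> dict protein_id -> amino acid sequence
    Returns a list of families, each a dict with 'members' and 'kmers'.
    """
    families = []
    for genome_id, proteins in genomes.items():
        protein_kmers = {pid: kmers(seq) for pid, seq in proteins.items()}
        # Number of proteins in this genome that contain each k-mer.
        counts = Counter(km for kms in protein_kmers.values() for km in kms)
        for pid, kms in protein_kmers.items():
            kept = {km for km in kms if counts[km] <= MAX_PROTEINS_PER_KMER}
            # Assumption: join the first family that shares enough k-mers.
            match = next(
                (f for f in families if len(f["kmers"] & kept) >= MIN_SHARED_KMERS),
                None,
            )
            if match is None:
                families.append({"members": [(genome_id, pid)], "kmers": set(kept)})
            else:
                match["members"].append((genome_id, pid))
                match["kmers"] |= kept  # merge the k-mer databases
    return families
```

Because each family keeps a growing k-mer set, a protein can join a family through k-mers contributed by any earlier member, which is one reason a later rescoring pass is needed in the real method.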

One limitation of a protein k-mer-based approach is that it does not work for distantly related species. For example, while B. subtilis and E. coli share many proteins with common functions, there are very few k-mers of length 8 shared between proteins in these species. For this reason, our k-mer-based protein family identification algorithm will fail to construct accurate families for more distant organisms, and all genomes in the input set should be phylogenetically close, as recommended above.
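To see why sequence divergence erodes shared k-mers so quickly, consider that a single amino acid substitution changes every k-mer that overlaps the substituted position, which is up to 8 k-mers for k = 8. The short Python sketch below counts shared 8-mers between two sequences; the function name shared_kmer_count and the toy sequences are made up for illustration.

```python
def shared_kmer_count(seq_a, seq_b, k=8):
    """Count distinct k-mers that occur in both protein sequences."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return len(kmers_a & kmers_b)

# Toy 20-residue sequence (13 distinct 8-mers) and a copy with one
# substitution in the middle (M -> A at index 10).
original = "ACDEFGHIKLMNPQRSTVWY"
mutated = original[:10] + "A" + original[11:]

print(shared_kmer_count(original, original))
print(shared_kmer_count(original, mutated))
```

In this toy case one substitution removes 8 of the 13 shared 8-mers, so a handful of scattered substitutions is enough to push a pair of homologous proteins below a five k-mer threshold.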

The output for this App consists of a Pangenome object and a summary data table with three tabs displayed in the Narrative. The first tab corresponds to an overview of the results, including the total number of proteins assigned to families and singletons and the list of genomes considered in the analysis. The second tab, Shared homolog families, shows a table of counts of protein families found in common between all pairs of genomes in the analysis. The third tab, Protein families, displays all individual sequence families determined in the pangenome analysis and includes putative functional assignments, the number of proteins assigned to each family, and the number of genomes those proteins were found in.
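The pairwise counts shown in the Shared homolog families tab can be approximated with a short Python sketch like the one below. The input layout (family ID mapped to a list of (genome ID, protein ID) pairs) and the function name are assumptions for illustration; the table KBase displays may be computed or laid out differently.

```python
from collections import Counter
from itertools import combinations

def shared_family_counts(families):
    """Count, for each pair of genomes, the families present in both.

    families: dict family_id -> list of (genome_id, protein_id) pairs
              (hypothetical layout, not the KBase Pangenome object schema)
    Returns a Counter keyed by (genome_a, genome_b) with genome_a < genome_b.
    """
    counts = Counter()
    for members in families.values():
        # Each genome counts once per family, however many proteins it contributes.
        genomes = sorted({genome_id for genome_id, _ in members})
        for pair in combinations(genomes, 2):
            counts[pair] += 1
    return counts
```

Keys are sorted genome ID pairs, so a lookup such as counts[("gA", "gB")] gives the number of families the two genomes have in common, and pairs with no shared family return 0.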

Team members who developed and deployed the algorithm in KBase: Chris Henry, Janaka Edirisinghe, Sam Seaver, and Neal Conrad. For questions, please contact us.

Related Publications

Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163, https://www.nature.com/articles/nbt.4163

App Specification:

https://github.com/kbaseapps/GenomeComparison/tree/d1a44addac7a87e700b65df0c209aa8b9183cacf/ui/narrative/methods/build_pangenome

Module Commit: d1a44addac7a87e700b65df0c209aa8b9183cacf