Compare Genomes from Pangenome

Compare isofunctional and homologous gene families for all genomes in a Pangenome.

In this genome comparison algorithm, all of the proteins present in an input Pangenome object are compared based on their function and sequence. This comparison groups proteins with similar sequences into families with putative functions. It can be used to identify groups of homologous proteins with inconsistent annotations (and so help correct annotation errors), or to identify consistently annotated proteins that lack significant sequence similarity. Generally, this comparison method is valuable for: (i) assessing the extent of gene conservation among an input set of genomes, (ii) understanding how genomes have evolved and adapted to their distinctive environments, and (iii) identifying novel genes and interesting biology in the context of related genomes.

This comparison builds on the sequence-based homologous protein families identified in the input Pangenome object, extending the data by also identifying protein families with putatively similar functions. Protein families with putatively similar functions are constructed by looking for sets of proteins with identical functional assignments. This implementation of the protein family construction relies on the use of a consistent annotation ontology and annotation with the most specific terms available among all genomes included in the input set.
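The grouping described above can be illustrated with a minimal Python sketch. This is not the App's actual implementation: the record layout, the example gene data, and the function names `group_by_function`, `inconsistent_families`, and `split_functions` are all hypothetical, chosen only to show how identical functional assignments define function groups and how those groups can be cross-checked against the sequence-based families.

```python
from collections import defaultdict

# Hypothetical records: (genome_id, gene_id, homolog_family_id, function).
# The family IDs stand in for the sequence-based families in a Pangenome object.
GENES = [
    ("genomeA", "a1", "fam1", "DNA polymerase I"),
    ("genomeB", "b1", "fam1", "DNA polymerase I"),
    ("genomeA", "a2", "fam2", "Alcohol dehydrogenase"),
    ("genomeB", "b2", "fam2", "Aldehyde dehydrogenase"),
    ("genomeB", "b3", "fam3", "Alcohol dehydrogenase"),
]


def group_by_function(records):
    """Build function groups: genes whose functional assignment is identical."""
    groups = defaultdict(list)
    for genome_id, gene_id, _family_id, function in records:
        groups[function].append((genome_id, gene_id))
    return dict(groups)


def inconsistent_families(records):
    """Homolog families whose members carry more than one function
    (candidates for annotation review)."""
    functions_by_family = defaultdict(set)
    for _genome_id, _gene_id, family_id, function in records:
        functions_by_family[family_id].add(function)
    return {
        family_id: sorted(functions)
        for family_id, functions in functions_by_family.items()
        if len(functions) > 1
    }


def split_functions(records):
    """Functions whose genes fall into more than one homolog family
    (consistently annotated proteins without shared sequence similarity)."""
    families_by_function = defaultdict(set)
    for _genome_id, _gene_id, family_id, function in records:
        families_by_function[function].add(family_id)
    return {
        function: sorted(families)
        for function, families in families_by_function.items()
        if len(families) > 1
    }


if __name__ == "__main__":
    print(group_by_function(GENES))
    print(inconsistent_families(GENES))
    print(split_functions(GENES))
```

Because the grouping uses exact string matching, two genomes annotated with different ontologies or with terms of different specificity would land in separate function groups, which is why a consistent annotation source matters here.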

The output from this App is a table with four tabs: (i) Overview, (ii) Genomes, (iii) Functions, and (iv) Families. The Overview tab indicates the genome comparison object name, the genome comparison workspace name, the total number of identified core functions, the number of identified core protein families, the KBase-specific protein comparison object ID, the object owner/author, and a timestamp for the creation of the object. The Genomes tab indicates the number of protein families and the number of protein function groups found within each of the queried genomes. The Functions tab indicates the functions identified across the query genomes; specifically, the function name and its associated subsystem, primary class, and secondary class. In addition, the Functions tab indicates the distribution of each function across the query genomes, including the total families and genes identified. The Families tab indicates the distribution of each gene family across the query genomes.
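As a rough illustration of how the counts in the Genomes and Overview tabs could be derived, the Python sketch below tallies families and functions per genome and lists "core" items. It rests on assumptions: the record layout and the names `genome_counts` and `core_items` are hypothetical, and "core" is taken here to mean present in every input genome, which may differ from the definition the App applies.

```python
from collections import defaultdict

# Hypothetical records: (genome_id, family_id, function), one per gene.
RECORDS = [
    ("genomeA", "fam1", "DNA polymerase I"),
    ("genomeB", "fam1", "DNA polymerase I"),
    ("genomeA", "fam2", "Alcohol dehydrogenase"),
    ("genomeB", "fam3", "Alcohol dehydrogenase"),
    ("genomeB", "fam4", "Hypothetical protein"),
]

FAMILY, FUNCTION = 1, 2  # column indexes within each record


def genome_counts(records):
    """Per genome: number of distinct families and distinct functions."""
    families = defaultdict(set)
    functions = defaultdict(set)
    for genome_id, family_id, function in records:
        families[genome_id].add(family_id)
        functions[genome_id].add(function)
    return {
        genome_id: {
            "families": len(families[genome_id]),
            "functions": len(functions[genome_id]),
        }
        for genome_id in families
    }


def core_items(records, column):
    """Items in the given column that occur in every genome.

    Assumption: 'core' means 'found in all input genomes'.
    """
    all_genomes = {record[0] for record in records}
    genomes_with_item = defaultdict(set)
    for record in records:
        genomes_with_item[record[column]].add(record[0])
    return sorted(
        item for item, seen in genomes_with_item.items() if seen == all_genomes
    )


if __name__ == "__main__":
    print(genome_counts(RECORDS))
    print("core families:", core_items(RECORDS, FAMILY))
    print("core functions:", core_items(RECORDS, FUNCTION))
```

In this toy data the alcohol dehydrogenase function is core while neither of its families is, which is the kind of difference between the core function and core family counts that the Overview tab can surface.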

Team members who developed and deployed the algorithm in KBase: Chris Henry and Matt DeJongh. For questions, please contact us.

Related Publications

Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr Opin Genet Dev. 2005;15: 589-594. doi:10.1016/j.gde.2005.09.006, http://www.ncbi.nlm.nih.gov/pubmed/16185861
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial pan-genome. Proc Natl Acad Sci U S A. 2005;102: 13950-13955. doi:10.1073/pnas.0506758102, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1216834/
Rasko DA, Rosovitz MJ, Myers GSA, Mongodin EF, Fricke WF, Gajer P, et al. The Pangenome Structure of Escherichia coli: Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates. J Bacteriol. 2008;190: 6881-6893. doi:10.1128/JB.00619-08, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2566221/

App Specification:

https://github.com/kbaseapps/GenomeComparison/tree/d1a44addac7a87e700b65df0c209aa8b9183cacf/ui/narrative/methods/compare_genomes

Module Commit: d1a44addac7a87e700b65df0c209aa8b9183cacf