Create a Pangenome object by constructing OrthoMCL orthologous groups from a set of Genomes.
Orthologs are homologs separated by speciation events. Paralogs are homologs separated by duplication events. Detecting orthologs has become increasingly important with the rapid progress in genome sequencing.
OrthoMCL is a genome-scale algorithm for grouping orthologous protein sequences. It provides not only groups shared by two or more species/genomes, but also groups representing species-specific gene expansion families. So it serves as an important utility for automated eukaryotic genome annotation.
OrthoMCL starts with reciprocal best BLAST hits within each genome as potential in-paralog/recent paralog pairs and reciprocal best hits across any two genomes as potential ortholog pairs. Related proteins are interlinked in a similarity graph. Then, MCL is invoked to split mega-clusters. This process is analogous to the manual review in COG construction. MCL clustering is based on weights between each pair of proteins, so to correct for differences in evolutionary distance, the weights are normalized before running MCL.
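The pair-finding step described above can be sketched as follows. This is a simplified illustration and not the OrthoMCL implementation: the function names, the `(query, subject, score)` hit tuples, and the `genome_of` lookup are all hypothetical, the score is assumed to be any value where higher means more similar, ties are resolved by first occurrence, and the weight normalization step is left out.

```python
def best_hits(hits, genome_of):
    """Map (query, target genome) to the best-scoring (subject, score).

    hits is an iterable of (query, subject, score) tuples (hypothetical format).
    """
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-hits
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best


def reciprocal_best_pairs(hits, genome_of):
    """Return (orthologs, in_paralogs) as sets of unordered protein pairs."""
    best = best_hits(hits, genome_of)
    orthologs, in_paralogs = set(), set()
    for (query, _target_genome), (subject, _score) in best.items():
        # Look up the subject's best hit back into the query's genome.
        back = best.get((subject, genome_of[query]))
        if back is not None and back[0] == query:
            pair = frozenset((query, subject))
            if genome_of[query] == genome_of[subject]:
                in_paralogs.add(pair)   # reciprocal best hit within one genome
            else:
                orthologs.add(pair)     # reciprocal best hit across two genomes
    return orthologs, in_paralogs
```

In this sketch a pair is kept only when each protein is the other's best hit in the relevant genome, which is the sense of "reciprocal best hit" used in the paragraph above.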
OrthoMCL is similar to the INPARANOID algorithm, but is extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, and an analysis using EC numbers suggests a high degree of reliability.
Overview of OrthoMCL Processing
- All-vs-all BLASTP of the proteins.
- Compute percent match length.
  - Select whichever is shorter, the query or subject sequence. Call that sequence S.
  - Count all amino acids in S that participate in any HSP.
  - Divide that count by the length of S and multiply by 100.
- Apply thresholds to the BLAST results. Keep matches with E-value < 1e-5 and percent match length >= 50%.
- Find potential in-paralog, ortholog, and co-ortholog pairs using the OrthoMCL Pairs program. These are the pairs that are counted to form the Average % Connectivity statistic per group.
- Use the MCL program to cluster the pairs into groups.
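As an illustration of the percent match length and threshold steps listed above, here is a minimal Python sketch. The function names and the HSP tuple layout `(q_start, q_end, s_start, s_end)` with 1-based inclusive coordinates are assumptions made for this example, not the format used by OrthoMCL itself. When query and subject have equal length, the sketch arbitrarily treats the query as S.

```python
def percent_match_length(query_len, subject_len, hsps):
    """Percent of the shorter sequence S covered by at least one HSP.

    hsps: list of (q_start, q_end, s_start, s_end), 1-based inclusive
    (hypothetical layout chosen for this sketch).
    """
    use_query = query_len <= subject_len  # tie: treat the query as S
    length = query_len if use_query else subject_len
    # Project every HSP onto S and sort the intervals by start position.
    intervals = sorted((h[0], h[1]) if use_query else (h[2], h[3]) for h in hsps)
    covered = 0
    last_end = 0  # last position of S already counted
    for start, end in intervals:
        start = max(start, last_end + 1)  # skip residues counted earlier
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / length


def keep_match(evalue, pml, max_evalue=1e-5, min_pml=50.0):
    """Apply the thresholds from the list above."""
    return evalue < max_evalue and pml >= min_pml
```

Overlapping HSPs are merged so that each amino acid in S is counted once, which matches the "participate in any HSP" wording of the step.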
In KBase, the input to OrthoMCL is a set of genomes and/or a list of individual genomes, and the output is a Pangenome object. A pangenome is the set of protein-coding genes in all the selected organisms. It includes genes present in all organisms (the core genome) and genes present in only some organisms. The advanced parameters are options for either the BLAST or the MCL portion of the code.
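The distinction between core and non-core genes can be made concrete with a small sketch. It assumes the gene families are available as a plain dictionary from family ID to gene IDs and that `genome_of` maps each gene to its genome. Both structures and the function name are hypothetical and do not reflect the layout of the KBase Pangenome object.

```python
def split_core(families, genome_of, n_genomes):
    """Split family IDs into core (present in every genome) and the rest."""
    core, non_core = [], []
    for family_id, genes in families.items():
        genomes_present = {genome_of[gene] for gene in genes}
        if len(genomes_present) == n_genomes:
            core.append(family_id)
        else:
            non_core.append(family_id)
    return core, non_core
```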
The output cell has three tabs:
- Pangenome Summary
  - The Summary tab has an overview of the genomes, their genes, homologs, families, and singleton genes.
- Shared homolog families
  - The tab for Shared homolog families has a matrix of all the genomes vs all the genomes. The matrix numbers are the numbers of homolog families that are in common between the row and column genomes. The red numbers on the diagonal are the total number of homolog families in the genome.
- Protein families
  - The tab for Protein families has a list of the homologous protein clusters predicted by OrthoMCL. There is a functional assignment for the cluster, a cluster number, the number of genes in the cluster, and the number of genomes that are part of the cluster. There is a search box for subsetting by keyword, and the columns can be sorted by function and cluster number. OrthoMCL numbers the clusters by the gene count with cluster1 having the most genes and the last clusters being singletons. Clicking on a cluster ID will open a new tab with details about the cluster, including the list of genes and their genomes.
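To make the cluster numbering and the Shared homolog families matrix concrete, here is a rough sketch of how such values could be derived from a list of clusters. The function names and the plain-Python data structures are hypothetical, and this is not how the KBase viewer computes its tables. In particular, the source does not say whether singleton clusters count as homolog families, so the sketch simply counts every cluster it is given.

```python
def number_clusters(groups):
    """Name clusters cluster1, cluster2, ... from most to fewest genes."""
    ordered = sorted(groups, key=len, reverse=True)
    return {f"cluster{i}": genes for i, genes in enumerate(ordered, start=1)}


def shared_family_matrix(families, genome_of, genomes):
    """Count the families shared by each pair of genomes.

    The diagonal entry matrix[g][g] is the number of families containing g.
    """
    matrix = {row: {col: 0 for col in genomes} for row in genomes}
    for genes in families.values():
        present = {genome_of[gene] for gene in genes}
        for row in present:
            for col in present:
                matrix[row][col] += 1
    return matrix
```

For example, with three genomes A, B, and C, a cluster containing genes from A and B adds one to the A-A, B-B, A-B, and B-A cells and leaves the C row and column unchanged.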
In the data panel, the newly created Pangenome object can be downloaded as a tab-separated values (TSV) file or as an Excel file.
- Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003;13: 2178-2189. doi:10.1101/gr.1224503, https://genome.cshlp.org/content/13/9/2178
Module Commit: ec78927c83921ccd6ddc670725ffecc6ab3d96da