Annotate Domains in a Genome - v1.0.10

DomainAnnotation

By: jmc, psnovichkov, rsutormin, dylan

Annotate a Genome object with protein domains from widely used domain libraries.

This App identifies protein domains from widely used domain libraries. It requires a Genome as input, which must already have annotated protein-encoding genes (e.g., those identified using the Annotate Microbial Genome or Annotate Microbial Assembly Apps).

The user must choose one of the following sets of models with which to annotate their Genome:

All domain libraries (details of each set are listed below).
COGs (Clusters of Orthologous Groups) from the NCBI conserved domains database (CDD) version 3.19.
NCBI's CDD models from the NCBI conserved domains database (CDD) version 3.19. This dataset includes only the NCBI-curated domains (including the structural motif "sd" models).
SMART (Simple Modular Architecture Research Tool) version 6.0, from the NCBI conserved domains database (CDD) version 3.19.
PRK (Protein Clusters) version 6.0, from the NCBI conserved domains database (CDD) version 3.19.
Pfam version 35.0 hidden Markov models.
TIGRFAMs version 15.0 hidden Markov models, from the J. Craig Venter Institute.
NCBI Prokaryotic Genome Annotation Pipeline (PGAP) version 8.0 hidden Markov models, from the NCBI.

For the four libraries distributed through CDD (COGs, CDD, SMART and PRK), KBase runs RPS-BLAST version 2.13.0, from the BLAST+ package at NCBI, and reports all domain hits with an E-value of 10^-4 or lower.

For the three HMM libraries (Pfam, TIGRFAMs and NCBIfam, the PGAP models listed above), KBase runs HMMER version 3.3.2 and reports all domain hits at least as significant as the family-specific trusted cutoff set by the curators of each model.
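The two thresholds described above map onto standard command-line options of the underlying tools. The sketch below only builds the argument lists and does not run anything. It is an illustration, not the App's actual invocation: the file and database names are placeholders, and the use of hmmscan (rather than hmmsearch) is an assumption, since the App's exact commands are not given here.

```python
from typing import List

# E-value threshold stated in the description above (10^-4).
RPS_BLAST_EVALUE = 1e-4


def rpsblast_command(query_faa: str, db: str, out_path: str) -> List[str]:
    """Build an rpsblast argument list that reports hits with E-value <= 1e-4.

    query_faa, db and out_path are placeholder names chosen for illustration.
    """
    return [
        "rpsblast",
        "-query", query_faa,
        "-db", db,
        "-evalue", str(RPS_BLAST_EVALUE),
        "-outfmt", "6",  # tabular output
        "-out", out_path,
    ]


def hmmscan_command(query_faa: str, hmm_db: str, out_path: str) -> List[str]:
    """Build an hmmscan argument list that applies each model's trusted cutoff.

    --cut_tc tells HMMER to use the TC thresholds stored in each model
    instead of a single global E-value.
    """
    return [
        "hmmscan",
        "--cut_tc",
        "--domtblout", out_path,  # per-domain tabular output
        hmm_db,
        query_faa,
    ]


if __name__ == "__main__":
    print(" ".join(rpsblast_command("proteins.faa", "Cog", "cog_hits.tsv")))
    print(" ".join(hmmscan_command("proteins.faa", "Pfam-A.hmm", "pfam_hits.tbl")))
```

The contrast is the point of the sketch: the RPS-BLAST libraries share one fixed E-value threshold, while the HMM libraries rely on a per-family cutoff carried inside each model.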

The annotation job may run for a few hours depending on the total number of libraries selected and/or the size of the genome. When the annotation job finishes, a DomainAnnotation object will be stored in your data panel, which can be used to browse the domains that were identified in your genome.

Annotate Domains in a Genome Output
The output report currently consists of two tabs:

Overview: this tab lists the Genome annotated, the model set used, the number of protein-encoding genes that were annotated, and the total number of annotated domains.
Domains: this tab lists the domains that were annotated. It consists of four columns:
- Domain: this column lists the name of the domain and a link to more information about the domain.
- Description: this column lists a detailed text description of the domain.
- # Genes: this column lists the number of protein-encoding genes in this genome with this domain annotation.
- Genes: this column lists the protein-encoding genes with this annotation. Each gene name is a link that opens a new tab listing all the annotated domains for that specific gene.

The user can download the annotations in CSV (comma-separated values) format. The fields in this CSV file are as follows:

Contig - Identifier of the contig containing an annotated feature.
Feature - Identifier of the protein-encoding gene feature annotated with domains.
Feature Start in Contig - 1-indexed position of where the annotated feature starts in the contig.
Feature End in Contig - 1-indexed position of where the annotated feature ends in the contig.
Feature Direction in Contig - Character indicating whether the annotated feature is on the '+' or '-' strand.
Domain Accession - Accession of the domain annotated in the feature.
Domain Start in Feature - 1-indexed amino acid position indicating where the annotated domain starts, relative to the beginning of the feature.
Domain End in Feature - 1-indexed amino acid position indicating where the annotated domain ends, relative to the beginning of the feature.
E-value - E-value for each domain annotation, as returned by the annotation method (HMMER or RPS-BLAST).
Bit Score - Bit score for each domain annotation, as returned by the annotation method (HMMER or RPS-BLAST).
Domain Coverage - Fraction of the length of the protein covered by this domain annotation.
Domain Description - Description of this domain.
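As a rough illustration of working with the downloaded file, the sketch below reads the CSV with Python's standard library and counts the distinct genes carrying each domain, which mirrors the "# Genes" column of the Domains tab. It assumes the file has a header row whose column names match the field names listed above; that layout is an assumption, not something confirmed here. The sample rows, accessions (DOM_A, DOM_B) and all values are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; the header names are assumed to match the field list.
SAMPLE_CSV = """\
Contig,Feature,Feature Start in Contig,Feature End in Contig,Feature Direction in Contig,Domain Accession,Domain Start in Feature,Domain End in Feature,E-value,Bit Score,Domain Coverage,Domain Description
contig_1,gene_0001,100,1000,+,DOM_A,10,120,1e-30,105.2,0.37,Example domain A
contig_1,gene_0002,1500,2400,-,DOM_A,5,110,2e-25,90.1,0.35,Example domain A
contig_1,gene_0002,1500,2400,-,DOM_B,150,290,3e-12,60.4,0.47,Example domain B
"""


def genes_per_domain(csv_text: str) -> dict:
    """Count the distinct features (genes) annotated with each domain accession."""
    genes = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        genes[row["Domain Accession"]].add(row["Feature"])
    return {accession: len(features) for accession, features in genes.items()}


def domain_length(row: dict) -> int:
    """Length of a domain hit in amino acids; both ends are 1-indexed and inclusive."""
    return int(row["Domain End in Feature"]) - int(row["Domain Start in Feature"]) + 1


if __name__ == "__main__":
    for accession, count in sorted(genes_per_domain(SAMPLE_CSV).items()):
        print(accession, count)
```

For a real download, replace `io.StringIO(csv_text)` with an open file handle, for example `open("annotations.csv", newline="")`, where the file name is again a placeholder.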

Team members who developed & deployed this App in KBase: John-Marc Chandonia, Roman Sutormin, and Pavel Novichkov. For questions, please contact us.

Related Publications

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389, https://academic.oup.com/nar/article/25/17/3389/1061651
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421
Eddy SR. Accelerated Profile HMM Searches. PLOS Computational Biology. 2011;7: e1002195. doi:10.1371/journal.pcbi.1002195, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002195
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The Pfam protein families database in 2019. Nucleic Acids Research. 2019;47: D427-D432. doi:10.1093/nar/gky995, https://academic.oup.com/nar/article/47/D1/D427/5144153
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res. 2013;41: D387-D395. doi:10.1093/nar/gks1234, https://academic.oup.com/nar/article/41/D1/D387/1070451
Letunic I, Bork P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 2018;46: D493-D496. doi:10.1093/nar/gkx922, https://academic.oup.com/nar/article/46/D1/D493/4429069
Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res. 2015;43: D257-260. doi:10.1093/nar/gku949, https://academic.oup.com/nar/article/43/D1/D257/2439521
Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, Lu S, et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017;45: D200-D203. doi:10.1093/nar/gkw1129, https://academic.oup.com/nar/article/45/D1/D200/2605748
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, et al. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007;35: D260-264. doi:10.1093/nar/gkl1043, https://academic.oup.com/nar/article/35/suppl_1/D260/1088023
Tatusov RL, Koonin EV, Lipman DJ. A Genomic Perspective on Protein Families. Science. 1997;278: 631-637. doi:10.1126/science.278.5338.631, https://www.ncbi.nlm.nih.gov/pubmed/9381173

App Specification:

https://github.com/kbaseapps/DomainAnnotation.git/tree/093b943ead242d24227978d1df0b713d067beb89/ui/narrative/methods/annotate_domains_in_a_genome

Module Commit: 093b943ead242d24227978d1df0b713d067beb89