Import a file in TSV format from your staging area with new annotations to add to an existing genome
The Import Annotations app adds functional annotations from external (third-party) annotation tools to an existing genome object. Commonly used external tools include BlastKOALA and DeepEC; another common source is textual annotations parsed to MetaCyc terms with Pathway Tools.
You must have a KBase Genome object in your Narrative to import annotations into. You can import your own genome or add one from public data. If you are starting with a FASTA file or an "Assembly" type object, first run an app that produces gene calls, such as RAST or Prokka. Note that if you generate gene calls within KBase, it is important to use these gene calls and Gene IDs when annotating through external sources. To get a FASTA file of your KBase gene calls, use the Text Reports - Genome app to download an mRNA or CDS FASTA file; annotating these sequences ensures that your imported Gene IDs match the Gene IDs in the genome object.
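Before uploading, it can help to confirm that the Gene IDs in your annotation file actually occur in the FASTA file you downloaded from KBase. The following is a minimal sketch, not part of the app. It assumes that the Gene ID is the first whitespace-delimited word after ">" in each FASTA header, which may not hold for every download; the function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence IDs from FASTA text.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def tsv_gene_ids(lines):
    """Collect the gene IDs found in column 1 of a tab-separated file."""
    ids = set()
    for line in lines:
        gene_id = line.rstrip("\r\n").split("\t")[0].strip()
        # Skip blank lines and '#' comment lines
        if gene_id and not gene_id.startswith("#"):
            ids.add(gene_id)
    return ids


def unmatched_ids(fasta_lines, tsv_lines):
    """Return the sorted gene IDs in the TSV that are absent from the FASTA."""
    return sorted(tsv_gene_ids(tsv_lines) - fasta_ids(fasta_lines))


# Small in-memory example; with real files, pass open file handles instead.
fasta = [">gene_1 hypothetical protein\n", "ATGAAA\n", ">gene_2\n", "ATGCCC\n"]
tsv = ["gene_1\tK00001\n", "gene_9\tK00035\n"]
print(unmatched_ids(fasta, tsv))
```

Any ID this reports is likely to show up as unrecognized in the app's output report as well, so it is worth resolving before the import.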
Currently, the import app can only accept one type of annotation per import event. If you have multiple annotation types, these can be imported using the Bulk Import tool.
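If a single tool output mixes several annotation types, one way to prepare it is to split the rows by type first, so that each resulting file holds one type. The sketch below is illustrative only: the two type names and their regular expressions (KEGG Orthology terms such as K00001, and four-part EC numbers) are assumptions chosen for the example, not a list of what the app accepts, and `classify` and `split_by_type` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumed patterns for two common term types (illustrative, not exhaustive)
PATTERNS = {
    "KO": re.compile(r"^K\d{5}$"),
    "EC": re.compile(r"^(?:EC:)?\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:n?\d+|-)$"),
}


def classify(term):
    """Return the name of the first matching pattern, or 'unknown'."""
    for name, pattern in PATTERNS.items():
        if pattern.match(term):
            return name
    return "unknown"


def split_by_type(pairs):
    """Group (gene_id, term) pairs by the annotation type of the term."""
    groups = defaultdict(list)
    for gene_id, term in pairs:
        groups[classify(term)].append((gene_id, term))
    return dict(groups)


pairs = [("gene_1", "K00001"), ("gene_1", "EC:1.1.1.1"), ("gene_3", "K12191")]
for name, rows in split_by_type(pairs).items():
    # Each group could be written to its own two-column TSV file
    print(name, rows)
```

Rows that land in the "unknown" group deserve a manual look before import.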
The Staging Area
The staging area holds a user's uploaded files and is unique to each KBase user. To add your TSV file to the staging area, click the "Add Data" button (or the red plus sign) and drag the file from your computer to the "Import" tab. At this point, the file is ready for the Import App; there is no need to further import the TSV as you might for other file types.
The Annotation file
The Import Annotations app expects a simple two-column tab-separated value (TSV) file with gene IDs in the first column, and the annotation term in the second column. Here is an example of what a valid annotation file might look like:
gene_1	K00001
gene_2	
gene_3	K00035
gene_3	K12191
...
gene_3411	K03667
Depending on the output format from the third-party annotation tool you want to use, you may need to use a small script to adjust some of the formatting. There is some flexibility with the TSV file to accommodate direct output from annotation tools:
- Additional columns will be ignored and will not cause the app to fail (as long as the gene IDs and annotation terms are in columns 1 and 2)
- Any header or comment lines can usually be left in place, and will typically be ignored as lines with an unrecognized gene ID.
- Rows with empty term fields will be ignored (e.g. gene_2 in above example)
- Multiple terms to be added to one gene must be on separate lines (e.g. gene_3 in above example)
- Gene IDs are typically locus tags. If the genome object the annotations are to be added to has alias gene IDs, those can be used in column 1 of the TSV file. Make sure to check the output report for any gene IDs that were not recognized.
- Commonly used prefixes for the annotation terms - such as "EC:" in front of EC numbers - are not required, but are generally tolerated by the Import app. Make sure to check the output report for any annotation terms that were not recognized.
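The kind of small reformatting script mentioned above might look like the following sketch. It reads tab-separated lines, skips "#" comment lines and rows with empty terms, and writes one gene ID and term pair per line. Two parts are assumptions about the external tool's output rather than anything the app requires: that several terms for one gene may share a cell separated by ";" or ",", and that an "EC:" prefix should be stripped (the app generally tolerates it, so this step is optional). The function name and parameters are hypothetical.

```python
import re


def normalize_annotations(lines, id_col=0, term_col=1, prefixes=("EC:",)):
    """Yield (gene_id, term) pairs, one term per pair, from tab-separated lines."""
    for line in lines:
        cells = line.rstrip("\r\n").split("\t")
        # Skip comment lines and rows that lack the term column entirely
        if line.startswith("#") or len(cells) <= max(id_col, term_col):
            continue
        gene_id = cells[id_col].strip()
        if not gene_id:
            continue
        # Assumption: multiple terms in one cell are separated by ';' or ','
        for term in re.split(r"[;,]", cells[term_col]):
            term = term.strip()
            for prefix in prefixes:
                if term.startswith(prefix):
                    term = term[len(prefix):].strip()
            if term:  # rows with empty term fields produce no output
                yield gene_id, term


raw = [
    "# produced by some external tool\n",
    "gene_1\tK00001\n",
    "gene_2\t\n",
    "gene_3\tEC:1.1.1.1;EC:1.1.1.2\tscore=0.9\n",
]
for gene_id, term in normalize_annotations(raw):
    print(f"{gene_id}\t{term}")
```

With real data, `lines` can be an open file handle, and `id_col` and `term_col` can point at whichever columns the tool uses.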
A new genome object will be created and named according to the user-defined "output name". This genome object will contain all prior annotation events, as well as the new one imported from the staging area.
Each annotation event is stored separately in the genome object, and each of the events has a unique description field. Try to keep these as concise as possible to describe the source of the annotations.
It is best practice to make sure all descriptions within a genome object are short and unique. This may not fit all workflows, so internally the descriptions will have the ontology type and a time stamp appended to make sure they are truly unique. Be aware, however, that downstream tools (e.g. the compare, merge and metabolic modeling apps) may produce reports with only the description field and may therefore be ambiguous.
The "add_ontology_summary.html" report gives some statistics on how many features were in the TSV file, and how many Gene IDs did and did not match the genome object. Ideally, there should be no Gene IDs that were not found in the genome, so be sure to inspect any Gene IDs returned in this report to identify why they were not added. If necessary, correct the Gene IDs in your TSV file, then re-upload to the staging area. You can "reset" the Import App to re-import your new TSV. Likewise, make sure to check for any annotation terms which were not recognized. A common source of unrecognized annotation terms is deprecated terms which may still be in use by some annotation tools. For example, some old EC numbers may have been retired or merged with others by the Enzyme Commission, but may still be generated by some annotation tools. You might be able to identify the updated terms that should be used instead.
By default, the Import App runs the Compare Annotations app to produce the "get_ontology_summary.html" report and several interactive plots. Refer to the Compare Annotations app documentation for further explanations.
Team members who developed & deployed algorithm in KBase: Jeffrey Kimbrel, Patrik D'haeseleer, Chris Henry. For questions, please contact us.
Module Commit: ec971d114d57942cef73dc2980c8faf48cea7afe