Import a file in TSV format from your staging area with new annotations to add to an existing genome
The Bulk Import Annotations app is a wrapper around the Import Annotations tool, allowing for the addition of functional annotations from external/third-party annotation tools to an existing genome object. Consult the Import Annotations documentation for more details about its core functionality.
You must have a KBase Genome object in your Narrative to import annotations into. You can import your own genome or add one from public data. If you are starting with a Fasta file or "Assembly" type object, first run an app that will produce gene calls such as RAST or Prokka. Note, if you generate gene calls within KBase, it is important to use these gene calls and Gene IDs when annotating through external sources. To download a protein fasta file of your KBase gene calls, use the Text Reports - Genome app to download an mRNA or CDS fasta file to ensure your imported Gene IDs match the Gene IDs in the genome object.
The Staging Area
The staging area is a user's uploaded files area, and is unique to each KBase user. To add your TSV file to the staging area, click the "Add Data" button (or the red plus sign) and drag the file from your computer to the "Import" tab. At this point, it is ready for the Import App - there is no need to further import the TSV as you might for other file types.
The Annotation file
The Bulk Import Annotations app expects a simple four-column tab-separated value (TSV) file with gene IDs in the first column, the annotation term in the second column, annotation type in the third, and a short description of the annotation source in the fourth. These are the annotation types currently supported:
- "KO" - KEGG orthologs
- "RO" - KEGG reactions
- "EC" - Enzyme Commission numbers
- "SSO" - SEED Subsystem Ontology
- "META" - MetaCyc reaction identifiers
- "MSRXN" - ModelSEED reaction identifiers
- "GO" - Gene Ontology
Here is an example of what a valid annotation file might look like:
gene_1 K00001 KO blastKOALA gene_1 EC:184.108.40.206 EC deepEC gene_2 KO blastKOALA gene_3 K00035 KO blastKOALA gene_3 K12191 KO blastKOALA gene_3 0019151 GO deepGO gene_3 K00035 KO kofamKOALA gene_3 EC:220.127.116.11 EC deepEC ... gene_3411 K03667 KO blastKOALA
This Bulk Import Annotations app will create an annotation event for each unique combination of annotation type (column 3) and description (column 4). The above example will iteratively run 4 instances of the Import App (EC+deepEC, KO+blastKOALA, KO+kofamKOALA and GO+deepGO), resulting in one new genome object. There is some flexibility with the TSV file to accommodate direct output from annotation tools:
- The order of the rows does not matter, and the TSV can be sorted by gene ID, description, unsorted, etc. and the results will be the same
- Additional columns will be ignored, but will not cause the app to fail (as long columns 1-4 are as defined above)
- Any header or comments lines can usually be left in place, and will typically be ignored as a line with an unrecongnized gene ID
- Rows with empty term fields will be ignored (e.g. gene_2 in above example)
- Multiple terms to be added to one gene must be on separate lines, even if they are a part of the same annotation event (e.g. gene_3 in above example)
- Gene IDs are typically locus tags. If the genome object the annotations are to be added to have alias gene IDs, those can be used in column 1 of the TSV file. Make sure to check the output report for any gene IDs that were not recognized.
- Commonly used prefixes for the annotation terms - such as "EC:" in front of EC numbers - are not required, but are generally tolerated by the Import app. Make sure to check the output report for any annotation terms that were not recognized.
A new genome object will be created and named according to the user defined "output name". This genome object will have all prior annotation events, as well as the new one from the staging area.
Each annotation event is stored separately in the genome object, and each of the events has a unique description field. Try to keep these as concise as possible to describe the source of the annotations.
It is best practice to make sure all descriptions within a genome object are short and unique. This may not fit all workflows, so internally the descriptions will have the ontology type and a time stamp appended to make sure they are truly unique. Be aware, however, that downstream tools (e.g. the compare, merge and metabolic modeling apps) may produce reports with only the description field and may therefore be ambiguous.
The "add_ontology_summary.html" report gives some statistics on how many features were in the TSV file, and how many Gene IDs did and did not match the genome object. Ideally, there should be no Gene IDs that were not found in the genome, so be sure to inspect any Gene IDs returned in this report to identify why they were not added. If necessary, correct the Gene IDs in your TSV file, then re-upload to the staging area. You can "reset" the Import App to re-import your new TSV. Likewise, make sure to check for any annotation terms which were not recognized. A common source of unrecognized annotation terms is deprecated terms which may still be in use by some annotation tools. For example, some old EC numbers may have been retired or merged with others by the Enzyme Commission, but may still be generated by some annotation tools. You might be able to identify the updated terms that should be used instead.
By default, the Bulk Import App runs the Compare Annotations app to produce the "get_ontology_summary.html" report and several interactive plots. Refer to the Compare Annotations app documentation for further explanations.
Team members who developed & deployed algorithm in KBase: Jeffrey Kimbrel, Patrik D'haeseleer, Chris Henry. For questions, please contact us.
Module Commit: ec971d114d57942cef73dc2980c8faf48cea7afe