Extract longer contigs
Filter Assembled Contigs by Length allows the user to remove shorter contigs from their Assembly objects. Methods that analyze assemblies then run much faster and operate on the most valuable contigs: the longer ones, where genes are less likely to be truncated and where genome context is available because multiple genes sit on the same contig.
A typical protein domain is 80-200 amino acids for all-alpha or all-beta proteins, whereas alpha+beta and alpha/beta proteins are typically 150-400 amino acids per domain. Because each amino acid is encoded by three base pairs, the absolute minimum contig length that can hold a protein domain, assuming the contig aligns perfectly with it, should be 300 bp. To hold a multi-domain protein, which is the typical case, a contig should be at least 1000 bp (again assuming the lucky case where the contig aligns perfectly and doesn't truncate the gene). It is therefore quite reasonable to filter contigs to at least 2 Kbp (which is our default), or higher if you are trying to get more than one protein per contig. If proteins are your target, you should certainly not go below 300 bp.
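The arithmetic behind these thresholds can be sketched in a few lines of Python. This is only an illustration: the helper name `min_contig_bp` is hypothetical, and it assumes three base pairs per amino acid, a contig that aligns perfectly with the coding region, and no allowance for a stop codon or flanking sequence.

```python
CODON_LENGTH_BP = 3  # three nucleotides encode one amino acid


def min_contig_bp(residues: int) -> int:
    """Shortest contig (in bp) that could span `residues` amino acids.

    Assumes the contig lines up exactly with the coding region
    (hypothetical helper, not part of the App).
    """
    return residues * CODON_LENGTH_BP


# Domain size ranges (amino acids) quoted in the text above.
domain_ranges_aa = {
    "all-alpha or all-beta": (80, 200),
    "alpha+beta or alpha/beta": (150, 400),
}

for fold_class, (low, high) in domain_ranges_aa.items():
    print(f"{fold_class}: {min_contig_bp(low)}-{min_contig_bp(high)} bp")
```

The larger domains alone already approach or exceed 1000 bp, which is consistent with treating 300 bp as a hard floor and 2 Kbp as a practical default.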
Designed and Implemented for KBase by Dylan Chivian
Configuration:
Assembly Object(s): An Assembly object is a collection of assembled genome fragments, called "contigs". Contig length distributions usually differ depending on the input sequence data, the assembler, and the parameterization of the assembler. This App may be run on a single Assembly, several Assemblies, or an AssemblySet object containing multiple Assemblies.
Output:
Output Object: There will be one output Assembly object for each input Assembly. Additionally, if more than one Assembly is input, the output will also include an AssemblySet object that contains the output Assembly objects.
Output Report:
- The report indicates how many contigs were filtered for each Assembly.
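The filtering and per-Assembly counting described above can be sketched as follows. This is a minimal illustration, not the App's actual implementation: the function name `filter_contigs_by_length`, the dictionary layout, and the choice to keep contigs whose length is exactly equal to the threshold are all assumptions.

```python
from typing import Dict, Tuple

DEFAULT_MIN_LENGTH_BP = 2000  # the 2 Kbp default mentioned in the description


def filter_contigs_by_length(
    contigs: Dict[str, str], min_length: int = DEFAULT_MIN_LENGTH_BP
) -> Tuple[Dict[str, str], int]:
    """Return (kept contigs, number removed) for one assembly.

    `contigs` maps contig ID -> sequence. Contigs with length >= min_length
    are kept (assumed boundary behavior; the App may differ).
    """
    kept = {cid: seq for cid, seq in contigs.items() if len(seq) >= min_length}
    removed = len(contigs) - len(kept)
    return kept, removed


# Toy input: two hypothetical assemblies with made-up contigs.
assemblies = {
    "assembly_A": {"c1": "A" * 2500, "c2": "A" * 1999, "c3": "A" * 2000},
    "assembly_B": {"c1": "A" * 300},
}

filtered = {}
report = {}
for name, contigs in assemblies.items():
    filtered[name], report[name] = filter_contigs_by_length(contigs)
    print(f"{name}: kept {len(filtered[name])}, removed {report[name]}")
```

Each input assembly yields one filtered output and one count, mirroring the one-output-per-input behavior and the per-Assembly report.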
Downloadable files: The Assembly objects can be accessed in the Data Pane for download as FASTA.
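Once an Assembly has been downloaded as FASTA, contig lengths can be checked locally. The sketch below is a small hand-written parser using only the Python standard library; the function name `read_fasta` and the example records are hypothetical, and it assumes a plain, uncompressed FASTA file in which the contig ID is the first whitespace-delimited token of each header line.

```python
import io
from typing import Dict, List, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Parse FASTA text into a dict of contig ID -> sequence."""
    parts: Dict[str, List[str]] = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            parts[current_id] = []
        elif current_id is not None:
            parts[current_id].append(line)  # sequence may span several lines
    return {cid: "".join(chunks) for cid, chunks in parts.items()}


# In practice, replace the StringIO with open("my_assembly.fa") (hypothetical path).
example = io.StringIO(">contig_1 demo\nACGT\nACGT\n>contig_2\nAC\n")
lengths = {cid: len(seq) for cid, seq in read_fasta(example).items()}
print(lengths)  # {'contig_1': 8, 'contig_2': 2}
```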
Related Publications
- Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163, https://www.nature.com/articles/nbt.4163
App Specification:
https://github.com/kbaseapps/kb_AssemblyUtilities/tree/4304b6160f959fbc7302c6cf39799b5c37fd6ec6/ui/narrative/methods/run_filter_contigs_by_length
Module Commit: 4304b6160f959fbc7302c6cf39799b5c37fd6ec6