Allows users to extract the longer contigs from Assembly objects.
Filter Assembled Contigs by Length allows users to remove shorter contigs from their Assembly objects. This lets Apps that analyze assemblies run much faster by working only on the most valuable contigs: longer contigs in which genes are less likely to be truncated and which span multiple genes, so that genome context is available. Certain Apps, such as Prokka and MaxBin2, are particularly susceptible to performance issues when given a large number of contigs, so removing low-value short contigs will improve both run time and success rate.
A typical protein domain is 80-200 amino acids for all-alpha or all-beta proteins, whereas alpha+beta and alpha/beta proteins typically have 150-400 amino acids per domain. Since each amino acid is encoded by three bases, a contig needs to be at least about 300 bp to hold a single protein domain, and only if it happens to align perfectly with that domain. To hold a multi-domain protein, which is the typical case, a contig should be at least 1000 bp (again assuming a lucky, perfect alignment so that the contig doesn't truncate the gene). It is therefore quite reasonable to filter contigs to at least 2 Kbp (the default), or higher if you are trying to get more than one protein per contig. If proteins are your target, you should certainly not go below 300 bp.
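The arithmetic behind these thresholds can be sketched as follows. This is a rough illustration only, assuming one 3 bp codon per amino acid and ignoring start/stop codons and any flanking sequence; the function and variable names are hypothetical and are not part of the App.

```python
# Back-of-the-envelope conversion from protein length to coding length.
# Assumption: each amino acid is encoded by one 3-nucleotide codon,
# and start/stop codons and untranslated flanks are ignored.
BP_PER_CODON = 3


def min_coding_bp(n_amino_acids: int) -> int:
    """Nucleotides needed to encode n_amino_acids (hypothetical helper)."""
    return n_amino_acids * BP_PER_CODON


# Domain size ranges quoted in the text above, in amino acids.
domain_ranges_aa = {
    "all-alpha / all-beta": (80, 200),
    "alpha+beta / alpha/beta": (150, 400),
}

# The same ranges expressed in base pairs.
domain_ranges_bp = {
    name: (min_coding_bp(lo), min_coding_bp(hi))
    for name, (lo, hi) in domain_ranges_aa.items()
}

for name, (lo_bp, hi_bp) in domain_ranges_bp.items():
    print(f"{name}: {lo_bp}-{hi_bp} bp")
```

Under these assumptions, a 300 bp contig corresponds to a domain of about 100 amino acids, and a 1000 bp contig to a protein of roughly 330 amino acids, which is why the 2 Kbp default leaves comfortable room for a complete multi-domain gene.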
Assembly Object(s): The Assembly object is a collection of assembled genome fragments, called "contigs". Their length distributions usually differ depending on the input sequence data, the assembler, and the parameterization of the assembler. This App may be run on a single Assembly, several Assemblies, or an AssemblySet object containing multiple Assemblies.
Output Object: The output is one Assembly object for each input Assembly. Additionally, if more than one Assembly is input, the output also includes an AssemblySet object that contains the output Assembly objects. If more than one Assembly or AssemblySet is entered, each assembly that is filtered is saved as a new Assembly object named original_assembly_name with .min_contig_lengthXbp appended to the end (where X is the Min Contig Length entered by the user).
- The report indicates how many contigs were filtered for each Assembly.
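The filtering, output naming, and per-Assembly report described above might look roughly like the sketch below. It is not the App's implementation: the in-memory dictionaries stand in for Assembly objects, the helper names are hypothetical, and keeping contigs whose length is equal to the threshold (>=) is an assumption.

```python
from typing import Dict, Tuple

MIN_CONTIG_LENGTH = 2000  # bp; the default mentioned in the text


def filter_contigs(contigs: Dict[str, str], min_len: int) -> Tuple[Dict[str, str], int]:
    """Keep contigs of at least min_len bp; return (kept, number removed).

    Assumption: a contig exactly min_len long is kept.
    """
    kept = {cid: seq for cid, seq in contigs.items() if len(seq) >= min_len}
    return kept, len(contigs) - len(kept)


def output_name(assembly_name: str, min_len: int) -> str:
    """Build the output name using the suffix convention described above."""
    return f"{assembly_name}.min_contig_length{min_len}bp"


# Toy stand-ins for two input Assemblies (contig id -> sequence).
assemblies = {
    "asm_A": {"contig_1": "ACGT" * 600, "contig_2": "ACGT" * 100},
    "asm_B": {"contig_1": "A" * 2000, "contig_2": "C" * 1999},
}

results = {}
for name, contigs in assemblies.items():
    kept, removed = filter_contigs(contigs, MIN_CONTIG_LENGTH)
    results[output_name(name, MIN_CONTIG_LENGTH)] = (kept, removed)
    print(f"{name}: removed {removed} of {len(contigs)} contigs")
```

Each key in results plays the role of a new output Assembly name, and the removed count is the kind of per-Assembly figure the report summarizes.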
Downloadable files: The Assembly objects can be accessed in the Data Panel for download in FASTA format.
Team members who developed & deployed algorithm in KBase: Dylan Chivian. For questions, please contact us.
- Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163, https://www.nature.com/articles/nbt.4163
Module Commit: acf7ecefee20159051a34bbd3a1d2efbebc143c0