Search for protein matches to an input protein sequence.
This method performs a protein-protein BLASTp search (protein sequence alignment) using NCBI's BLAST+ (version 2.11.0).
BLASTp is a protein sequence search against a protein sequence database. The KBase implementation is restricted to searches of protein-coding genes in a Genome object, the Genome members of a GenomeSet or SpeciesTree, the genes in an Annotated Metagenome Assembly, or the features in a FeatureSet. The results of the search are displayed as a table, saved to a downloadable text file, and saved as a KBase FeatureSet object for later use.
All output formats respect the e-value cutoff threshold. The on-screen table and downloadable files let the user examine the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). On-screen, hits that fail one or more of these three thresholds appear in a gray line with the specific threshold that was not met highlighted in red. The downloadable files give users the most flexibility for exploring alternative thresholds: all hits below the e-value threshold are included in the downloadable text files, so users can examine all of the cutoffs without having to rerun the App. Several NCBI BLAST formats are supported in the downloadable files (discussed below under Extra Text Output).
At this time, KBase does not have an equivalent of NCBI's NR database. Large GenomeSets for searching can be created by inserting genomes into a species tree, annotating an AssemblySet, or adding to and/or merging GenomeSets. Several Apps are available to support these set operations.
Your input must provide either a query protein sequence or an input query object, and it must contain a single protein amino acid sequence. At this time, KBase does not support multiple query protein sequences. The query can be in the form of a SequenceSet object (likely saved by a previous run of BLASTp) or a single amino acid sequence. A nucleotide sequence may produce output (since A, C, G, and T are all valid amino acid characters), but it will more than likely produce an error message.
Input Query Object: You must provide either a query protein sequence or an input query object, and it must contain a single amino acid sequence. A valid query object is a SequenceSet object with a single amino acid sequence.
Query Protein Sequence: If you don't provide an input query object, you must paste in a query protein sequence. The sequence can be given with or without a FASTA header line. If this query protein sequence is used, you must also supply an output name for the single-element SequenceSet object that will be saved. The resulting SequenceSet can then be used in subsequent BLASTp runs.
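As an illustration of the input rules above (a single sequence, with or without a FASTA header), here is a minimal Python sketch. The function names are hypothetical and this is not the App's own parsing code; it only shows one way such a pasted query could be checked before submission.

```python
def parse_single_protein_query(text):
    """Return (header, sequence) from pasted text.

    Accepts input with or without a FASTA header line and rejects
    input that contains more than one record. Hypothetical helper,
    not part of KBase.
    """
    header = None
    seq_parts = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A second header, or a header after sequence lines, means
            # the input holds more than one record.
            if header is not None or seq_parts:
                raise ValueError("only a single query sequence is supported")
            header = line[1:].strip()
        else:
            seq_parts.append(line.replace(" ", ""))
    sequence = "".join(seq_parts).upper()
    if not sequence:
        raise ValueError("no sequence found")
    return header, sequence


def looks_like_nucleotide(sequence):
    """Heuristic: True if the sequence uses only A, C, G, T, or N."""
    return set(sequence.upper()) <= set("ACGTN")
```

A caller might warn the user when `looks_like_nucleotide` returns True, since such a sequence is syntactically valid as protein but is probably not what was intended.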
Search Targets: The search database must be an object in your Narrative containing protein sequences. It may be a FeatureSet of genes, a Genome or a GenomeSet, the Genomes in a SpeciesTree, or an Annotated Metagenome Assembly. More than one object may be added to the Search Targets. The App will automatically generate a database from the narrative objects for BLASTp.
E-value: This sets the maximal e-value threshold for the reported search hits. Hits with e-values above this threshold do not get reported in any of the output formats, i.e., the on-screen table, the text downloads, or the saved FeatureSet.
The following three thresholds only affect the saved FeatureSet object:
- Bit Score: This sets the minimum bit score a hit must reach to be included in the FeatureSet output object. Hits below this threshold are highlighted in red in the on-screen table. Typically, hits with bit scores below 50 are not to be trusted, although a bit score above 50 does not by itself guarantee a trustworthy hit.
- Sequence Identity Threshold (%): This sets the minimum percent sequence identity between the query and each hit for inclusion in the FeatureSet output object. Identity is calculated from the amino acid alignment. The value should be between 1 and 100. Hits below this threshold are highlighted in red in the on-screen table.
- Alignment Coverage Threshold (%) (advanced): This sets the minimum percent alignment coverage (the portion of the query protein sequence length covered by the hit protein sequence in the alignment) for inclusion in the FeatureSet output object. The value should be between 1 and 100. Hits below this threshold are highlighted in red in the on-screen table.
Max Accepts (advanced): Hard limit on how many hits to report. The default is 1000.
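The three thresholds above can be read as a simple per-hit check. The following Python sketch is illustrative only: the `Hit` fields and the function name are hypothetical, and coverage is assumed to be the aligned query length divided by the full query length, as described in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical record for one BLAST hit; field names are assumptions.
    bit_score: float
    percent_identity: float
    query_aligned_length: int
    query_length: int


def failed_thresholds(hit, min_bit_score, min_identity_pct, min_coverage_pct):
    """Return the names of the thresholds this hit falls below.

    An empty list means the hit would pass all three filters.
    """
    coverage_pct = 100.0 * hit.query_aligned_length / hit.query_length
    failed = []
    if hit.bit_score < min_bit_score:
        failed.append("bit_score")
    if hit.percent_identity < min_identity_pct:
        failed.append("percent_identity")
    if coverage_pct < min_coverage_pct:
        failed.append("alignment_coverage")
    return failed
```

A hit with a non-empty result corresponds to a gray row in the on-screen table, with the listed thresholds being the ones shown in red.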
Allow Mistranslation (advanced): It sometimes happens that a eukaryotic contig is mixed in with bacterial or archaeal contigs, such as in a metagenome assembly. Some methods will annotate these genes correctly, but the correct genetic code (e.g. 1 or 4) is not passed through to the generation of the BLAST database of target genes, and the Bacterial and Archaeal code 11 is used instead. For example, if the correct genetic code is 4, this can lead to an internal STOP (aka TER) codon in place of the correct tryptophan (W). This flag can be used to suppress translation of such genes and their inclusion in the BLAST search database when an internal STOP is found. The default is to write mistranslations. Note: genes whose length is not a multiple of 3 are never translated. Such frameshifts or intron splicing must be handled upstream of this App.
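To make the genetic code issue concrete, the sketch below translates the same gene with NCBI translation tables 11 and 4 and checks for an internal STOP. It is a self-contained illustration, not the App's code; the function names are hypothetical, and the amino acid strings follow the NCBI codon ordering (bases in the order T, C, A, G).

```python
BASES = "TCAG"
# Codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Amino acid assignments for translation table 11 (Bacterial and Archaeal)
# and table 4, which reads TGA as tryptophan (W) instead of STOP.
AA_TABLE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_TABLE_4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(dna, aa_table):
    """Translate a coding sequence; return None if length is not a multiple of 3."""
    if len(dna) % 3 != 0:
        return None
    codon_to_aa = dict(zip(CODONS, aa_table))
    dna = dna.upper()
    # Codons with ambiguous bases are reported as X.
    return "".join(codon_to_aa.get(dna[i:i + 3], "X") for i in range(0, len(dna), 3))


def has_internal_stop(protein):
    """True if a STOP (*) occurs anywhere before the final residue."""
    return "*" in protein[:-1]
```

For the gene `ATGTGGTGAAAATAA`, table 11 yields `MW*K*` (an internal STOP at the TGA codon), while table 4 yields `MWWK*`.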
Extra Text Output Format (advanced): NCBI BLAST has several defined output formats (listed under its outfmt option). Among them, the BLAST m=7 (tab-delimited table) text output format is automatically generated and is available for download, so it should not be redundantly requested here. A user may request up to one extra format to be generated and made downloadable. These include:
- 0 Pairwise
- 1 Query-anchored showing identities
- 2 Query-anchored no identities
- 3 Flat query-anchored, show identities
- 4 Flat query-anchored, no identities
- 5 XML Blast output
- 8 Text ASN.1
- 9 Binary ASN.1
- 10 Comma-separated values
- 11 BLAST archive format ASN.1
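For readers who want to reproduce one of these formats with a standalone BLAST+ installation, the sketch below builds a `blastp` argument list. It is an assumption-laden illustration: how the App itself invokes BLAST+ is not described here, the function name and file paths are hypothetical, and only the standard `-query`, `-db`, `-evalue`, `-outfmt`, and `-out` options are used.

```python
# Extra formats offered by the App, keyed by BLAST+ outfmt code.
EXTRA_FORMATS = {
    0: "Pairwise",
    1: "Query-anchored showing identities",
    2: "Query-anchored no identities",
    3: "Flat query-anchored, show identities",
    4: "Flat query-anchored, no identities",
    5: "XML Blast output",
    8: "Text ASN.1",
    9: "Binary ASN.1",
    10: "Comma-separated values",
    11: "BLAST archive format ASN.1",
}
DEFAULT_TABULAR_FORMAT = 7  # always generated by the App


def blastp_args(query_fasta, db_path, evalue, outfmt, out_path):
    """Build a blastp command line as a list of arguments (not executed here)."""
    if outfmt != DEFAULT_TABULAR_FORMAT and outfmt not in EXTRA_FORMATS:
        raise ValueError(f"unsupported output format: {outfmt}")
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_path,
        "-evalue", str(evalue),
        "-outfmt", str(outfmt),
        "-out", out_path,
    ]
```

The returned list can be passed to `subprocess.run` on a machine where BLAST+ is installed and a protein database has already been built.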
BLAST Hits Object: BLAST hits (proteins) that pass all the user-defined filters are saved in an output FeatureSet. This field is for the name of the new FeatureSet.
Output HTML Table: The on-screen table includes all the BLAST hits that meet the e-value cutoff threshold. It includes several columns commonly found in BLAST outputs and a graphic showing the region of the query covered by the BLAST alignment. The table gives users the opportunity to explore the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). On-screen, hits that fail to meet one or more of these thresholds are still included but appear in a gray line with the threshold that was not met highlighted in red. This gives users the opportunity to refine their thresholds, rerun the App, and recreate the output FeatureSet.
Downloadable files: The downloadable files include all the BLAST hits that meet the e-value cutoff threshold. This gives the user the opportunity to explore the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). After download, the thresholds can be explored without rerunning the App. By default, the BLAST output is automatically available for download in a tab-delimited (m=7, formerly m=8) format. Up to one additional format can be selected. The additional formats are found in the advanced parameters as Extra Text Output Format. These formats are not altered from the direct output of the BLAST run.
Output Query Object: If the query protein sequence was used above, it will be saved as a SequenceSet object with a single amino acid sequence. You must supply a name for this new object.
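Exploring alternative thresholds on the downloaded tab-delimited file can be done with a few lines of Python. This sketch assumes the file uses the default BLAST+ tabular columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) with `#` comment lines, and that the query length is known to the caller; the function names are hypothetical.

```python
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def read_tabular_hits(lines):
    """Parse tab-delimited BLAST output, skipping blank and '#' comment lines."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        hits.append(row)
    return hits


def filter_hits(hits, query_length, min_identity, min_bitscore, min_coverage):
    """Keep hits that meet all three thresholds (coverage is in percent of the query)."""
    kept = []
    for hit in hits:
        aligned = abs(hit["qend"] - hit["qstart"]) + 1
        coverage = 100.0 * aligned / query_length
        if (hit["pident"] >= min_identity
                and hit["bitscore"] >= min_bitscore
                and coverage >= min_coverage):
            kept.append(hit)
    return kept
```

With a real download, `read_tabular_hits(open(path))` would supply the hits, and `filter_hits` can then be called repeatedly with different cutoffs.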
Team members who implemented this App in KBase: Dylan Chivian. For questions, please contact us.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi:10.1093/nar/25.17.3389, https://academic.oup.com/nar/article/25/17/3389/1061651
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421
Module Commit: 791f72df62105af2c74f436e8f3452c932e8db68