Search for protein matches to an input nucleotide sequence.
This App performs a nucleotide-protein (translated protein sequence alignment) BLASTx Search using NCBI's BLAST+ (version 2.11.0).
BLASTx is a translated nucleotide sequence (the query) search against a protein sequence database (the subject, a.k.a. target, sequences). The KBase implementation permits single nucleotide queries against subjects that are protein-coding genes in a Genome object, in the Genome members of a GenomeSet or SpeciesTree, the genes in an Annotated Metagenome Assembly, or the features in a FeatureSet. The results of the search are displayed as a table, saved to a downloadable text file, and saved as a KBase FeatureSet object for later use.
All output formats respect the e-value cutoff threshold. The on-screen table and downloadable files give the user the opportunity to examine the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). On-screen, the proteins that fail to pass one or more of these three thresholds will appear in a gray line with the specific failure highlighted in red. The downloadable files give users the most flexibility for exploring alternative thresholds. All hits that are below the e-value threshold are included in the downloadable text files and users can examine all of the cutoffs without having to rerun the App. Several NCBI BLAST formats are supported in the downloadable files (discussed below under extra text output).
At this time, KBase does not have a database equivalent of NR. Large GenomeSets for searching can be created through the insertion of genomes into a species tree, annotation of an AssemblySet, or adding to and/or merging GenomeSets. Several Apps are available to support these set operations.
Your input must provide either a query DNA sequence or an input query object, and it must contain a single DNA sequence. At this time, KBase does not support multiple query DNA sequences. The query can be a SequenceSet object or a single pasted nucleic acid sequence. An amino acid sequence will produce an error message.
Input Query Object: You must provide either a query DNA sequence or an input query object, and it must contain a single nucleic acid sequence. A valid query object is a SequenceSet object with a single nucleotide sequence.
Input Query DNA Sequence: If you don't provide an Input Query Object, you must copy and paste in a query DNA sequence. The sequence may be given with or without a FASTA header line. If this query DNA sequence is used, you must also supply an output name for the single-element SequenceSet object that will be saved. The resulting SequenceSet can then be used in subsequent BLAST runs.
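The input rules above (one sequence, optional FASTA header, nucleotides only) can be illustrated with a minimal sketch. The function name `parse_query` and its error messages are hypothetical and are not taken from the App's code; the check against IUPAC nucleotide letters is an assumption about what counts as a valid query.

```python
def parse_query(text):
    """Return (header, sequence) from pasted text, with or without a FASTA header.

    Hypothetical helper for illustration; not the App's actual implementation.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")

    header = None
    if lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]

    # A second '>' line would mean more than one query sequence.
    if any(ln.startswith(">") for ln in lines):
        raise ValueError("only a single query sequence is supported")

    # Join the sequence lines and drop any internal whitespace.
    seq = "".join("".join(lines).split()).upper()
    if not seq:
        raise ValueError("no sequence found")

    # IUPAC nucleotide codes (assumption: ambiguity codes are acceptable).
    # Note: a short protein made only of these letters would still pass.
    allowed = set("ACGTUNRYKMSWBDHV")
    bad = set(seq) - allowed
    if bad:
        raise ValueError("non-nucleotide characters: " + "".join(sorted(bad)))
    return header, seq


print(parse_query(">q1 test\nATGAAA\nTTTTAA"))
print(parse_query("atg aaa"))
```

A pasted amino acid sequence such as `MKVLAAGIVG` raises `ValueError` in this sketch, because letters like L and I are not nucleotide codes.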
Search Targets: The search database must be an object in your Narrative containing protein sequences. It may be a FeatureSet of genes, a Genome or a GenomeSet, the Genomes in a SpeciesTree, or an Annotated Metagenome Assembly. More than one object may be added to the Search Targets. The App will automatically generate a database from the narrative object for BLASTx.
E-value: This sets the maximal e-value threshold for the reported search hits. Hits with e-values above this threshold are not reported in any of the output formats, i.e., the on-screen table, the text downloads, or the saved FeatureSet.
The following three thresholds only affect the saved FeatureSet object:
- Bit Score: This sets the minimum bit score a hit must have to be included in the FeatureSet output object. Hits below this threshold are highlighted in red in the on-screen table. Typically, hits with bit scores below 50 are not to be trusted (and hits with bit scores above 50 should not be trusted blindly either!).
- Sequence Identity Threshold (%): This sets the minimum percent sequence identity between the query and each hit for inclusion in the FeatureSet output object. Identity is calculated from the amino acid alignment. The value should be between 1 and 100. Hits below this threshold are highlighted in red in the on-screen table.
- Alignment Coverage Threshold (%) (advanced): This sets the minimum percent alignment coverage (the portion of the query nucleotide sequence length covered by the hit protein sequence in the alignment) for inclusion in the FeatureSet output object. The value should be between 1 and 100. Hits below this threshold are highlighted in red in the on-screen table.
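As a rough illustration of how the three thresholds could be applied to one hit, the sketch below returns the list of thresholds a hit fails, which corresponds to what the on-screen table highlights in red. The function name, the dictionary keys, and the coverage formula (aligned query span divided by query length) are assumptions made for this example, not the App's code.

```python
def failed_thresholds(hit, query_len, min_bitscore, min_identity_pct, min_coverage_pct):
    """Return the names of the thresholds that this hit does not meet.

    `hit` is a dict with keys bitscore, pident, qstart, qend (hypothetical layout).
    Query coordinates are in nucleotides; qstart > qend on reverse-strand frames.
    """
    aligned_nt = abs(hit["qend"] - hit["qstart"]) + 1
    coverage_pct = 100.0 * aligned_nt / query_len

    failures = []
    if hit["bitscore"] < min_bitscore:
        failures.append("bit score")
    if hit["pident"] < min_identity_pct:
        failures.append("percent identity")
    if coverage_pct < min_coverage_pct:
        failures.append("alignment coverage")
    return failures


# Made-up hit: covers 150 of 300 query nucleotides (50%).
weak_hit = {"bitscore": 45.0, "pident": 62.5, "qstart": 1, "qend": 150}
print(failed_thresholds(weak_hit, 300, min_bitscore=50, min_identity_pct=30, min_coverage_pct=60))
```

Under this sketch, a hit would go into the FeatureSet only when the returned list is empty.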
Max Accepts (advanced): Hard limit on how many hits to report. The default is 1000.
Allow Mistranslation (advanced): A eukaryotic contig is sometimes mixed in with bacterial or archaeal contigs, such as in a metagenome assembly. Some methods will annotate these genes correctly, but the correct genetic code (e.g. 1 or 4) is not passed through to the generation of the BLAST database of the target genes, and the Bacterial and Archaeal code 11 is used instead. For example, if the genetic code should be 4, this can lead to an internal STOP (aka TER) codon instead of the correct tryptophan (W). When mistranslation is not allowed, genes in which an internal STOP is found are not translated and are excluded from the BLAST search database. The default is to write mistranslations. Note: if an input gene length is not a multiple of 3, the gene is never translated. Such frameshifts or intron splicing must be handled upstream of this App.
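The effect of using code 11 on a gene that really uses code 4 can be shown with a small sketch. The amino acid strings below follow the NCBI genetic code tables for codes 11 and 4 (bases ordered T, C, A, G); the helper names are hypothetical, and the handling of alternative start codons is deliberately ignored.

```python
BASES = "TCAG"
# One amino acid per codon, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
CODE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE_4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Build a codon -> amino acid dict from a 64-letter NCBI-style string."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


def translate(dna, table):
    """Translate a gene; return None if its length is not a multiple of 3."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        return None
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna), 3))


def has_internal_stop(protein):
    """True if a stop ('*') occurs anywhere other than the final position."""
    body = protein[:-1] if protein.endswith("*") else protein
    return "*" in body


gene = "ATGTGAGCTTAA"  # made-up gene: ATG TGA GCT TAA
as_11 = translate(gene, codon_table(CODE_11))
as_4 = translate(gene, codon_table(CODE_4))
print(as_11, has_internal_stop(as_11))  # M*A* True
print(as_4, has_internal_stop(as_4))    # MWA* False
```

In this sketch, TGA reads as an internal STOP under code 11 but as tryptophan (W) under code 4, which is the situation the flag is meant to address.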
Extra Text Output format (advanced): NCBI BLAST has several defined output formats (selected with the option called outfmt). Among them, the BLAST m=7 (tab-delimited table) text output format is always generated and available for download, so it should not be requested again here. A user may request up to one extra format to be generated and made downloadable. The choices are:
- 0 Pairwise
- 1 Query-anchored showing identities
- 2 Query-anchored no identities
- 3 Flat query-anchored, show identities
- 4 Flat query-anchored, no identities
- 5 XML Blast output
- 8 Text ASN.1
- 9 Binary ASN.1
- 10 Comma-separated values
- 11 BLAST archive format ASN.1
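For readers who want to reproduce a comparable run outside KBase, the sketch below assembles BLAST+ `blastx` command lines: one for the always-generated tabular output and, optionally, one for a single extra format. The options `-query`, `-db`, `-evalue`, `-outfmt`, and `-out` are standard BLAST+ options; the function name, the output file names, and the idea of running the search once per format are assumptions for illustration and may not match what the App does internally.

```python
# Extra formats offered by the App, keyed by outfmt number.
EXTRA_FORMATS = {
    0: "Pairwise",
    1: "Query-anchored showing identities",
    2: "Query-anchored no identities",
    3: "Flat query-anchored, show identities",
    4: "Flat query-anchored, no identities",
    5: "XML Blast output",
    8: "Text ASN.1",
    9: "Binary ASN.1",
    10: "Comma-separated values",
    11: "BLAST archive format ASN.1",
}


def blastx_commands(query_fasta, db_path, evalue, extra_format=None):
    """Return a list of blastx argument lists (hypothetical helper)."""
    base = ["blastx", "-query", query_fasta, "-db", db_path, "-evalue", str(evalue)]
    # Tabular output with comment lines is always produced.
    commands = [base + ["-outfmt", "7", "-out", "hits.outfmt7.txt"]]
    if extra_format is not None:
        if extra_format not in EXTRA_FORMATS:
            raise ValueError("unsupported extra format: %r" % (extra_format,))
        out_name = "hits.outfmt%d.txt" % extra_format
        commands.append(base + ["-outfmt", str(extra_format), "-out", out_name])
    return commands


for cmd in blastx_commands("query.fna", "targets_db", 0.001, extra_format=5):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run` on a machine where BLAST+ is installed and a protein database has been built.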
BLAST Hits Object: BLAST hits (proteins) that pass all the user-defined filters are saved in an output FeatureSet. This field is for the name of the new FeatureSet.
Output HTML Table: The on-screen table includes all the BLAST hits that meet the e-value cutoff threshold. It includes several columns commonly found in BLAST output and a graphic showing the region of the query covered by the BLAST alignment. The table gives users the opportunity to explore the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). On-screen, the hits that fail to meet one or more of these thresholds are included but appear in a gray line with the threshold that was not met highlighted in red. This gives users the opportunity to refine their thresholds, rerun the App, and recreate the output FeatureSet.
Downloadable files: The downloadable files include all the BLAST hits that meet the e-value cutoff threshold. This gives the user the opportunity to explore the consequences of the other three thresholds (percent identity, bit score, and alignment coverage). After download, the thresholds can be explored without rerunning the App. By default, the BLAST output is automatically available for download in a tab-delimited (m=7, formerly m=8) format. Up to one additional format can be selected. The additional formats are found in the advanced parameters as Extra Text Output format. These formats are passed through unaltered from the direct output of the BLAST run.
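Exploring alternative thresholds on the downloaded tab-delimited file might look like the sketch below. It assumes the file uses the default BLAST+ tabular columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that comment lines start with `#`; the App's actual column set is not confirmed here, and the sample rows are made up.

```python
import io

# Assumed column layout: the BLAST+ default for tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_COLS = ("pident", "evalue", "bitscore")
INT_COLS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def read_tabular_hits(handle):
    """Parse tab-delimited BLAST hits, skipping blank and '#' comment lines."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError("unexpected number of columns: %d" % len(fields))
        row = dict(zip(COLUMNS, fields))
        for key in FLOAT_COLS:
            row[key] = float(row[key])
        for key in INT_COLS:
            row[key] = int(row[key])
        hits.append(row)
    return hits


def count_passing(hits, min_identity_pct, min_bitscore):
    """Count hits that meet both thresholds."""
    return sum(1 for h in hits
               if h["pident"] >= min_identity_pct and h["bitscore"] >= min_bitscore)


# Made-up example content standing in for a downloaded file.
sample = (
    "# BLASTX 2.11.0+\n"
    "# Query: query1\n"
    "query1\tgeneA\t85.0\t100\t15\t0\t1\t300\t1\t100\t1e-50\t180\n"
    "query1\tgeneB\t32.5\t80\t54\t2\t10\t249\t5\t84\t0.0005\t41.2\n"
)
hits = read_tabular_hits(io.StringIO(sample))
for identity in (30, 40, 90):
    print(identity, count_passing(hits, identity, min_bitscore=50))
```

To use a real download, replace `io.StringIO(sample)` with an open file handle and adjust `COLUMNS` if the file's header comments list different fields.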
Output Query Object: If the Query DNA Sequence was used above, it will be saved as a SequenceSet object with a single nucleotide sequence. You must supply a name for this new object.
The error message "No sequence found in fasta_str" or "local variable 'appropriate_sequence_found_in_one_input' referenced before assignment" is a sign that the query DNA sequence may not consist of nucleotides. It might be an amino acid sequence, which doesn't work with this App.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389-3402. doi:10.1093/nar/25.17.3389, https://academic.oup.com/nar/article/25/17/3389/1061651
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421
Module Commit: 791f72df62105af2c74f436e8f3452c932e8db68