Sequence Homology Search

The Sequence Homology Search lets you search KBase reference genomes and genome features using a DNA or protein sequence. You can then select the matching genomes, genes, or proteins and copy them to a Narrative.

Homology Search Overview

The key components of the homology search page include:

1. Sequence box – You can enter a nucleotide or protein sequence, either as a plain sequence or in FASTA format. Multiple query sequences are currently not supported.

2. Database selection – You can search your sequence against one of the following databases built from all KBase reference genomes:

  • KBase non-redundant gene sequences (NR-ffn)
  • KBase non-redundant protein sequences (NR-faa)
  • KBase genome sequences (fna)
  • Search within select genomes: opens the Advanced Options panel, which allows you to select one or more reference genomes and restrict your search to only those genomes.

The non-redundant gene and protein sequence databases are constructed by grouping identical gene or protein sequences using MD5 checksums. Only one representative sequence from each group is included in the BLAST database. The FASTA definition line for the representative sequence gives the total number of identical sequences it stands for. As more closely related genomes are sequenced and added to the system, using non-redundant sequences keeps searches scalable and efficient. Without non-redundant sequences, the top results of a search might all be identical genes or proteins from closely related genomes, which would prevent users from seeing sequence variation or more distant hits.
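The grouping step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code KBase uses: the function name `build_nonredundant`, the choice of the first sequence seen as the representative, and the `identical_sequences=N` definition-line format are all assumptions made for the example.

```python
import hashlib


def build_nonredundant(records):
    """Collapse identical sequences into one representative each.

    records: iterable of (identifier, sequence) pairs.
    Returns a list of (definition_line, sequence) pairs.
    """
    groups = {}  # MD5 hex digest -> [representative id, sequence, count]
    for identifier, sequence in records:
        # Normalize case and whitespace so identical sequences hash alike.
        normalized = sequence.strip().upper()
        digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
        if digest in groups:
            groups[digest][2] += 1
        else:
            # Assumption: the first sequence seen becomes the representative.
            groups[digest] = [identifier, normalized, 1]
    # Hypothetical definition-line format that records the group size.
    return [
        (f"{rep_id} identical_sequences={count}", seq)
        for rep_id, seq, count in groups.values()
    ]


records = [("gene_1", "ATGC"), ("gene_2", "atgc"), ("gene_3", "ATGA")]
nonredundant = build_nonredundant(records)
```

Here `gene_1` and `gene_2` share a checksum, so only `gene_1` is kept and its definition line records that it represents two sequences.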

The non-redundant gene or protein sequence database is selected automatically, depending on whether the query sequence entered in the box is nucleotide or protein. You can select a different database using the drop-down menu.
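One way to picture the automatic selection is the following Python sketch. The page does not document how the query type is detected, so the character-counting heuristic, the 0.9 threshold, and the function names here are assumptions for illustration only. The database labels NR-ffn and NR-faa are the ones listed above.

```python
def strip_fasta_header(text):
    """Return the bare sequence from plain or single-record FASTA input."""
    lines = [line.strip() for line in text.strip().splitlines()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        # Multiple query sequences are not supported by the search page.
        raise ValueError("only one query sequence is supported")
    return "".join(line for line in lines if line and not line.startswith(">"))


def guess_sequence_type(text, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the residue composition.

    Assumption: a query made up almost entirely of A, C, G, T, U, or N
    is treated as nucleotide; anything else is treated as protein.
    """
    sequence = strip_fasta_header(text).upper()
    if not sequence:
        raise ValueError("empty query sequence")
    nucleotide_like = sum(sequence.count(base) for base in "ACGTUN")
    if nucleotide_like / len(sequence) >= threshold:
        return "nucleotide"
    return "protein"


# Default database per query type, using the labels from the list above.
DEFAULT_DATABASE = {"nucleotide": "NR-ffn", "protein": "NR-faa"}


def default_database(text):
    return DEFAULT_DATABASE[guess_sequence_type(text)]
```

With this sketch, `default_database(">query\nATGCGTACGT")` picks NR-ffn and `default_database("MKVLAAGIVGLLLAQ")` picks NR-faa.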

3. Advanced options – Allows you to select one or more reference genomes and search only against those genomes using the specified program.

Homology Search Advanced Options

The advanced options include:

      1. Genomes – Select one or more reference genomes to narrow your search. As you start typing a genus, species, strain name, or KBase genome identifier, the matching reference genomes are displayed for selection. Once you select a genome of interest, use the “+” button to add another genome. You can remove a selected genome using the “-” button.

        Please note that selecting genomes automatically restricts the search to only those genomes. It will override the KBase non-redundant gene or protein sequence database selected in the Database field.
      2. Search for Genomic Sequences or Genomic Features – This option allows you to search against the genome sequences or against the genomic features (genes or proteins) of the selected genomes. If Genomic Features is selected, the gene or protein databases for the genomes will be selected automatically based on the sequence used in the query.
      3. Program – This allows you to perform the search using one of the following five BLAST programs:
        • blastn: search the nucleotide database using nucleotide query
        • blastp: search the protein database using protein query
        • blastx: search the protein database using a translated nucleotide query
        • tblastn: search the translated nucleotide database using a protein query
        • tblastx: search the translated nucleotide database using a translated nucleotide query

        Note that the appropriate program is automatically selected based on the input query sequence and the selected database. You can override it by selecting another program using the drop-down menu.
      4. Max hits and E value threshold – You can change the maximum number of hits displayed and the E value threshold used for filtering the results. The defaults are a maximum of 50 hits and an E value threshold of 10.
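
As a rough illustration of the Program and Max hits options above, the following Python sketch maps a query type and database type to a BLAST program and then applies the default limits. The lookup table follows the program descriptions in the list. The function names, the hit dictionaries with an `evalue` key, and the sorting by E value are assumptions made for the example and do not describe how the page is implemented.

```python
# (query type, database type) -> BLAST program, per the descriptions above.
# tblastx also pairs a nucleotide query with a nucleotide database, but it
# translates both, so this sketch treats it as a manual override only.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def default_program(query_type, database_type):
    """Pick a program for the given query and database types."""
    return PROGRAMS[(query_type, database_type)]


def filter_hits(hits, max_hits=50, evalue_threshold=10.0):
    """Keep hits at or below the E value threshold, best first.

    hits: list of dicts with an 'evalue' key (a hypothetical structure).
    The defaults mirror the page defaults of 50 hits and an E value of 10.
    """
    kept = [hit for hit in hits if hit["evalue"] <= evalue_threshold]
    kept.sort(key=lambda hit: hit["evalue"])
    return kept[:max_hits]


hits = [
    {"id": "hit_a", "evalue": 1e-30},
    {"id": "hit_b", "evalue": 25.0},
    {"id": "hit_c", "evalue": 0.002},
]
best = filter_hits(hits, max_hits=2)
```

In this example `hit_b` is dropped because its E value exceeds the threshold of 10, and the remaining hits are returned in order of increasing E value.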

