Data Search Guide

KBase’s Data Search functionality lets you run sophisticated searches for data objects of interest and select targeted datasets on which to perform analyses.

The new version of the Narrative Interface also has search capability. Please see the “Explore Data” section of the Narrative Interface User Guide for more information.

Currently searchable data includes metagenomes, genome features, genomes, and metabolic models. This list will be expanding rapidly as we add more data types to our search infrastructure. Please visit the Data Summary page for a complete listing of data types incorporated into KBase.

This guide will show you how to search, sort, filter, and transfer data objects to a Narrative for subsequent analysis with KBase apps and methods.

Note that Data Search does not yet work on user-uploaded data. Please check back frequently for updates.

 

Access the Search Page

There are two ways to access the reference data contained in KBase. From the kbase.us home page, locate the “Data & Tools” dropdown menu and select “Search Reference Data.” This option works even if you do not have a KBase account or are not logged in. However, unless you are logged in, you will not be able to save the results of your searches and transfer them to the Narrative Interface for analysis.

Accessing-Search-Data-2

Registered KBase users can also access the Data Search interface from the Narrative Interface via the “Search Data” menu option in the dropdown menu located in the top left corner.

Accessing-Search-Data

In this guide, we will assume that you have signed in to the Narrative Interface (narrative.kbase.us) with your KBase username and password. If you are not familiar with the Narrative Interface, you may wish to consult the Narrative Guide for an explanation of the major components, or the Narrative Quick Start for a quick overview.

When you access Data Search, it will conduct an initial query of all of the public data in KBase. After a few seconds you should see a short list of major data categories, each with a count showing how many data objects were found in that category. This represents a real-time query of all available reference data in KBase.

Please note: This tool accesses public reference data that has been loaded into KBase. It does not yet include data uploaded by KBase users for use in Narratives, even if that data has been made publicly available.

Search-Box

Key features of this page include:

  1. Search box – When you type a string (i.e., a sequence of characters) into the search box and perform a search, you will get a list of all data objects that have your search string in ANY field, such as “Scientific Name” or “Function.” The special wildcard symbol * (asterisk) represents any sequence of characters. For example, the search string *bac finds genome features for Rhodobacter sphaeroides and Bacillus subtilis, as well as data for Escherichia coli TW10509 and its associated WbaC protein. By default, an asterisk appears by itself in the search box, which matches all searchable objects.
  2. Data categories – KBase currently has thousands of searchable metagenomes, genome features, genomes, and metabolic models. The number of data objects available for each category is listed in parentheses. Selecting one of these categories from the initial search page will display every data object in that category.
  3. Sign-in links – These links are located in the beige box on the left and in the top right corner. You will need to sign in if you want to copy your search results to a Narrative for further analysis. Once you sign in, you will see a Data Cart associated with your username on the left of the search page.

Select-Narrative

 

 

Keyword Search

To execute a keyword search, enter some text related to the data you are looking for. For example, try entering the keyword “arabidopsis” in the search box and click the search icon (the magnifying glass) or press Enter. This searches for your keyword across all categories of searchable data and returns the number of results found in each category (if any).

In this example, the search string “arabidopsis” matches some objects in three data categories, which are listed with a count of objects found.

Search

Search Tip: OR vs. AND

If your search string includes multiple words (e.g., “Bacillus subtilis”), then by default the search will return all objects that have ANY of the words in any field. In other words, there is an implicit OR between “Bacillus” and “subtilis.” To find only the results that contain ALL the words in your search string, use the keyword AND: “Bacillus AND subtilis”. To find only exact matches to a phrase, surround the text you’re looking for with quotation marks. The quoted phrase “Bacillus subtilis”, for example, would not match the text “Bacillus” on its own or the text “subtilis Bacillus”.

Bacillus subtilis

Bacillus AND subtilis
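The difference between the implicit OR, the AND keyword, and a quoted phrase can be pictured with a small Python sketch. This is only an illustration of the matching rules described above, not KBase's search code; the function names and the toy list of names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def matches_any(query, field):
    """Implicit OR: at least one query word occurs in the field."""
    field_words = set(tokenize(field))
    return any(word in field_words for word in tokenize(query))

def matches_all(query, field):
    """AND: every query word occurs in the field, in any order."""
    field_words = set(tokenize(field))
    return all(word in field_words for word in tokenize(query))

def matches_phrase(query, field):
    """Quoted phrase: the query words occur consecutively and in order."""
    q, f = tokenize(query), tokenize(field)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

# Toy field values, made up for illustration.
names = ["Bacillus subtilis", "Bacillus anthracis", "subtilis Bacillus"]

or_hits = [n for n in names if matches_any("Bacillus subtilis", n)]
and_hits = [n for n in names if matches_all("Bacillus subtilis", n)]
phrase_hits = [n for n in names if matches_phrase("Bacillus subtilis", n)]
```

In this sketch the OR search keeps all three names, the AND search drops “Bacillus anthracis”, and the phrase search keeps only “Bacillus subtilis”.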

Search results can be refined by the category of data and filtered according to taxonomic information, genome features, publication information, or type of biological model, depending on the data object type. Clicking on a category name will display a list of search results for that category. For example, notice the Genomes category contains four data objects for the “arabidopsis” query. Click on this category to view these genomes on the results page (see image below).

Search Results

Key components of the search results page include:

  1. Summary table of results – Search results are listed in a table with several columns of information. The column headers vary depending on the data category. For Genomes, they are the Genome ID (a unique ID number assigned to each data object by KBase), Scientific Name, Domain, %GC (GC Content), Contigs, Genes, and DNA Size. Each column can be sorted by clicking on the column header.
  2. Category navigation options – These options (shown on the left side below the header “Show Results for”) allow you to return to seeing all categories. Also, if subcategories exist for the data category you are exploring, those subcategories will appear here and are clickable, allowing you to move up or down in the category hierarchy. For instance, the Metabolic Models category has a subcategory for Flux Balance Analysis Models. A quick way to tell what category your results are in is by finding the green text in the category navigation options. In the screenshot above, notice that the word “Genomes” (under “Back to Any Category”) is green, meaning you are currently in the Genomes category.
  3. Filters – The results page for each data category has one or more filters that can be selected to refine the search results. Each type of filter can be expanded to show possible values for that filter type. Genomes, for example, can be filtered by taxonomy.
  4. Viewing options – There are two groups of buttons above the search results: “Views” and “Items per page.” The default number of results displayed per page is 10. You can increase this number to 25, 50, or 100 by clicking one of the “Items per page” buttons at the top right, a useful option when you have many search results and want to browse them quickly.

The Views buttons display results in either a compact view (default) or an expanded view with additional information in light blue boxes below each result (see image). To see the expanded view of the Arabidopsis genome results, click the second “Views” button. For the Genomes category, the additional information is the Taxonomy lineage for each organism.

Expanded Results

Refining, Filtering, and Sorting Search Results

You can change your search query by replacing or adding to the text in the search box. Keep in mind that if you are looking at search results within one data category, any new search that you perform from this view will search only the category you are currently in. To start a new search across all categories, click the “Return to All Categories” link to access the main search page.

Let’s say we decide to search for all the plant genomes available in KBase. We click “Return to All Categories”, select the Genomes category, and then replace the “arabidopsis” search string with “Viridiplantae” (which will look for all plants plus green algae). Press Enter or the search icon to see the results (see image). Because this search was done from the Genomes category view, only genomes (and no other categories) are searched.

Viridiplantae

 

When you have many results, sorting them can help you find what you are looking for more efficiently. Depending on the type of information in a column of results, columns can be sorted by ascending or descending order or by alphabetical or reverse alphabetical order.

Suppose you want to sort the list of plant genomes by number of contigs in order to locate those genomes with the smallest number of contigs, which tend to be more completely sequenced. Click the “Contigs” column header and select “Sort ascending” from the drop-down options. Notice that when the Contigs header was selected, the color of the column header text turned purple, and the icon below it changed to indicate the type and direction of the sort.

Viridiplantae-sort

You can sort results by more than one column at a time. For example, in addition to sorting your list of plant genomes by the number of contigs, you can do a secondary sort by the DNA size of each organism. Click the “DNA Size bp” column and select ascending order. Now both column headers have purple titles and icons, and they are numbered to indicate the order in which the sorts were applied to your search results. This allows you to keep track of complicated sorts.

You can remove any sort by clicking on the column header and selecting the “Clear this sort” option from the dropdown menu.

Double-Sort
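A primary sort with a secondary sort behaves like sorting on a pair of values. The Python sketch below illustrates this with made-up genome records; the field names and numbers are hypothetical and are not KBase data.

```python
# Made-up records for illustration only.
genomes = [
    {"name": "genome A", "contigs": 7, "dna_size": 120_000_000},
    {"name": "genome B", "contigs": 5, "dna_size": 135_000_000},
    {"name": "genome C", "contigs": 5, "dna_size": 119_000_000},
]

# Sort ascending by contigs first; ties are broken by ascending DNA size.
by_contigs_then_size = sorted(
    genomes, key=lambda g: (g["contigs"], g["dna_size"])
)

order = [g["name"] for g in by_contigs_then_size]
```

Here genomes B and C tie on contig count, so the secondary sort places the smaller genome C ahead of genome B, and genome A comes last.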

Another way to locate the search results that you’re interested in is by using filters. The Filters section at the bottom left of the search page lists one or more types of filters that can be expanded to show fields that you can select to restrict your results. Each filter has a number to the right which indicates how many results match that filter. Clicking the filter checkbox will apply the filter to your results.

Sort
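The number shown next to each filter can be thought of as a count of the current results that carry that value. The following Python sketch illustrates the idea with made-up results; the "domain" field and the helper name apply_filter are assumptions for illustration, not part of KBase.

```python
from collections import Counter

# Made-up search results; only the "domain" field matters here.
results = [
    {"name": "result 1", "domain": "Bacteria"},
    {"name": "result 2", "domain": "Eukaryota"},
    {"name": "result 3", "domain": "Bacteria"},
    {"name": "result 4", "domain": "Archaea"},
]

# The count next to each filter value: how many results match it.
filter_counts = Counter(r["domain"] for r in results)

def apply_filter(items, field, value):
    """Keep only the items whose field equals the selected filter value."""
    return [item for item in items if item[field] == value]

bacteria_only = apply_filter(results, "domain", "Bacteria")
```

Ticking a filter checkbox corresponds to apply_filter in this sketch: the number of rows left matches the count that was displayed beside the filter.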

Note: Eukaryotes and plants are currently absent from the taxonomy filter on the initial data search page. To activate these filtering options, do one of the following:

Sequence Homology Search

The Sequence Homology Search allows you to search for KBase reference genomes and genome features using a DNA or protein sequence, find matching genomes, genes or proteins, select them, and copy them to a Narrative.

Homology Search Overview

The key components of the homology search page include:

1. Sequence box – You can enter a nucleotide or protein sequence, either as a plain sequence or in FASTA format. Multiple query sequences are currently not supported.

2. Database selection – You can search your sequence against one of the following databases built from all KBase reference genomes:

  • KBase non-redundant gene sequences (NR-ffn)
  • KBase non-redundant protein sequences (NR-faa)
  • KBase genome sequences (fna)
  • Search within select genomes: opens the Advanced Options panel, which allows you to select one or more reference genomes and restrict your search to only those genomes.

The non-redundant gene and protein sequence databases are constructed by matching all identical gene or protein sequences using MD5 checksums. Only one representative sequence is included in the BLAST database. The FASTA definition line for the representative sequence summarizes the total number of identical sequences present in the database. As more and more closely related genomes are sequenced and added to the system, using non-redundant sequences makes the searches more scalable and efficient. Without non-redundant sequences, the top results of a search might all be identical genes/proteins from closely related genomes, preventing users from seeing any sequence variations or getting distant hits.
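The idea of collapsing identical sequences by MD5 checksum can be sketched in a few lines of Python. This is a minimal illustration, assuming sequences are compared exactly as given; the function name, the feature IDs, and the "identical_sequences=" label in the definition line are hypothetical and do not reflect the format KBase actually writes.

```python
import hashlib
from collections import defaultdict

def build_non_redundant(records):
    """Group (id, sequence) pairs by MD5 checksum and keep one per group."""
    groups = defaultdict(list)
    for seq_id, sequence in records:
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        groups[digest].append((seq_id, sequence))

    non_redundant = []
    for members in groups.values():
        rep_id, rep_seq = members[0]  # first sequence seen is the representative
        # Hypothetical definition line recording how many identical copies exist.
        defline = f"{rep_id} identical_sequences={len(members)}"
        non_redundant.append((defline, rep_seq))
    return non_redundant

# Toy protein sequences, made up for illustration.
records = [
    ("feature_1", "MKTAYIAKQR"),
    ("feature_2", "MKTAYIAKQR"),
    ("feature_3", "MSLNDEQ"),
]
nr = build_non_redundant(records)
```

In this example, feature_1 and feature_2 share a checksum, so the database keeps a single entry for them with a count of 2.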

Based on the input nucleotide or protein query sequence entered in the box, the non-redundant gene or protein sequence database is selected automatically. You can also select a different database using the drop-down menu to enhance your search.

3. Advanced options – Allows you to select one or more reference genomes and search only against those genomes using the specified program.

Homology Search Advanced Options

The advanced options include:

      1. Genomes – Select one or more reference genomes to narrow your search. As you start typing a genus, species, strain name, or KBase genome identifier, the matching reference genomes are displayed for selection. Once you select a genome of interest, use the “+” button to add another genome. You can remove a selected genome using the “-” button.

        Please note that selecting genomes automatically restricts the search to only those genomes. It will override the KBase non-redundant gene or protein sequence database selected in the Database field.
      2. Search for Genomic Sequences or Genomic Features – This option allows you to search against genome sequences or against genomic features (genes or proteins). If Genomic Features is selected, gene or protein databases for the selected genomes will be chosen automatically based on the sequence used in the query.
      3. Program – This allows you to perform the search using one of the following five BLAST programs:
        • blastn: search the nucleotide database using nucleotide query
        • blastp: search the protein database using protein query
        • blastx: search the protein database using a translated nucleotide query
        • tblastn: search the translated nucleotide database using a protein query
        • tblastx: search the translated nucleotide database using a translated nucleotide query

        Note that the appropriate program is automatically selected based on the input query sequence and the selection of the database. You can override it by selecting another program using the drop-down menu.
      4. Max hits and E value threshold – You can change the maximum number of hits displayed and the E value threshold used for filtering the results. The defaults are a maximum of 50 hits and an E value threshold of 10.

4. Sign In – Sign in to your account to add genomes or gene features to your Narratives for further analysis with KBase Apps.
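The automatic program choice described under Advanced options follows from the type of the query and the type of the database. The Python sketch below shows one way such a default could be derived. It is an assumption-laden illustration, not KBase's logic: the function names are hypothetical, and the nucleotide check is a crude heuristic that treats a query as nucleotide only if it contains no letters other than A, C, G, T, U, and N.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

def guess_sequence_type(query):
    """Crude heuristic: classify a plain or FASTA-formatted query sequence."""
    lines = [line.strip() for line in query.splitlines()]
    # Drop blank lines and FASTA definition lines (those starting with ">").
    residues = "".join(
        line for line in lines if line and not line.startswith(">")
    ).upper()
    if not residues:
        raise ValueError("empty query sequence")
    return "nucleotide" if set(residues) <= NUCLEOTIDE_LETTERS else "protein"

# Default BLAST program for each (query type, database type) pair.
# tblastx also takes a nucleotide query against a nucleotide database,
# so in this sketch it would only be chosen by manual override.
DEFAULT_PROGRAM = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

def default_program(query, database_type):
    """Pick a default program from the query and the database type."""
    return DEFAULT_PROGRAM[(guess_sequence_type(query), database_type)]

example = default_program(">query1\nATGGCGTACGCTTAG", "protein")
```

With a nucleotide query and a protein database, as in the example call, the sketch picks blastx, which translates the query before searching.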

Homology Search Results and Pairwise Alignments

The results from the sequence homology search are presented as a compact tabular view that summarizes the key alignment statistics and as an expanded detail view showing pairwise sequence alignments.

Pairwise #1

The compact tabular view lists the top hits matching the query sequence. It shows the function of the gene/protein hit, the corresponding genome, subject length, percent identity, percent query coverage, percent subject coverage, BLAST score, and E value. These summary statistics allow you to quickly assess the quality of the BLAST hit. For each gene/protein hit, the function is hyperlinked to the corresponding Feature Landing Page, which provides detailed information about the feature. Similarly, the genome name is hyperlinked to the corresponding Genome Landing Page, which provides further information about the genome.
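How the “Max hits” and “E value threshold” settings shape this table can be pictured with a short Python sketch. It assumes that hits at or below the threshold are kept and that hits are listed from lowest to highest E value; both are assumptions for illustration, and the function name and toy hits are hypothetical.

```python
def select_hits(hits, max_hits=50, evalue_threshold=10.0):
    """Keep hits within the E value threshold, best first, up to max_hits."""
    kept = [hit for hit in hits if hit["evalue"] <= evalue_threshold]
    kept.sort(key=lambda hit: hit["evalue"])  # lower E value = better hit
    return kept[:max_hits]

# Made-up hits for illustration only.
hits = [
    {"id": "hit_a", "evalue": 1e-50},
    {"id": "hit_b", "evalue": 25.0},
    {"id": "hit_c", "evalue": 0.002},
    {"id": "hit_d", "evalue": 3.0},
]

with_defaults = [h["id"] for h in select_hits(hits)]
stricter = [h["id"] for h in select_hits(hits, max_hits=2, evalue_threshold=1.0)]
```

With the default settings only hit_b is dropped; tightening the threshold to 1.0 and the limit to 2 leaves hit_a and hit_c.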

The checkboxes at the beginning of every row can be used to select search results and copy them to a Narrative. Please note that if the search is against the gene or protein database, then the objects being copied to the Narrative are genes or proteins as Features. If the search is against the genomic sequence database, then the objects being copied to the Narrative are genomes.

Pairwise #2

You can view the detailed pairwise alignments by:
 

  1. Clicking on the “Expanded Results” button above the table, which results in an expanded view, showing pairwise alignments for all the hits in the table.
  2. Clicking on the right-pointing button in a row to see pairwise alignments only for that hit. Similarly, the expanded view for a row can be toggled back to the collapsed view using the up-pointing button.

Pairwise #3

The key features of the pairwise alignment view include:
 

  1. Alignment summary – summarizes the query and subject lengths, BLAST score, expect value, number of identities, positives, and gaps.
  2. Pairwise alignment – shows the alignment of query and subject sequences.
  3. Number of matches – displayed when a search is against a non-redundant gene or protein database. This number summarizes the total number of identical gene/protein sequences currently present in all KBase reference genomes.

When the results are from the search against a non-redundant database, all identical hits are merged and only one representative hit is shown, instead of showing separate hits for every identical feature, with exactly the same score and alignment.

Pairwise #4

There is a blue expand/collapse button next to the “Number of matches”. When clicked, it shows the list of identical genes or proteins. The protein function and genome names are hyperlinked to the feature and genome landing pages, respectively, for detailed information.

Transfer Search Data into Narrative

When you are ready to transfer your search results to your Narrative to analyze them, select the desired Narrative using the “Select a Narrative” button in the top left of the search results page. Clicking this will display a list of Narratives that you own or have access to. Find the desired Narrative in the list and click the name to select it.

Select-Narrative

If you have not already done so, select the data you wish to transfer by ticking the checkbox to the left of each data object you want. (As you do this, the Selections count above the shopping cart icon will go up.)

Select Data

Once you have clicked the checkboxes for all of the data objects you wish to transfer, locate the blue button under the image of the shopping cart and click it to transfer all of the selected data into your Narrative. (If you want to deselect all of the data you selected, click the red trash can button.)

Transfer-Data-to-Narrative

Use Transferred Data in a Narrative

After you have transferred search results into your Narrative, you can analyze this data in the Narrative Interface using various Apps and Methods. In your browser, locate the menu in the top left corner and click “Narrative” to access the Narrative Interface from the Search page.

Go-to-Narrative

There are various analyses that you can run from the Narrative Interface, with more being added frequently. For a full description of the Narrative Interface, see the Narrative Interface Guide.

Narrative

 

The data object you have transferred will appear in the Data Panel in the top left of the Narrative Interface. Clicking the KBase ID of the data object, “kb|g.3899” in this instance, will trigger two changes in the Narrative Interface to assist your analysis:

  1. A filter will be applied to the Apps and Methods section to display only those analytical tools that can be used on that data type. In this case, notice that the filter “type:Genome” has been applied to display Apps and Methods that can be used with the Arabidopsis thaliana genome we transferred.
  2. A viewer will be opened in the Main Narrative Panel that displays useful information about the data. For a genome, this includes taxonomic information, the number of known contigs and genes, the length and number of genes present in the contiguous sequences, and functional information about the genes.

There are many other ways to use search results as inputs to KBase analysis tools. Check back soon for more examples, or try experimenting! Keep in mind that the search interface is still in an early phase of development. Please see our Report an Issue page for information on submitting bug reports or questions.

Explore Data Landing Pages

After performing a search, you may want to see more information about a data object. Notice that in your search results, some of the columns contain information marked with blue text. This text links to the “Data Landing Page” for a selected data object. A Data Landing page provides a detailed summary of an object, with links that enable further data exploration. The number of KBase data types that have Data Landing pages is increasing rapidly.

Since Data Landing pages open in another tab in your web browser, you can move between search results and Data Landing pages simply by clicking your browser tabs.

Genome Data Landing Pages

Within the Genomes category, the Scientific Name links to the Genome Data Landing Page for each organism. In the table listing genomes from your “arabidopsis” query, click on Arabidopsis thaliana to access the Data Landing page for this organism. Note that it can take a while for all the data in the Data Landing page panels to load.

 

Data Summary Page

This screenshot captures only a few of the panels available on the Data Landing page for this genome. Note that each panel is labeled with “kb|g.3899,” the KBase ID for the Arabidopsis thaliana genome. All panels can be moved around the page, removed completely or collapsed by hovering your cursor over the upper right corner to reveal the close and collapse buttons. These options allow you to customize the layout of the Data Landing page you are viewing.

Some panels on this page, such as the Genome Overview, are populated using information from the object. Others contain additional organism information that KBase pulls from external sources like Wikipedia. Several panels in this view may be empty because they are meant to hold information created and owned by a user. For example, scroll down and locate the Taxonomy panel.

Taxonomy

If signed in, you can launch a new Narrative from this panel to build a species tree for the Arabidopsis thaliana genome and run other KBase apps and methods on this data.

Organism Information Panels

Additional panels on the Genome Data Landing page let you browse and explore information about the organism’s contigs and gene list (see image below of the Contig Browser). You can also view a Publications list that provides journal, author, date, and title information. Titles are linked to the corresponding PubMed abstract, which can be read on the Data Landing page itself by hovering over the title.

searchQScontig

Data Object Usage and Provenance Panels

In addition to organism information, a Genome Data Landing page also has panels that track how you and others are using that genome in KBase. For example, notice the panels on the Arabidopsis thaliana page that show which Narratives the genome appears in, who in KBase has used the genome, and a list of data objects referencing the genome. These panels are included on every type of Data Landing page, not just Genomes.

Also, an Object Reference and Provenance Graph at the bottom of the page gives a visual representation of the history and activity of a data object in KBase. Planned for every Data Landing page, these visualizations are interactive, allowing you to hover over portions of the graph to see object details and provenance information and to adjust the view of the graph to center it around a selected object. The Arabidopsis thaliana graph is relatively simple, but the screenshot below shows the provenance of a data object (a Rhodobacter genome) that has been used in numerous KBase analysis steps.

searchQSprovenance

Other Types of Data Landing Pages

Although the Genomes category contains links to only Genome Data Landing pages, other data categories may contain more than one Data Landing page link. For example, results in the Genome Features category have links to Data Landing pages for Features as well as Genomes.

To explore the Data Landing pages for Features, return to the web browser tab that has your search results. In the category navigation options on the left, choose “Return to All Categories” and then select “Genome Features.” Notice that, in addition to the Scientific Name column, the Feature ID column links to Data Landing pages too.

searchQSgenomefeatures

Click the first entry in the Feature ID column to view the Data Landing page for a Locus Feature. Notice that the set of panels displayed on the Locus Feature Data Landing page differs somewhat from the panels on the Genome Data Landing page. For example, the Feature Data Landing page has a Biochemistry panel in place of the Taxonomy panel that is found on Genome Data Landing pages.

searchQSbiochem