Cluster Expression Data — K-Means

Not yet updated for Release 3.0

The instructions in this document are for Release 2.0. The December 2016 release looks a bit different, though the overall operation is similar. This document will be updated soon.

Description of tutorial

This tutorial will guide you through the steps needed to use the Cluster Expression Data — K-Means app in the KBase Narrative Interface.

In this tutorial, we will:

  • Add an expression matrix to our Narrative.
  • Find and insert the Cluster Expression Data — K-Means app into our Narrative.
  • Use the app to generate a set of k-means clusters for the selected expression matrix.
  • Examine the resulting k-means clusters.
  • Describe how this app can be used to investigate patterns of gene expression.

Description of the app

This app enables users to observe and analyze patterns of gene expression by grouping expression data via k-means clustering, a data-partitioning algorithm that assigns each of n observations to exactly one of k clusters. K-means clustering is useful for discovering functionally related sets of genes, investigating regulatory networks of gene expression, and deducing unknown gene functions by observing and grouping their expression patterns in different conditions.
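To make the partitioning idea concrete, here is a minimal Python sketch of standard k-means (Lloyd's algorithm) run on a toy expression matrix. It is an illustration only and is not the implementation the KBase app uses; the function name kmeans, the gene labels, and the expression values are all made up for this example.

```python
import random

def kmeans(rows, k, iterations=100, seed=0):
    """Assign each row (one feature's expression values) to exactly one of k clusters."""
    rng = random.Random(seed)
    # Start from k distinct rows chosen at random as the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [None] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # no assignment changed, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: four genes measured under three conditions.
expression = {
    "geneA": [1.0, 1.1, 0.9],
    "geneB": [1.2, 1.0, 1.1],
    "geneC": [8.0, 8.2, 7.9],
    "geneD": [7.8, 8.1, 8.0],
}
labels = kmeans(list(expression.values()), k=2)
clusters = dict(zip(expression, labels))
```

With k set to 2, the two low-expression genes end up in one cluster and the two high-expression genes in the other. Which cluster receives the label 0 or 1 depends on the random starting centroids.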

For more information, please see the details page for this app.

Description of the input

The Cluster Expression Data — K-Means app takes as input an expression matrix that references features in a given genome and contains information about gene expression measurements taken under given sampling conditions. Before importing an expression dataset, a genome associated with the features listed in the expression data must be added to the Narrative. Detailed instructions for uploading an expression matrix can be found in the Data Upload and Download Guide.
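As a rough picture of what such a matrix looks like outside KBase, the sketch below reads a small tab-separated table into Python. It assumes a layout with a header row of condition names and a feature ID in the first column of every other row; that layout, the function name read_expression_tsv, and the sample values are assumptions for illustration, so check the Data Upload and Download Guide for the format KBase actually expects.

```python
import csv
import io

def read_expression_tsv(handle):
    """Parse a tab-separated expression table into (conditions, {feature_id: values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    conditions = header[1:]  # assumed: first header cell labels the feature ID column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(value) for value in row[1:]]
    return conditions, matrix

# Hypothetical two-gene, two-condition table.
sample = "feature_id\tcond1\tcond2\nb0001\t7.5\t8.1\nb0002\t6.2\t6.0\n"
conditions, matrix = read_expression_tsv(io.StringIO(sample))
```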

This app also requires users to specify a value for k.

Description of the output

The Cluster Expression Data — K-Means app generates a FeatureClusters data object that contains the clusters of features identified by the k-means clustering algorithm.

Point and click instructions for using this app

Step 1. Add data that you want to analyze

Before launching the Cluster Expression Data — K-Means app, you will need to add to your Narrative the expression matrix you want to use, as well as the genome associated with the expression data.

For this tutorial, we will use the Escherichia coli K-12 MG1655 genome available in KBase’s reference data collection and a corresponding gene expression dataset from the Many Microbe Microarrays Database (M3D).

To add the genome to your Narrative, find the Data Panel along the left side of the screen and click the Add Data button. This will open the Data Browser slideout.

Select the Public tab at the top of the slideout, ensure the search category is set to Genomes, and search for “Escherichia coli K-12.” Mouse over the genome labeled “Escherichia coli str. K-12 substr. MG1655” (the one with 4545 genes) and click the blue Add button. The genome will appear in your Data Panel.

Next, find the gene expression dataset for the MG1655 genome by clicking the following link, which will automatically download the expression file to your computer. (This may take 2 to 3 minutes.)

Once the file has downloaded, unzip it and navigate to the folder named “E_coli_v4_Build_6.” Now locate the file named “” Expression matrices must be in the tab-separated values (TSV) format for import into KBase, so be sure to change the file extension from “.tab” to “.tsv.”
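If you prefer to change the extension from a script rather than your file browser, a small Python sketch like the following would do it. The function name rename_tab_to_tsv and the file name expression_matrix.tab are placeholders for this example; substitute the path of the file you downloaded.

```python
import tempfile
from pathlib import Path

def rename_tab_to_tsv(path):
    """Rename a '.tab' file to '.tsv' in place and return the new path."""
    source = Path(path)
    if source.suffix != ".tab":
        raise ValueError(f"expected a .tab file, got {source.name}")
    return source.rename(source.with_suffix(".tsv"))

# Demonstration on a throwaway file; "expression_matrix.tab" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "expression_matrix.tab"
    demo.write_text("feature_id\tcond1\n")
    renamed = rename_tab_to_tsv(demo)
    renamed_name = renamed.name
    renamed_exists = renamed.exists()
```

Only the extension changes; the contents of the file are left untouched.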

Now you are ready to import this expression dataset into KBase. For instructions, see the Expression Matrix section of the Data Upload and Download Guide. Note that you will need to provide a name for the expression data that you are uploading. For this example, we used “E_coli_ExpressionMatrix.”

Notice that your Data Panel now shows the two objects that you added to your Narrative:


You can find out more about these data objects by mousing over their record in the Data Panel and clicking the “…” that appears. An expanded view of the data object will open:


The icons in this view let you see a data summary, download the object, see its provenance, and more. (Please see the Explore Data section of the Narrative Interface User Guide for more information.)

You can also examine a data object by either clicking on its name in the Data Panel or dragging and dropping it into the main Narrative. This will open a viewer for the selected data, in this case an expression matrix:


Step 2. Add and run the app

Now that you have the needed input data, you can add the Cluster Expression Data — K-Means app to your Narrative. Find the Apps Panel directly below the Data Panel, locate the app in the list, and click its name or icon to add it to your Narrative.

To run the Cluster Expression Data — K-Means app, you must first fill out each input field for the app. In-depth descriptions for all inputs for this app are provided in the app details page.


  • In the Expression Matrix field, use the field’s pulldown list to select the expression dataset that you imported into your Narrative.
  • For Number of Clusters (k), type in the number of clusters that you want to group the expression data into. The value of k can be estimated in various ways, including with KBase’s Estimate K for K-Means Clustering app. For demonstration purposes, we will enter “7” for our k value.
  • In the Output Set of Clusters field, we’ll name the resulting FeatureClusters data object “E_coli_Clusters.”

Notice that as you fill in the required parameter fields, the red arrows next to those fields change to green checkmarks. Once all required fields have a green checkmark, the app is ready to run.

Click the Run button at the bottom of the cell to launch the clustering job. A blue box will appear around the app cell, and a message at the bottom of the cell will indicate that the job was submitted.

Depending on the length of the queue and the size of your dataset, the clustering algorithm may take anywhere from a few minutes to a couple of hours to finish. You can monitor the status of the job by selecting the Jobs tab near the top left of the page.


Step 3. Look at the output

Once complete, the Cluster Expression Data — K-Means app generates a FeatureClusters data object that appears in your Data Panel:


An output cell also appears below the app in the Narrative panel, allowing you to browse the results of this analysis.


The Overview tab contains information about (1) the number of feature clusters identified by the app, (2) the genome associated with the features, and (3) the number of conditions and genes present in the expression matrix.

Click the Clusters tab to view the calculated data clusters and to access options for further analysis. Since we set the value of k to 7, there are 7 clusters available to explore. Each row in the Clusters tab contains information about the number of genes correlated within a cluster and a value representing the mean correlation of the expressed features contained within the cluster.

In the right column, a dropdown menu labeled Explore Cluster contains several options for further analyzing each cluster, including View expression profile, View pairwise correlation, View in sortable condition heatmap, and Save as a FeatureSet.

For this example, we will examine “cluster_5” because it has a small number of genes and a high correlation value.

View expression profile

This option plots the average expression values of all features contained in the expression matrix against the average expression values of the selected features across the different conditions. Viewing the expression profile enables you to compare the gene expression values within a given cluster of features to the expression values across the full expression dataset. This comparison is useful for observing patterns of variance in gene expression and identifying conditions that correlate with particularly high or low values of expression relative to the rest of the expression data.

The Show/Hide Selected Features button at the top of the plot allows you to view summary statistics of expression values for selected features and sort the features contained in the cluster by these values. Mouse over the points on the plot to view condition-specific average expression values.
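The comparison behind this plot can be sketched in a few lines of Python: average the expression values per condition once over every feature and once over the features in the cluster. The function name condition_means and the toy values below are made up for illustration; the viewer's own calculation may differ in detail.

```python
def condition_means(matrix, feature_ids):
    """Average expression per condition over the given features."""
    rows = [matrix[feature_id] for feature_id in feature_ids]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical matrix: three genes measured under three conditions.
matrix = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [2.0, 5.0, 8.0],
}
all_profile = condition_means(matrix, matrix.keys())            # every feature
cluster_profile = condition_means(matrix, ["geneA", "geneC"])   # features in one cluster
```

Here the cluster average rises faster across the three conditions than the average over all features, which is the kind of difference the plot is meant to expose.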


View pairwise correlation

This option allows you to explore the expression relationship between two features using a heatmap. The heatmap displays features that have positively correlated expression in blue and negatively correlated expression in yellow. The Show/Hide Selected Features button at the top of the plot provides summary statistics of expression values for selected features and lets you sort the features contained in the cluster by these values. Mouse over the colored squares on the heatmap to view the specific correlation value between two features.
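As an illustration of what each square in such a heatmap represents, the sketch below computes a correlation value for every pair of features in a small cluster and maps its sign to the colors described above. We assume Pearson correlation here, which the source does not confirm; the function names and toy profiles are likewise invented for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_matrix(profiles):
    """Correlation for every ordered pair of features, keyed by (feature, feature)."""
    ids = list(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a in ids for b in ids}

def color_for(r):
    """Map a correlation value to the heatmap colors named in the text."""
    if r > 0:
        return "blue"
    if r < 0:
        return "yellow"
    return "neutral"

# Hypothetical expression profiles across three conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.5],
    "geneC": [3.0, 2.0, 1.0],
}
corr = correlation_matrix(profiles)
colors = {pair: color_for(r) for pair, r in corr.items()}
```

In this toy cluster, geneA and geneB rise together and would appear blue, while geneA and geneC move in opposite directions and would appear yellow.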


View in sortable condition heatmap

This option allows you to explore expression data within the sample conditions as a sortable heatmap. Each row displays one condition, and the columns list the minimum expression value, the maximum expression value, the average expression value for all features within that condition, and the standard deviation of those expression values. You can sort the table by any of these statistics by clicking the appropriate column header.

At the bottom of the viewer, you can set the color range for the minimum and maximum expression values to aid in observing expression patterns. Soon, you will be able to mouse over the heatmap to view specific expression values within a condition.
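The per-condition statistics in this table can be reproduced outside KBase with a short Python sketch such as the one below. It is only an approximation of what the viewer shows: we use the population standard deviation, but the source does not say which variant the viewer reports, and the function name condition_stats and the toy values are assumptions.

```python
from statistics import mean, pstdev

def condition_stats(matrix, conditions):
    """One row of summary statistics per condition, computed over all features."""
    table = []
    for index, name in enumerate(conditions):
        values = [profile[index] for profile in matrix.values()]
        table.append({
            "condition": name,
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
            "std": pstdev(values),  # assumption: population standard deviation
        })
    return table

# Hypothetical matrix: three genes measured under three conditions.
conditions = ["cond1", "cond2", "cond3"]
matrix = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [2.0, 5.0, 8.0],
}
table = condition_stats(matrix, conditions)

# Sorting by a column, as clicking a column header does in the viewer.
by_max = sorted(table, key=lambda row: row["max"], reverse=True)
sorted_conditions = [row["condition"] for row in by_max]
```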


Save as a FeatureSet

Soon you will be able to save each cluster as its own FeatureSet for further analysis.