The Expression Matrix data type contains gene expression values taken under given sampling conditions. An Expression Matrix can be used as input for several KBase analyses including the Identify Differentially Expressed Genes, Cluster Expression Data – K-Means, Cluster Expression Data – Hierarchical, and Estimate K for K-Means Clustering methods.
If you are importing expression data from an external source or populating a file with your own data, please ensure that it is formatted properly for use with KBase. The tab-separated values (TSV) file is a tab-delimited text file with one gene per row and one sample observation per column. Please make sure the first label in the first column is “gene-id”, followed by tab-delimited labels for the samples.
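As a quick check of the header rule described above, the following Python sketch reads the first row of a TSV file and confirms that it starts with "gene-id". The function name check_header, the sample labels, and the expression values are made up for illustration; this is not a KBase tool.

```python
import csv
import io


def check_header(tsv_text):
    """Return the sample labels if the header row starts with 'gene-id'."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    if not header or header[0] != "gene-id":
        raise ValueError("the first label in the first column must be 'gene-id'")
    # Everything after the first label is treated as a sample label.
    return header[1:]


# Illustrative content only: sample labels and values are invented.
example = "gene-id\tsample_A\tsample_B\nb0001\t7.2\t7.9\nb0002\t9.1\t8.8\n"
print(check_header(example))
```

To check a file on disk, read its contents first (for example with open(path).read()) and pass the resulting text to check_header.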
Each gene measured in the expression dataset should have an identifier listed in the first column of the TSV file. To ensure that the gene identifiers listed in your dataset correspond to the aliases contained within KBase, start by adding the Genome of the organism used in the expression dataset to the Data Panel in your Narrative. Once the Genome has been added to your Data Panel, click the name of the Genome to open up the viewer. Click on the tab labeled Genes and locate the gene of interest by searching for the name of the function or protein associated with the gene.
Click the Gene ID of the gene of interest to open up a tab with additional information about the gene and then click the Gene ID contained within this tab to open up a Data Landing page for this gene.
On the Data Landing page, locate the section titled Aliases and crosscheck the gene labels contained within your expression dataset with these aliases to ensure that these labels will correspond to features in KBase.
Some of the gene aliases supported by KBase include NCBI, EMBL, UniProt, BioCyc, and ASAP.
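If you have many gene labels to crosscheck, it may help to collect the aliases you find on the Data Landing pages into a list and compare them with your labels in one pass. The sketch below assumes you have copied the aliases by hand; the function name unmatched_labels and the example values are hypothetical, and the code does not query KBase.

```python
def unmatched_labels(dataset_labels, known_aliases):
    """Return the dataset labels that do not appear among the known aliases."""
    known = set(known_aliases)
    return [label for label in dataset_labels if label not in known]


# Hypothetical values: aliases copied by hand from Data Landing pages,
# and the labels found in the first column of an expression TSV file.
aliases_from_kbase = ["b0001", "b0002", "b0003"]
labels_in_dataset = ["b0001", "b0002", "b9999"]

print(unmatched_labels(labels_in_dataset, aliases_from_kbase))
```

Any label returned by this function should be checked again against the Aliases section before you upload the file.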
Each sample condition should be labeled in the first row of the TSV file. The remaining cells in the table contain expression values for the appropriate gene and sample. Be sure to exclude gene features that have no expression values at all or whose expression values do not change across the samples.
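The exclusion rule above can be applied before upload with a short script. The following is a minimal sketch, assuming that missing values appear as empty or non-numeric cells; the function names and the example data are invented for illustration.

```python
import csv
import io
import math


def parse_value(cell):
    """Convert a cell to float; empty or non-numeric cells count as missing."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def filter_rows(tsv_text):
    """Keep the header plus rows that have values which vary across samples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept = [header]
    for row in reader:
        values = [parse_value(cell) for cell in row[1:]]
        present = [v for v in values if not math.isnan(v)]
        if not present:
            continue  # every expression value is missing
        if len(set(present)) == 1:
            # Assumption: a row with a single distinct value is "non-changing",
            # even if that value appears only once next to missing cells.
            continue
        kept.append(row)
    return kept


# Invented example data: b0002 is all missing, b0003 never changes.
example = (
    "gene-id\ts1\ts2\ts3\n"
    "b0001\t7.2\t7.9\t8.1\n"
    "b0002\t\t\t\n"
    "b0003\t5.0\t5.0\t5.0\n"
)
for row in filter_rows(example):
    print("\t".join(row))
```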
Below is an example of a properly formatted expression data file in TSV format. In this case, the gene-ids in the first column correspond to gene identifiers for E. coli K-12 MG1655 genes and the sample conditions are derived from the Many Microbe Microarrays Database (M3D).
Download an empty template for building an expression matrix compatible with KBase to populate with your own data.
Additional Information for Plant Expression Data
For KBase plant genomes, the gene ids retain the data structure from the external source databases (Ensembl or Phytozome) and do not have aliases as mentioned above. When constructing an expression dataset, append your gene ids with the transcript ids followed by “.CDS” as seen in the screenshot below. You can check that you have the correct gene ids using the same method detailed in the Formatting Expression Matrix TSV files section.
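If your plant dataset is labeled with transcript ids, the ".CDS" suffix can be added with a small helper. This is only a sketch: the function name to_cds_id is hypothetical, the example id is illustrative, and the exact id pattern should be confirmed against the gene ids shown for your Genome in KBase.

```python
def to_cds_id(transcript_id):
    """Append '.CDS' to a transcript id unless it is already present."""
    suffix = ".CDS"
    if transcript_id.endswith(suffix):
        return transcript_id
    return transcript_id + suffix


# Illustrative id only; confirm the real pattern in the Genome viewer.
print(to_cds_id("AT1G01010.1"))
```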
Expression datasets can be uploaded into KBase as a tab-separated values (TSV) file with a .tsv or .tab file extension. For this example, we will upload an expression dataset containing expression values for Escherichia coli K-12 MG1655 taken under a variety of sampling conditions from the Many Microbe Microarrays Database (M3D).
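Before uploading, you can confirm that a file name carries one of the two extensions mentioned above. The sketch below uses Python's standard pathlib module; the function name has_supported_extension is hypothetical, and the check is case-sensitive because this guide lists the extensions in lowercase only.

```python
from pathlib import Path

# The two extensions named in this guide.
SUPPORTED_EXTENSIONS = {".tsv", ".tab"}


def has_supported_extension(filename):
    """Return True if the file name ends in .tsv or .tab."""
    return Path(filename).suffix in SUPPORTED_EXTENSIONS


print(has_supported_extension("E_coli_v4_Build_6_chips907probes4297.tab"))
print(has_supported_extension("expression_data.csv"))
```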
To upload an Expression Matrix into KBase successfully, you first need to add the Genome of the organism referenced in the Expression Matrix you wish to upload. For this example, add the Escherichia coli str. K-12 substr. MG1655 Genome to your Data Panel from the Public tab of the Data Browser before importing the expression dataset.
To add the genome to your Narrative, find the Data Panel along the left side of the screen and click the Add Data (or red “+”) button. This will open the Data Browser slideout. Select the Public tab at the top of the slideout, ensure the search category is set to Genomes, and search for “Escherichia coli K-12.” Mouse over the genome labeled “Escherichia coli str. K-12 substr. MG1655” and click the blue Add button.
Now that the genome is loaded into the Narrative, we can import the gene expression dataset. A gene expression dataset for Escherichia coli str. K-12 substr. MG1655 can be downloaded from this link:
Once the file has finished downloading, navigate to the folder named “E_coli_v4_Build_6” and locate the file named “E_coli_v4_Build_6_chips907probes4297.tab” in the list of files.
Select the Expression Matrix type in the dropdown (in the Import tab) and click Next. Select the TSV file for upload and click Import.