Expression Matrix

The Expression Matrix data type contains gene expression values taken under given sampling conditions.

Formatting Expression Matrix TSV files

If you are importing expression data from an external source or populating a file with your own data, make sure it is formatted properly for use with KBase. The tab-separated values (TSV) file is a tab-delimited text file with genes in the rows and sample observations in the columns. The first cell of the header row (the top of the first column) must be “gene-id”, followed by tab-delimited labels for the samples.

Each gene measured in the expression dataset should have an identifier listed in the first column of the TSV file. To ensure that the gene identifiers listed in your dataset correspond to the aliases contained within KBase, start by adding the Genome of the organism used in the expression dataset to the Data Panel in your Narrative. Once the Genome has been added to your Data Panel, click the name of the Genome to open up the viewer. Click on the tab labeled Genes and locate the gene of interest by searching for the name of the function or protein associated with the gene.

Gene Search

Click the Gene ID of the gene of interest to open up a tab with additional information about the gene and then click the Gene ID contained within this tab to open up a Data Landing page for this gene.

Gene-Tab

On the Data Landing page, locate the section titled Aliases and crosscheck the gene labels contained within your expression dataset with these aliases to ensure that these labels will correspond to features in KBase.

Data-Summary

Some of the gene aliases supported by KBase include NCBI, EMBL, UniProt, BioCyc, and ASAP.
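Once you have collected the aliases by hand from the Data Landing pages, the cross-check itself can be scripted. The following is a minimal sketch, not a KBase API: the function name `find_unmatched_ids` is hypothetical, the alias set is assumed to have been copied manually from the Aliases section, and the identifier “b9999” is a made-up value used only to show a mismatch.

```python
def find_unmatched_ids(dataset_ids, known_aliases):
    """Return dataset gene IDs that are absent from the alias set, in input order."""
    known = set(known_aliases)
    return [gene_id for gene_id in dataset_ids if gene_id not in known]


# Hypothetical inputs: aliases copied by hand from KBase, IDs taken from your TSV.
aliases_from_kbase = {"b4634", "b3241", "b3240"}
ids_in_dataset = ["b4634", "b3241", "b9999"]  # "b9999" is made up for illustration

print(find_unmatched_ids(ids_in_dataset, aliases_from_kbase))
```

Any identifier returned by this check should be corrected or removed before the file is uploaded.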

Each sample condition should be labeled in the first row of the TSV file. The remaining cells in the table contain the expression value for the corresponding gene and sample. Be sure to exclude gene features that have no expression values at all or whose expression values do not change across the samples.
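The exclusion rule above can be applied before upload with a short script. This is a sketch under stated assumptions: the function name `drop_uninformative_genes` is hypothetical, missing values are assumed to appear as empty cells or the text “NA”, and the sample labels and the two filtered gene names are invented for illustration.

```python
import csv
import io

# Assumption: missing values appear as empty cells or "NA"; adjust for your data.
MISSING_MARKERS = {"", "NA"}


def drop_uninformative_genes(tsv_text):
    """Return the header plus rows that have at least two distinct expression values."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept = [header]
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(cell) for cell in row[1:]
                  if cell.strip() not in MISSING_MARKERS]
        if not values:
            continue  # every sample is missing for this gene
        if len(set(values)) == 1:
            continue  # the value never changes across samples
        kept.append(row)
    return kept


sample_text = (
    "gene-id\tsample_1\tsample_2\tsample_3\n"
    "b4634\t9.05367\t9.07827\t9.10114\n"
    "gene_missing\t\t\t\n"        # hypothetical gene with no values
    "gene_flat\t5.0\t5.0\t5.0\n"  # hypothetical gene with constant values
)

for row in drop_uninformative_genes(sample_text):
    print("\t".join(row))
```

Note that this sketch also drops a gene whose only non-missing values are identical, which may or may not be what you want for sparsely measured genes.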

Below is an example of a properly formatted expression data file in TSV format. In this case, the gene-ids in the first column correspond to gene identifiers for E. coli K-12 MG1655 genes, and the sample conditions are derived from the Many Microbe Microarrays Database (M3D).

 

gene-id    dinI_U_N0025_r1    dinI_U_N0025_r2    dinI_U_N0025_r3
b4634      9.05367            9.07827            9.10114
b3241      7.20924            7.08695            7.07071
b3240      7.21535            7.14312            7.19478
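A file can be checked against the layout described above before it is uploaded. The sketch below is illustrative only and is not the validation KBase itself performs: the function name `check_expression_tsv` is hypothetical, and the checks cover just the points stated in this section (a “gene-id” header label, one identifier per row, a consistent column count, and numeric values). The example text reproduces the file above with real tab characters between the columns.

```python
import csv
import io


def check_expression_tsv(tsv_text):
    """Return a list of formatting problems found in the TSV text (empty if none)."""
    problems = []
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows or not rows[0]:
        return ["file is empty or has no header row"]
    header = rows[0]
    if header[0] != "gene-id":
        problems.append("first header label should be 'gene-id'")
    if len(header) < 2:
        problems.append("header should list at least one sample label")
    for line_number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        if not row[0].strip():
            problems.append(f"line {line_number}: missing gene identifier")
        for cell in row[1:]:
            if cell.strip():  # empty cells are left to a separate missing-value check
                try:
                    float(cell)
                except ValueError:
                    problems.append(f"line {line_number}: non-numeric value {cell!r}")
    return problems


example_tsv = (
    "gene-id\tdinI_U_N0025_r1\tdinI_U_N0025_r2\tdinI_U_N0025_r3\n"
    "b4634\t9.05367\t9.07827\t9.10114\n"
    "b3241\t7.20924\t7.08695\t7.07071\n"
    "b3240\t7.21535\t7.14312\t7.19478\n"
)

print(check_expression_tsv(example_tsv))
```

A common failure this catches is a file whose columns are separated by spaces rather than tabs, in which case the whole header line is read as a single label.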

Download an empty template for building an expression matrix compatible with KBase to populate with your own data.

Additional Information for Plant Expression Data

For KBase plant genomes, the gene ids retain the data structure from the external source databases (Ensembl or Phytozome) and do not have aliases as mentioned above. When constructing an expression dataset, append your gene ids with the transcript ids followed by “.CDS” as seen in the screenshot below. You can check that you have the correct gene ids using the same method detailed in the Formatting Expression Matrix TSV files section.

Plants1

Upload an Expression Matrix from a TSV formatted file

Expression datasets can be uploaded into KBase as a tab-separated values (TSV) file with a .tsv or .tab file extension. For this example, we will upload an expression dataset containing expression values for Escherichia coli K-12 MG1655 taken under a variety of sampling conditions from the Many Microbe Microarrays Database (M3D).

To successfully upload an Expression Matrix into KBase, you first need to add the Genome that contains the features referenced in the Expression Matrix you wish to upload. For this example, add the Escherichia coli str. K-12 substr. MG1655 Genome to your Data Panel from the Public tab of the Data Browser before importing the expression dataset.

To add the genome to your Narrative, find the Data Panel along the left side of the screen and click the Add Data (or red “+”) button. This will open the Data Browser slideout. Select the Public tab at the top of the slideout, ensure the search category is set to Genomes, and search for “Escherichia coli K-12.” Mouse over the genome labeled “Escherichia coli str. K-12 substr. MG1655” and click the blue Add button.

Genome

Now that the genome is loaded into the Narrative, we can import the gene expression dataset. A gene expression dataset for Escherichia coli str. K-12 substr. MG1655 can be downloaded from this link:

http://m3d.mssm.edu/norm/E_coli_v4_Build_6.tar.gz

Once the file has finished downloading, navigate to the folder named “E_coli_v4_Build_6” and locate the file named “E_coli_v4_Build_6_chips907probes4297.tab” in the list of files.
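If you prefer to unpack the archive with a script rather than a file manager, the following sketch pulls a single named file out of a .tar.gz archive using Python's standard `tarfile` module. The function name `extract_member` is hypothetical, and the sketch assumes the archive has already been downloaded to your computer.

```python
import shutil
import tarfile
from pathlib import Path


def extract_member(archive_path, target_name, dest_dir):
    """Copy the first file named target_name out of a .tar.gz archive.

    Returns the path of the extracted copy, or None if no such file exists.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Compare only the file name so the enclosing folder does not matter.
            if member.isfile() and Path(member.name).name == target_name:
                out_path = dest / target_name
                with archive.extractfile(member) as src, open(out_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return out_path
    return None
```

For the dataset in this example, the call would be `extract_member("E_coli_v4_Build_6.tar.gz", "E_coli_v4_Build_6_chips907probes4297.tab", ".")`, assuming the archive sits in the current working directory.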

In the Import tab of the Data Browser, select the Expression Matrix type and the TSV file to upload by completing the following steps:

  • Choose Expression Matrix from the data type dropdown menu
  • Click the Next button
  • Select the Expression Data TSV file from a directory on your computer
  • Provide a name for the Expression Matrix data object
  • Select the Genome that contains features referenced by the Expression Matrix
  • Select the Value Type that corresponds to the scale of the values in your dataset; if you do not know the value type, select Unknown
  • Click the Import button
  • After the import process has completed, the Expression Matrix data object will appear in your Data Panel

Import