Cluster Expression Data - K-Means

KBaseFeatureValues

v.0.0.22

By: rsutormin, msneddon, psnovichkov, marcin, srividya22

Perform K-means clustering to group expression data for observing and analyzing patterns of gene expression.

This App enables users to observe and analyze patterns of gene expression by grouping expression data via K-means clustering. K-means clustering is useful for discovering functionally related sets of genes, investigating regulatory networks for gene expression, and deducing unknown gene functions by observing and grouping their expression patterns in differing conditions.

Begin by selecting or importing both the expression dataset to analyze and the genome associated with the expression dataset using the Add Data button. Next, specify a value for K. The Estimate K for K-Means Clustering App should be used to assist in determining an optimal value for K. Then provide a name for the output set of clusters. Finally, define the number of starts and iterations, select the K-means clustering algorithm to use for the analysis, and input a random seed value.
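To illustrate how K, the number of starts, the number of iterations, and the random seed interact, here is a minimal sketch in plain Python. It is not the code the App runs (the App is based on the amap package for R); it uses a simple Lloyd-style batch update, and the function names and the toy expression values are hypothetical. Each start draws a different set of initial centroids from the same seeded generator, and the start with the lowest within-cluster sum of squares is kept.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(rows, k, iterations, rng):
    """One start of a Lloyd-style K-means; returns (labels, within-cluster SS)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    # Initial centroids: k distinct rows drawn at random.
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        converged = new_labels == labels
        labels = new_labels
        if converged:
            break
    wcss = sum(squared_distance(row, centroids[lab]) for row, lab in zip(rows, labels))
    return labels, wcss


def best_of_starts(rows, k, starts, iterations, seed):
    """Run several starts from one seeded generator and keep the best one."""
    rng = random.Random(seed)  # the same seed reproduces the same result
    best = None
    for _ in range(starts):
        labels, wcss = lloyd_kmeans(rows, k, iterations, rng)
        if best is None or wcss < best[1]:
            best = (labels, wcss)
    return best


# Toy data: six genes measured in two conditions (made-up values).
toy_rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
toy_labels, toy_wcss = best_of_starts(toy_rows, k=2, starts=5, iterations=20, seed=42)
print(toy_labels, toy_wcss)
```

Increasing the number of starts lowers the chance of keeping a poor local optimum, and fixing the seed makes a run repeatable.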

The input is a .tsv file with "gene-id" in cell A1, the gene IDs in the rest of column A, the sample/condition identifiers in the rest of the first row, and, in every other cell, the expression value for that gene and sample. For a comprehensive guide to formatting your expression data for import into KBase, see the Data Upload/Download Guide.
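As a rough illustration of that layout, the following Python sketch reads a tab-separated table of this shape from any file-like object. It is not the KBase importer: the function name, the gene IDs, and the condition names are hypothetical, and the sketch assumes that every expression cell holds a number.

```python
import csv
import io


def read_expression_tsv(handle):
    """Parse a gene-by-condition expression table from a TSV file handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Cell A1 is expected to hold the literal text "gene-id".
    if header[0] != "gene-id":
        raise ValueError('cell A1 should contain "gene-id"')
    conditions = header[1:]
    matrix = {}
    for line in reader:
        if not line:
            continue  # skip blank lines
        gene_id, values = line[0], line[1:]
        if len(values) != len(conditions):
            raise ValueError("row for %s has the wrong number of values" % gene_id)
        matrix[gene_id] = [float(v) for v in values]
    return conditions, matrix


# Made-up example with two genes and two conditions.
example_tsv = "gene-id\tcond_A\tcond_B\ngeneX\t1.5\t-0.3\ngeneY\t0.0\t2.25\n"
conditions, matrix = read_expression_tsv(io.StringIO(example_tsv))
print(conditions)
print(matrix)
```

Checking the header cell and the row lengths before upload catches the most common layout mistakes, such as a missing identifier column or a ragged row.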

Description of K-means clustering algorithms:

Hartigan-Wong (default): An efficient algorithm with fast initial convergence that optimizes the within-cluster sum of squares.
Lloyd: An algorithm with discrete data distribution that optimizes the total sum of squares; for use on large data sets.
Forgy: An algorithm with continuous data distribution that optimizes the total sum of squares; for use on large data sets.
MacQueen: An algorithm with fast initial convergence that optimizes the total sum of squares.
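For reference, the within-cluster sum of squares mentioned above can be written as follows. The notation is introduced here for illustration and does not come from the App: K is the number of clusters, C_j is the set of expression profiles assigned to cluster j, and mu_j is the mean profile (centroid) of that cluster.

```latex
W = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A smaller W means that the genes within each cluster have expression profiles closer to their cluster mean.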

This App is based on the amap package for R.

NOTE: This App is one of the steps in the Transcriptomics and Expression Analysis Workflow in KBase; however, it can also be run as a standalone App.

Team members who implemented this algorithm in KBase: Paramvir Dehal, Roman Sutormin, Michael Sneddon, Srividya Ramakrishnan, Pavel Novichkov, Keith Keller. For questions, please contact us.

Related Publications

Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163, https://www.nature.com/articles/nbt.4163

App Specification:

https://github.com/kbaseapps/FeatureValues/tree/6cdc50905a08883a53333c073abe1e1df7b3f97f/ui/narrative/methods/expression_toolkit_cluster_k_means

Module Commit: 6cdc50905a08883a53333c073abe1e1df7b3f97f