Contributors: David Kainer [1,2,✝], Matthew Lane [3,✝], Kyle A. Sullivan [1], J. Izaak Miller [1], Mikaela Cashman [1,4], Mallory Morgan [1], Ashley Cliff [3], Jonathon Romero [3], Angelica Walker [3], D. Dakota Blair [5], Hari Chhetri [1], Yongqin Wang [5], Mirko Pavicic [1], Anna Furches [3], Jaclyn Noshay [1], Meghan Drake [1], Natalie Landry [1], AJ Ireland [4], Ali Missaoui [5], Yun Kang [7], John Sedbrook [8], Paramvir Dehal [4], Shane Canon [4], Daniel Jacobson [1,*]
Software DOI: https://doi.org/10.11578/dc.20220607.1
An integral part of systems biology is the study of interactions within and across omics layers, from the genome to the phenome, to understand how interactions between cells and tissues give rise to phenotypes of interest. A genome-wide association study (GWAS) identifies variation in genotypes across a population and associates it with variation in a phenotype in the same population. GWAS alone is limited to the genetic variation present within that population, so it gives a very specific but not comprehensive view. In addition, the statistical thresholds used in GWAS tend to be very stringent in order to screen out false positives. As a result, interesting biology may be missed. We therefore present network biology tools that address these limitations by better filtering GWAS results and filling in gaps using other omics layers.
Random walk with restart (RWR) is a type of graph traversal algorithm. In this context, the graphs being traversed are biological networks in which the nodes are genes and an edge between two genes represents a known biological relationship between them. For example, one network may be constructed from protein-protein interaction data, another from metabolic pathway data, and another from co-expression data. A multiplex network contains one or more biological networks as layers: the topology of each layer is maintained separately from the other layers, but connections are added between copies of the same gene in different layers. RWR explores these networks by "walking" from gene to gene along edges. In a multiplex network, the walker uses all network layers by moving between layers through common nodes (common genes). For more information on multiplex networks or graph traversal methods, see the references below.
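The multiplex structure described above can be sketched as a single "supra-adjacency" matrix. This is a minimal illustration under stated assumptions, not the RWRtools implementation: the function name `build_supra_adjacency`, the `inter_layer_weight` parameter, and the gene names `G1` to `G4` are all hypothetical. Following the description above, a gene is linked to its own copy in another layer only when it appears in both layers.

```python
import numpy as np

def build_supra_adjacency(layers, genes, inter_layer_weight=1.0):
    """Combine per-layer edge lists into one supra-adjacency matrix.

    layers: list of edge lists, each a list of (gene_a, gene_b) tuples.
    genes:  ordered list of every gene name used in any layer.
    Row/column k * len(genes) + i is the copy of gene i in layer k.
    """
    n = len(genes)
    index = {gene: i for i, gene in enumerate(genes)}
    n_layers = len(layers)
    supra = np.zeros((n * n_layers, n * n_layers))
    present = []  # genes that have at least one edge in each layer

    # Intra-layer edges: each layer keeps its own topology.
    for k, edges in enumerate(layers):
        offset = k * n
        in_layer = set()
        for a, b in edges:
            i, j = index[a], index[b]
            supra[offset + i, offset + j] = 1.0
            supra[offset + j, offset + i] = 1.0
            in_layer.update((i, j))
        present.append(in_layer)

    # Inter-layer edges: link the copies of a gene shared by two layers.
    for k in range(n_layers):
        for m in range(k + 1, n_layers):
            for i in present[k] & present[m]:
                supra[k * n + i, m * n + i] = inter_layer_weight
                supra[m * n + i, k * n + i] = inter_layer_weight
    return supra

# Hypothetical two-layer example (gene names are placeholders).
genes = ["G1", "G2", "G3", "G4"]
ppi_layer = [("G1", "G2"), ("G2", "G3")]
coexpression_layer = [("G2", "G4"), ("G1", "G2")]
supra = build_supra_adjacency([ppi_layer, coexpression_layer], genes)
```

In this toy example only `G1` and `G2` occur in both layers, so only their copies are connected across layers; `G3` and `G4` stay confined to a single layer.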
RWRtools provides a set of useful methods based on the graph traversal algorithm "random walk with restart". The "restart" means that the walker occasionally returns to a starting gene (a seed gene), which balances local and global exploration of the network.
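To make the restart idea concrete, here is a minimal sketch of RWR on a single-layer network using the standard iterative update p_next = (1 - r) * W * p + r * p0, where W is the column-normalized adjacency matrix and p0 places equal probability on the seed genes. The function name, the restart probability of 0.7, and the four-node toy network are illustrative assumptions, not RWRtools defaults or its actual code.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed_indices, restart_prob=0.7,
                             tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    restart_prob is the chance, at each step, of jumping back to a seed.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # isolated nodes: avoid dividing by zero
    transition = adjacency / col_sums      # column-normalized transition matrix

    restart = np.zeros(adjacency.shape[0])
    restart[list(seed_indices)] = 1.0 / len(seed_indices)

    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1 - restart_prob) * (transition @ p) + restart_prob * restart
        delta = np.abs(p_next - p).sum()
        p = p_next
        if delta < tol:                    # converged
            break
    return p

# Toy path network 0 - 1 - 2 - 3 with node 0 as the only seed.
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(path, seed_indices=[0])
```

On this toy path the scores decrease with distance from the seed, which is the behavior that lets RWR rank genes by their network proximity to the seed set. A higher restart probability keeps the walker closer to the seeds; a lower one lets it explore more of the network.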
The RWRtools pipeline requires a gene set of interest, which designates the seed genes from which the walker starts. This set is created using the Build FeatureSet from Genome application in KBase. Some RWRtools methods can also take an additional gene set, called a query gene set, that filters the results of the walker. The gene set is chosen based on the context of the user's biological query. In this demo we present an example with shoot biomass and height genes in Arabidopsis thaliana.
Another parameter of the RWRtools applications is the multiplex network, which sets the context for the random walker. RWRtools contains a series of pre-computed Arabidopsis thaliana multiplex networks; details of each multiplex can be found here. If you are interested in using RWRtools for your own organism or networks, please file a feature request ticket on the KBase help board to expand RWRtools.
RWRtools currently provides two main methods, both demonstrated in this narrative: RWRtools CV and RWRtools LOE.
References:
David Kainer, Matthew Lane, Kyle A. Sullivan, J. Izaak Miller, Mikaela Cashman, Mallory Morgan, Ashley Cliff, Jonathon Romero, Angelica Walker, D. Dakota Blair, Hari Chhetri, Yongqin Wang, Mirko Pavicic, Anna Furches, Jaclyn Noshay, Meghan Drake, Natalie Landry, AJ Ireland, Ali Missaoui, Yun Kang, John Sedbrook, Paramvir Dehal, Shane Canon, Daniel Jacobson. RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species.
Goodstein, D. M., Shu, S., Howson, R., Neupane, R., Hayes, R. D., Fazo, J., ... & Rokhsar, D. S. (2012). Phytozome: a comparative platform for green plant genomics. Nucleic acids research, 40(D1), D1178-D1186.
Valdeolivas, A., Tichit, L., Navarro, C., Perrin, S., Odelin, G., Levy, N., ... & Baudot, A. (2019). Random walk with restart on multiplex and heterogeneous biological networks. Bioinformatics, 35(3), 497-505. doi:10.1093/bioinformatics/bty637
This narrative demonstrates RWRtools methods applied to the important bioenergy feedstock Panicum virgatum (Switchgrass) by leveraging networks from the model plant Arabidopsis thaliana.
We used a genome-wide association study (GWAS) to identify single nucleotide polymorphisms (SNPs) associated with Switchgrass shoot biomass. We then assigned these SNPs to 38 Switchgrass genes and, using Phytozome (Goodstein et al., 2012), mapped these GWAS genes to 32 Arabidopsis orthologs. These 32 genes are the starting set of genes (i.e., the seed gene set) for this demo narrative.
The rest of the narrative is as follows:
The RWRtoolkit pipeline begins by importing or adding the genome of interest. For this demo we add an Arabidopsis genome that is already available within KBase. We then create a FeatureSet, a list of genes of interest, to serve as our seed gene set. Our FeatureSet contains the 32 genes associated with shoot biomass according to a reduced GWAS threshold. The resulting FeatureSet is the data object Switchgrass_ShootBiomass_AraOrthologs.
We are now ready to run RWRtools CV, using the shoot biomass FeatureSet created above as our Seed Gene Keys and selecting the Comprehensive Network as our multiplex. Details on all available multiplexes can be found here.
Results
Analysis
Next we build a second gene set, starting from our 32 shoot biomass genes and removing the isolated genes that we suspect are noise. This results in a set of 29 shoot biomass genes that we suspect have a stronger association with the trait of interest. We then re-apply RWRtools CV to assess the interconnectivity of the filtered gene set.
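One common way to assess how well connected a seed set is, and to spot isolated genes, is leave-one-out evaluation: hold out one seed, run RWR from the remaining seeds, and record where the held-out gene ranks among the non-seed genes. The sketch below illustrates that idea on a toy network. It is an assumption about the general approach, not a description of how RWRtools CV is implemented; the function names, the restart probability of 0.7, and the six-node network are hypothetical. It solves for the RWR steady state directly as p = r * (I - (1 - r) * W)^-1 * p0 rather than by iteration.

```python
import numpy as np

def rwr_scores(adjacency, seeds, restart_prob=0.7):
    """Closed-form RWR steady state for a small network."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0              # isolated nodes
    transition = adjacency / col_sums          # column-normalized
    restart = np.zeros(n)
    restart[list(seeds)] = 1.0 / len(seeds)
    system = np.eye(n) - (1 - restart_prob) * transition
    return np.linalg.solve(system, restart_prob * restart)

def leave_one_out_ranks(adjacency, seeds):
    """Rank of each held-out seed among all genes not used as seeds."""
    n = np.asarray(adjacency).shape[0]
    ranks = {}
    for held_out in seeds:
        training = [s for s in seeds if s != held_out]
        scores = rwr_scores(adjacency, training)
        candidates = [i for i in range(n) if i not in training]
        ordered = sorted(candidates, key=lambda i: -scores[i])
        ranks[held_out] = ordered.index(held_out) + 1   # 1 = best rank
    return ranks

# Toy network: genes 0, 1, 2 form a triangle, 2 - 3 - 4 is a tail,
# and gene 5 has no edges at all. Seeds are 0, 1, 2 and the isolated 5.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
toy = np.zeros((6, 6))
for a, b in edges:
    toy[a, b] = toy[b, a] = 1.0

ranks = leave_one_out_ranks(toy, seeds=[0, 1, 2, 5])
```

In this toy example the three connected seeds each rank first when held out, while the isolated seed ranks last among the candidates. A seed that the rest of the set cannot "find" is a candidate for removal as noise, which mirrors the filtering step described above.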
In the resulting network, we now see a high degree of connectivity within this updated list of genes of interest. We can further explore these genes and their surrounding connections.
Now that we have demonstrated, using RWRtools CV, that there is a high degree of connectivity among our genes, we next want to see what relationships exist between the filtered 29 shoot biomass genes and other genes in the genome. To answer this, we run RWRtools LOE and look at the top 200 ranked genes.
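Selecting the top-ranked genes outside the seed set can be sketched as follows. This is an illustration only: the score dictionary, the gene names, and the function `top_ranked_non_seeds` are hypothetical, and the actual output format of RWRtools LOE is not described here.

```python
def top_ranked_non_seeds(scores, seeds, top_n=200):
    """Return up to top_n non-seed genes, highest RWR score first.

    scores: dict mapping gene name to RWR score.
    seeds:  collection of seed gene names to exclude from the ranking.
    Ties are broken alphabetically so the output is deterministic.
    """
    seed_set = set(seeds)
    candidates = [(gene, score) for gene, score in scores.items()
                  if gene not in seed_set]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in candidates[:top_n]]

# Hypothetical scores for four placeholder genes; GENE_A is a seed.
example_scores = {"GENE_A": 0.4, "GENE_B": 0.1, "GENE_C": 0.3, "GENE_D": 0.2}
top_two = top_ranked_non_seeds(example_scores, seeds={"GENE_A"}, top_n=2)
```

Seed genes are excluded because they receive high scores by construction; the genes of interest at this step are the ones the walker reaches from the seeds.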
Results
Sphingolipid/ceramide synthesis
Shoot apical meristem development and homeodomain transcription factor connections
These examples highlight how these networks can be used to explore relationships between genes and discover interesting and meaningful biology. For the use case presented here, the insights derived from Arabidopsis thaliana networks led to the development of a conceptual model of Switchgrass genes contributing to well-watered shoot biomass, a trait that is important for enhanced bioenergy feedstocks.
References:
Ram H, Sahadevan S, Gale N, Caggiano MP, Yu X, Ohno C, et al. An integrated analysis of cell-type specific gene expression reveals genes regulated by REVOLUTA and KANADI1 in the Arabidopsis shoot apical meristem. PLoS Genet. 2020;16:e1008661. doi: 10.1371/journal.pgen.1008661
Zhang N, Zhao B, Fan Z, Yang D, Guo X, Wu Q, et al. Systematic identification of genes associated with plant growth-defense tradeoffs under JA signaling in Arabidopsis. Planta. 2020;251:43. doi: 10.1007/s00425-019-03335-8