A better understanding of the genetic and metabolic mechanisms that confer stress resistance and tolerance in plants is key to engineering new crops through advanced breeding technologies. This requires a systems biology approach that builds on a genome-wide understanding of the regulation of gene expression, plant metabolism, physiology and growth. In this study, we examine the response to drought stress in Sorghum, as we leverage the tools for transcriptomics and plant metabolic modeling we have implemented at the U.S. Department of Energy Systems Biology Knowledgebase (KBase). KBase enables researchers worldwide to collaborate and advance research by allowing them to upload private or public data into the KBase Narrative Interface, empowering them to analyze this data using a rich, extensible array of computational and data-analytics tools, and allows them to securely share scientific workflows and conclusions. We demonstrate how to use the current RNA-seq tools in KBase, applicable to both plants and microbes, to assemble and quantify long transcripts and identify differentially expressed genes effectively. More specifically, we demonstrate the utility of the platform by identifying key genes that are differentially expressed during drought-stress in Sorghum bicolor, which is an important sustainable production crop plant. We then show how to use KBase tools to predict the membership of genes in metabolic pathways and examine expression data in the context of metabolic subsystems. We demonstrate the power of the platform by making the data analysis and interpretation available to the biologists in the reproducible, re-usable, point-and-click format of a KBase Narrative thus promoting FAIR (Findable, Accessible, Interoperable and Reusable) guiding principles for scientific data management and stewardship.
This narrative demonstrates the gene expression profiling and its downstream metabolic mapping analysis in KBase [1] using a subset of the published data set [2] as a use case to investigate the genes and pathways in leaves and roots of Sorghum that are differentially impacted by drought stress.
This narrative targets the short reads dataset from leaf and root samples of a published study [2] of the Sorghum RTx430 genotype after eight weeks of growth under water limitation to simulate drought conditions in the field study. For gene expression profiling, the recently released RTx430 genome from JGI Phytozome is used to guide the transcript assembly [Sorghum bicolor RTx430 v2.1, DOE-JGI, http://phytozome.jgi.doe.gov/]. This genome does not have any functional annotation, therefore, we decided to annotate the genome with plant enzymes prior to running the RNA-seq analysis so that the differentially expressed genes can be characterized based on the currently available enzyme information. Next, the annotated genome is used to reconstruct Sorghum primary metabolism. Finally, the expression abundance is integrated with the annotated enzymes in the metabolic reconstruction to identify and visualize primary metabolic pathways of interest.
This narrative uses a subset of the Sorghum RTx430 reads dataset published by a drought study, EPICON (Epigenetic Control of Drought Response in Sorghum bicolor), funded by the US Department of Energy [2]. Sorghum plant samples of genotype RTx430 were subjected to drought stress and samples were collected 56 days after sowing in the field. From this field experiment, 3rd and 4th leaves and the top 30 cm of the root systems from 10 plants per plot for each sample were taken. The plants were grown in a field (Parlier, CA, USA; 36.6008°N, 119.5109°W) with sandy loam soil containing silky substratum (pH 7.37). Field plots were watered before planting; weekly irrigation started from day 16 [2].The complete data set is available at NCBI and can be accessed by using Project/GEO/SRA accession numbers [Project: PRJNA527782; GEO: GSE128441; SRA: SRP188707]. The published study [2] used the Sorghum reference genome BTx623 for reference guided gene profiling of the raw sequencing reads obtained from Sorghum RTx430 samples.
For this case study, only the data set of week 8 time point (56 days after sowing in the field) from Sorghum genotype RTx430 is selected that includes sequencing reads obtained from leaf and root tissues under two conditions, well-watered (ww) as control and drought-stressed (dr) as pre flowering drought treatment with three replicates per condition. Therefore, a total 12 samples are used for transcriptome analysis (Table S2).
The KBase gene expression profiling and metabolic mapping with primary metabolism workflow is organized in three modules:
Figure 1. A schematic overview of KBase gene expression profiling and metabolic mapping with primary metabolism workflow organized in three modules: Module 1 - import and preparation of genome and sequencing reads; Module 2 - Alignment, assembly and quantification; Module 3 - Metabolic reconstruction, integration and visualization of abundance data.
Module 1 starts with the import of the reference genome and annotation of the reference genome with enzymatic functions of plant primary metabolism using Orthofinder [3]. Next step is import of raw sequencing reads, reads processing using quality check tools such as FastQC [4] and various reads filtering tools such as Trimmomatic [5] and CutAdapt [6].
Module 2 focuses on reference genome guided gene expression profiling based on Tuxedo suite [7,8] and allows users to map mRNA short reads to the annotated genome using HISAT2 [9,10], assemble the transcripts using StringTie [8,11], and identify the counts-based differential gene-expression between drought and well-water control conditions using DESeq2 [12-14]. Note that while gene expression profiling in KBase can be done on any reference genome, here, the PlantSEED/Orthofinder annotation App is run on the genome prior to running the RNA-seq analysis so that the differentially expressed genes can be evaluated in a metabolic context.
Module 3 supports metabolic mapping capabilities and starts with the construction of a primary metabolism model using the annotated genome to identify chemical reactions curated for the metabolic network [15]. The expression abundance generated via Module 2 is readily interoperable with the Module 3 metabolic mapping to characterize the primary metabolic reactions specific to the experimental treatment. Finally, metabolic pathways can be visualized by using Escher Pathway Viewer [16] App, that is adapted in KBase for visualizing metabolic models and related data.
Here are the specific steps to run this use case in this Narrative.
This step imports S. bicolor RTx430 genome as a reference genome serving as the basis for RNA-Seq alignment (S. Bicolor RTx430 genome). The genome was taken from JGI Phytozome V13.
This step annotates the (S. bicolor RTx430 genome) with metabolic enzymes based on PlantSEED models using "Annotate Plant Enzymes with OrthoFinder" App. It results into the annotated genome as an output object "SbiRTx430v2.1_annotated", as given in the Data Pane (Table S3).
This step imports 12 SRA files as SingleEndLibrary Objects from NCBI in the Data Pane using "Import of SRA reads from Web" App.
This step groups all the 12 imported reads SingleEndLibrary Objects based on the condition and tissue type and creates a Sample Set "RTx430_sampleset". This is used as an input for FastQC, HISAT2 Alignment and DESeq2 differential expression to simplify the process.
This step provides the reads quality assessment for each individual sample of "RTx430_sampleset."
This step aligns each SRA reads for individual sample from the the SampleSet "RTx430_sampleset" to the SbiRTx430 v2.1 annotated reference genome using HISAT2 and provides position-level coverage for each sample. Additionally, it generates a merged alignment set object labelled as "RTx430_sampleset_alignment_set" in the Narrative, which can be run in batch mode in the next step of StringTie.
This step generates gene expression abundance for each sample and an expression set object “RTx430_sampleset_gene_expression_set” that wraps multiple sample results into a single expression set for downstream analysis. In addition, it also provides the normalized expression matrix for all samples in both FPKM and TPM formats for easy download from data pane. This App also generates interactive histogram plots for expression abundance in FPKM and TPM. The normalized expression matrix "RTx430_sampleset_TPM_ExpressionMatrix" is also available as a supplementary file for those interested in comparing relative expression across multiple samples (Table S5).
In the next step, to get the average abundances for each gene in each condition averaged across the biological replicates “Create Average Expression Matrix” App is run on the normalized expression matrix “RTx430_sampleset_TPM_ExpressionMatrix”, resulting in “RTx430_sampleset_TPM_ExpressionMatrix_average” expression matrix as an output. This average expression matrix is later used in the workflow to assign reaction level expression scores to study plant primary metabolism.
This step calculates gene-annotation expression-level differences by condition using DESeq2. This step calculates the all-by-all differential expression matrix. “Create Differential Expression Matrix using DESeq2” App is run on the multi-sample expression set “RTx430_sampleset_gene_expression_set” resulting in a “DifferentialExpressionMatrixSet” object “RTx430_DESeq2_all”. This object groups a set the six “DifferentialExpressionMatrix” objects that represent all possible pairwise combinations of four treatments. The complete lists of differential expression analysis of leaf and root samples of SbiRTx430 genome under drought stress are provided as Table S6 and Table S7 respectively.
This step runs “Create up/down regulated FeatureSet and ExpressionMatrix” App to get the significant differential expression of up and down regulated genes across all samples with normalized expression “RTx430_sampleset_TPM_ExpressionMatrix_average” and differential expression “RTx430_DESeq2_all” as input. This App selectes genes that exhibit differential expression (either upregulated or downregulated) for drought vs well-watered conditions based on the given statistical threshold of FDR-adjusted p-value (alpha cut off) and log2 fold change. This App is run separately for leaf and root tissues based on the adjusted p-value below 0.05 and absolute log2 fold change greater than or equal to 1 and 2 respectively.
This step reconstructs genome-scale metabolic networks of plant primary metabolism using RTx430 genome that has previously been annotated with plant enzymes using OrthoFinder. It gives "SbiRTx430v2.1_reconstructed" object as an output in the Data pane.
This step integrates the normalized expression matrix generated by StringTie “RTx430_sampleset_TPM_ExpressionMatrix_average”, with the reconstructed FBA model "SbiRTx430v2.1_reconstructed" generated by the previous step to generate the reaction matrix. The complete list of reaction matrices generated by this study for each experimental condition (drought and well-watered as control) in leaves and roots are given as supplementary tables S8 and S9 respectively.
This step uses “Escher Pathway Viewer” app to displays the highly differentially expressed genes identified by DESeq2 on a global metabolic map of plant metabolism based on the gene-reaction associations in the PlantSEED reconstructed metabolic model.
Sorghum bicolor RTx430 (v2.1) genome was imported directly from the JGI Phytozome site (S. bicolor RTx430 v2.1, DOE-JGI, http://phytozome.jgi.doe.gov/) through globus into KBase staging area and includes the JGI v2.0 assembly of Sorghum bicolor and the JGI v2.1 annotation as “SbicolorRTx430_552_v2.fa.gz” and “SbicolorRTx430_552_v2.1_gene.gff3.gz” files respectively.
Note, this particular genome has no functional annotation information following import because the GFF file that was imported included no functional annotations (only structural annotations). If a GFF file does contain functional annotations, those would have been added to the genome during import. The next step of the workflow is to functionally annotate this genome in KBase before it is subjected to RNA-seq analysis.
SbiRTx430v2.1 genome is annotated with metabolic enzymes using PlantSEED models using "Annotate Plant Enzymes with OrthoFinder" App.
From this table, for the RTx430v2.1 genome, the Plant OrthoFinder app clustered 1,766 protein sequences with 1,419 PlantSEED-curated genes.
This App provides the “SbiRTx430v2.1_annotated” as an output in Data Pane. It provides the table as output with all predicted enzymes in the KBase Genome and the curated Arabidopsis enzymes in the PlantSEED database (Table S3).
Table S2. SRA read files of leaves and roots of RTx430 at 56 days timepoint under control and preflowering drought conditions
This step groups Reads (SingleEndLibraries objects) to RNASeqSampleSet.
Quality assessment of the reads of the "RTx430_sampleset" object using FastQC. This step provides reads quality assessment for each sample.
FastQC generates an output of a comprehensive multi-page report on the composition and quality of reads in HTML format, with one page for each of the reads (e.g. Single End, Paired End: forward, Paired End: reverse). The report can be viewed inside the Narrative or as a new web page that can also be downloaded.
The HTML report includes results from multiple modules that were run by FastQC, and provides a quick assessment of the quality of the results labeled as normal (green checkmark), slightly abnormal (orange triangle), and very unusual (red cross) reads.
Align SRA reads in the RTx430_sampleset object to the SbiRTx430 v2.1 annotated reference genome using HISAT2 by selecting all the default parameters.
The alignment results show a very high percentage of the reads’ mapping (between 91.33% to 99.1%).
Another interesting observation is that a very low percentage of reads (between 2.40% to 3.92%) mapped to multiple locations in the RTx430 genome.
QualiMap generates a comprehensive HTML report as well on the quality of the BAM alignments. The mean mapping quality of all the samples is 58.33.
PCA plot shows that samples can be grouped based on tissue.
The BAM alignment objects can be downloaded to visualize the aligned reads outside KBase using JBrowse and Integrative Genomics Viewer (IGV).
Assemble RNA-seq alignments into transcripts with “Assemble Transcripts using StringTie'' App using "RTx430_sampleset_alignment_set'' object as an input and default values for all other parameters. In KBase, the StringTie App is configured to use known transcript models (following the reference annotation guided assembly process) for the expression analysis.
To find the average abundances for each gene in each condition, the normalized expression matrix is averaged across the biological replicates for each condition.
Perform differential expression calculations using DESeq2. This step calculates gene-annotation expression-level differences by condition.
This step generates the six “DifferentialExpressionMatrix” objects which represent all possible pairwise combinations of four treatments and groups these six objects in a set object "RTx430_DESEq2_all".
Clicking on the DifferentialExpressionMatrixSet object "RTx430_DESeq2_all" displays the set viewer and provides interactive volcano plot for each sample. Volcano plot allows the visual identification of the significant genes based on the given statistical threshold cut off of significance -log10 P value along the y-axis and log2 fold change along the x-axis. In a volcano plot, the genes that pass the P value and fold change thresholds, the most upregulated genes are displayed towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top. On the given threshold cut off, it also provides the list of genes with p- value, q-value, Significance (-log10) and Fold change (log2) under Gene Table tab that can be exported.
In addition, DESeq2 generates the Dispersion plot and PCA plot to help understand the inter-sample expression variability.
The dispersion plot can be seen as a scatter plot with log2 fold change along the y-axis and normalized mean expression along the x-axis. It is essentially a measure of expression variance for a given mean expression for all genes. As expected, the variability in fold changes is relatively higher for lowly expressed genes.
Based on the PCA plot, the first principal component (PC1) accounts for 91% variation among the samples based on tissue (leaf vs root). The second principal component (PC2) accounts for a minor variance of only 6% among the samples based on the treatment (drought vs control).
The complete lists of differential expression analysis of leaf and root samples of SbiRTx430 genome under drought stress are provided as Table S6 and Table S7 respectively.
“Create up/down regulated FeatureSet and ExpressionMatrix” App is used to get the significant differential expression of up and down regulated genes and differential expression matrix based on different fold levels.
We ran this app two times to get the differential expression genes and differential expression matrix of “selected pairwise conditions” of leaves drought versus well-watered control and roots drought versus well-watered control based on the adjusted P-value below 0.05 and log2 fold change above 1 and 2 respectively.
Log2Fold change
Specific Pairwise Conditions
The differential expression analysis reported 86.4% of the total genes are differentially expressed, displaying a massive response to drought. Out of this, a total of 12,375 genes (41.39% of total differentially expressed) show differential expression based on log2-fold change greater than 1 and FDR-adjusted q value of less than 0.05.
With higher statistical threshold cut off (at least a log2-fold change of greater than 2) and with same FDR-adjusted q-value of less than 0.05, there are a total of 5159 genes (17.25% of total differentially expressed) that are significantly affected by the drought stress at this cut off.
Though both tissues (leaves and roots samples) exhibit a widespread response to the drought, we observed that the root samples show a larger number of differentially expressed genes compared to the leaves samples.
We also observed that roots undergo more down-regulation than up-regulation of genes in response to drought for efficiently triaging and distributing natural resources above ground.
We observed that only 177 genes, out of a total of 5,159 significantly differentially expressed genes based on log2 fold change 2 or above, in leaves and roots, are annotated with enzymes that have homology to known biological functions. Therefore, a large number of differentially expressed genes are still completely uncharacterized.
Based on the functionally annotated genes, we found 24 genes are differentially expressed in both leaves and roots. There are 23 genes that are differentially expressed only in leaves and 105 genes differentially expressed only in roots.
Mapping of the gene expression data to individual reactions in a model in the next section provides information on the functionally implicated pathways and associated metabolic reactions and more specifically about the metabolic roles of differentially expressed genes.
For example, the differentially expressed gene SBIRTX430.09G168000, SBIRTX430.03G383900 upregulated in roots and leaves, annotated as Gamma-glutamyl phosphate reductase (EC 1.2.1.41) / Glutamate 5-kinase (EC 2.7.2.11) in cytosol and plastid is involved in Proline metabolism. Another example is SBIRTX430.07G059900, downregulated in both leaves and roots, annotated as Chalcone synthase (EC 2.3.1.74), in nucleus and endoplasm is involved in lignin biosynthesis.
This step reconstructs genome-scale metabolic networks of plant primary metabolism using RTx430V2.1_annotated genome that was annotated with plant enzymes using OrthoFinder earlier.
This step integrates the normalized expression matrix generated by StringTie, with the reconstructed FBA model generated by the previous step to generate the reaction matrix.
SbiRTx430v2.1_reconstructed
Expression Conditions
The app generally applies the "best" gene abundance to a reaction, where there are multiple genes, and where there are genes encoding protein complexes. The underlying concept is that a reaction is only as "active" in its catalysis as allowed by the formation of the enzymes that catalyze the reaction. In turn, the results here can be used as a proxy for the activity of different primary metabolic pathways, and how they may respond to different experimental conditions.
The output as reaction matrices of leaves and roots are given in Supplementary Tables S7 and S8 respectively.
This step displays the highly differentially expressed genes identified by DEseq on a global metabolic map of plant metabolism based on the gene-reaction associations in the PlantSEED reconstructed metabolic model.
An excerpted and modified version of the map is displayed in Figure 5 in the text.
Mapping transcript abundances to a metabolic model provides an opportunity to compare metabolism between plant tissues. In the model, ten CHS synthase genes in S. bicolor RTx430 are linked to the same reaction. However, different chalcone synthase genes drive the reaction expression scores in roots and leaves. In roots, the transcript abundance is highest for an isozyme on chromosome 5; in leaves, it is highest for a copy on chromosome 7.
In addition, the set of lignin biosynthesis reactions showing drought-altered transcription patterns differs between root and leaf samples. Leaf samples have lower transcript abundances for caffeoyl-CoA 3-O-methyltransferase (CCoAOMT) in drought than in control conditions. In contrast, roots have lower transcript abundances for ferulate 5-hydroxylase (F5H) in drought conditions, potentially pointing to different changes in the levels of lignin precursors and in lignin composition between the tissues.
The genes and metabolites linked to these reactions in the model can guide the design of hypothesis tests with targeted expression and metabolite measurements.