This Narrative was created for the Intro to KBase workshop for PAG, to be presented January 10, 2025. It closely follows the KBase Case Study Narrative that was used in the Current Plant Biology publication "A KBase case study on genome-wide transcriptomics and plant primary metabolism in response to drought stress in Sorghum."
This study uses transcriptomic data from a Sorghum bicolor drought study. The original study by Varoquaux et al. subjected the Sorghum RTx430 genotype to 8 weeks of water deprivation to simulate drought conditions.
There are 12 sets of RNA-seq reads used in this analysis. The conditions are well-watered controls (ww) and drought-stressed (dr) from each of the leaves and roots, with 3 replicates for each.
For full details, please see the paper linked above.
Some quick links to references are attached here.
First, upload files to the "staging area," a temporary storage location where raw files are held until they are processed. Files in the staging area are not directly usable by KBase apps. The staging area is purged regularly to remove files older than 90 days.
To get the files read into a usable format, they must be imported. You can import either by selecting the appropriate import type from the dropdown or by using the spreadsheet import specification option.
Either option results in the same import cell. In my experience, CSV upload is faster once you get above 15-20 files.
Once you have verified the files and set the desired output object name, click "Run" to start the import.
This will read the uploaded files and save the contents behind the scenes in the KBase internal representation.
In this case, we will import the Genome using the GFF+FASTA files by uploading them to staging, then importing together.
For reads, the case study this demo is based on used the app Import SRA File as Reads From Web. If your reads are publicly available, such as in NCBI's Sequence Read Archive, this app combines the upload and import steps "under the hood" in a single app run.
In this demo, the reads are copied from the original Narrative.
The GFF provided from JGI does not have functional annotation, only structural. In the app below, we use OrthoFinder to provide the functional annotation.
At the moment, OrthoFinder in KBase is the only app for plant annotation and it does require structural annotations to be provided. We do not currently have an app to perform structural annotation.
The RNA-seq apps in KBase operate starting with an RNA-seq SampleSet. This object links the 12 reads libraries into groups that are operated on together. We'll have 4 condition labels (roots and leaves, well-watered and drought-stressed) each with 3 replicates.
Take care when creating SampleSets. Some apps rely on outputs of other apps which means if you make an error at this stage you may need to redo the entire chain.
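The grouping described above can be sketched in plain Python to check the design before building the SampleSet in the app. This is only an illustration: the condition labels and library names below are hypothetical, and the actual reads object names in the Narrative may differ.

```python
from itertools import product

# Hypothetical labels; the real reads object names in the Narrative may differ.
TISSUES = ["leaf", "root"]
TREATMENTS = ["ww", "dr"]  # well-watered, drought-stressed
REPLICATES = [1, 2, 3]


def build_sample_table():
    """Return {condition_label: [library_name, ...]} for the 2 x 2 x 3 design."""
    table = {}
    for tissue, treatment in product(TISSUES, TREATMENTS):
        condition = f"{tissue}_{treatment}"
        table[condition] = [f"{condition}_rep{r}" for r in REPLICATES]
    return table


sample_table = build_sample_table()
for condition, libraries in sample_table.items():
    print(condition, libraries)
```

Writing the table out like this makes it easy to confirm that each of the 4 condition labels has exactly 3 libraries before any downstream app depends on the SampleSet.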
We can run FastQC to assess the quality of our reads. Since this is a demo dataset with pre-cleaned reads, we already know the quality is good.
If these reads needed to be cleaned, we could use apps like Trimmomatic to process them. Use the "apps using this type as input" filter to quickly find all apps that take reads as input.
The first step of the RNA-seq workflow is to align the reads to the genome with HISAT2.
HISAT2 is normally the longest-running app in this pipeline.
HISAT2 produces one RNASeqAlignment per reads library (12 in this case) as well as 1 ReadsAlignmentSet. If you want to use these outputs in external apps or custom code, you can download the alignments as BAM or SAM files.
The app also produces a QualiMap report which can be viewed in a separate tab or window.
The ReadsAlignmentSet links all the RNASeqAlignments together for the next step.
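If you download an alignment as a SAM file for use in custom code, a first sanity check might be to count how many reads mapped. The sketch below uses only the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The toy records and the function name are made up for illustration and are not output from this Narrative.

```python
def mapping_summary(sam_lines):
    """Count primary records and how many of them are mapped."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary records
        total += 1
        if not flag & 0x4:  # the 0x4 bit marks an unmapped read
            mapped += 1
    return total, mapped


# Toy SAM content (made-up reads, not from the Sorghum data).
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tChr01\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t16\tChr01\t250\t60\t10M\t*\t0\t0\tTTGACCATGA\tIIIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCCAA\tIIIIIIIIII",
    "read4\t256\tChr02\t75\t1\t10M\t*\t0\t0\t*\t*",
]

total, mapped = mapping_summary(toy_sam)
print(f"{mapped} of {total} primary records mapped")
```

For real files, the same function can be passed an open file handle, since it only iterates over lines.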
In this step, we assemble the aligned reads into transcripts using StringTie.
StringTie produces expression objects for each alignment, which can be downloaded as a zip file containing several files. These files are in the format read by Ballgown, a downstream analysis tool, and are described in the Ballgown documentation.
The app also produces two ExpressionMatrices, TPM (transcripts per million) and FPKM (fragments per kilobase of transcript, per million fragments mapped). Both of these are downloadable as Excel/CSV.
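To make the difference between the two units concrete, here is a minimal sketch of the standard textbook definitions computed from fragment counts and transcript lengths. StringTie computes its values internally, so this is not a reimplementation of the app; the counts and lengths are made-up toy numbers.

```python
def fpkm(counts, lengths_bp):
    """FPKM_i = count_i * 1e9 / (length_i * total mapped fragments)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy fragment counts and transcript lengths (not from the study).
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print("FPKM:", fpkm_values)
print("TPM: ", tpm_values)
print("TPM sum:", sum(tpm_values))
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM, whose total varies from sample to sample.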
As with alignment, it will also produce an ExpressionSet which is our input for the next step.
To find the average abundances for each gene in each condition, the normalized expression matrix is averaged across the biological replicates for each condition.
This average expression matrix, “RTx430_sampleset_TPM_ExpressionMatrix_average”, is used in a later step to assign reaction-level expression scores to study plant primary metabolism.
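The averaging step above can be sketched as follows. This is an illustration only, assuming library names end in a hypothetical `_repN` suffix; the gene IDs and TPM values are made up, and the KBase app's internal implementation may differ.

```python
from statistics import mean

# Toy TPM values: gene -> {library: TPM}. Gene IDs and numbers are made up.
tpm_matrix = {
    "gene_A": {
        "leaf_ww_rep1": 10.0, "leaf_ww_rep2": 12.0, "leaf_ww_rep3": 14.0,
        "leaf_dr_rep1": 30.0, "leaf_dr_rep2": 33.0, "leaf_dr_rep3": 36.0,
    },
    "gene_B": {
        "leaf_ww_rep1": 5.0, "leaf_ww_rep2": 5.0, "leaf_ww_rep3": 8.0,
        "leaf_dr_rep1": 1.0, "leaf_dr_rep2": 2.0, "leaf_dr_rep3": 3.0,
    },
}


def condition_of(library):
    """Strip the trailing '_repN' suffix (hypothetical naming scheme)."""
    return library.rsplit("_rep", 1)[0]


def average_by_condition(matrix):
    """Average each gene's values across the replicates of each condition."""
    averaged = {}
    for gene, values in matrix.items():
        groups = {}
        for library, value in values.items():
            groups.setdefault(condition_of(library), []).append(value)
        averaged[gene] = {cond: mean(vals) for cond, vals in groups.items()}
    return averaged


average_matrix = average_by_condition(tpm_matrix)
for gene, row in average_matrix.items():
    print(gene, row)
```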
The output ExpressionSet from StringTie feeds directly into DESeq2 for differential expression.
This app produces a DifferentialExpressionMatrix for each comparison which can be downloaded for further analysis.
This app allows us to subset the expression matrix to only consider features that are up- or down-regulated by a certain amount.
This doesn't do any new analysis but rather filters the existing matrix to a smaller set. It also creates FeatureSets. FeatureSets are groups of genes or other features inside the Genome object. A FeatureSet is used in some apps like BLAST to examine a smaller set of genes more closely.
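The filtering idea can be sketched on a downloaded differential expression table. The thresholds, gene IDs, and values below are hypothetical, and the app's own parameter names and defaults may differ; the point is only that the step selects rows from an existing matrix rather than computing anything new.

```python
# Toy rows: (gene ID, log2 fold change, adjusted p-value). All values are made up.
rows = [
    ("gene_A", 2.5, 0.001),
    ("gene_B", -1.8, 0.020),
    ("gene_C", 0.3, 0.600),
    ("gene_D", 3.1, 0.200),
]


def filter_features(rows, min_abs_log2fc=1.0, max_q=0.05):
    """Split significant rows into up- and down-regulated gene lists."""
    up = [g for g, lfc, q in rows if q <= max_q and lfc >= min_abs_log2fc]
    down = [g for g, lfc, q in rows if q <= max_q and lfc <= -min_abs_log2fc]
    return up, down


up_genes, down_genes = filter_features(rows)
print("up-regulated:", up_genes)
print("down-regulated:", down_genes)
```

The two resulting lists play the role of FeatureSets here: gene_D is excluded despite its large fold change because its adjusted p-value is above the cutoff.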
The Reconstruct Plant Metabolism app allows us to create a plant metabolic model based on the genome annotations performed earlier.
This type of model behaves similarly to bacterial/fungal metabolic models, for which we have detailed tutorials and documentation on our YouTube channel and docs.kbase.us.
This app allows us to map the expression abundances calculated earlier in the expression matrix onto the model constructed above.
Below I've run the app with all 3 drought leaf expression matrices.
This lets us visualize the metabolic model using the Escher Pathway Viewer.
We can combine the model that we produced above with the expression data to show the difference in expression on the map.
Kumari S, Kumar V, Beilsmith K, Seaver SMD, Canon S, Dehal P, et al. A KBase case study on genome-wide transcriptomics and plant primary metabolism in response to drought stress in Sorghum. Current Plant Biology. Elsevier BV; 2021. p. 100229. doi:10.1016/j.cpb.2021.100229
Varoquaux N, Cole B, Gao C, Pierroz G, Baker CR, Patel D, et al. Transcriptomic analysis of field-droughted sorghum from seedling to maturity reveals biotic and metabolic responses. Proceedings of the National Academy of Sciences. Proceedings of the National Academy of Sciences; 2019. pp. 27124–27132. doi:10.1073/pnas.1907500116
Ballgown documentation: https://github.com/alyssafrazee/ballgown