KBase’s expression analysis tools enable researchers to align and assemble RNA-seq reads into transcriptomes, analyze patterns of gene expression and identify differentially expressed genes, and visualize expression data from microarray or RNA-seq platforms. Expression data can also be integrated with KBase’s metabolic modeling tools to compare empirical gene expression data with metabolic models to explore differences in biological behavior and composition.
The RNA-seq analysis workflow in KBase typically consists of (i) mapping short sequence reads to the reference genome, (ii) assembling the aligned reads into full-length transcripts and quantifying their expression, and (iii) analyzing differential gene expression.
KBase provides a suite of Apps that run the tools from the popular Tuxedo RNA-seq suites to produce normalized full and differential expression matrices from Illumina reads aligned to a reference genome. The original Tuxedo suite uses TopHat2, Cufflinks, and Cuffdiff, whereas the new Tuxedo suite uses HISAT2, StringTie, and Ballgown for alignment of the reads to the reference genome, transcriptome profiling, and identification of differentially expressed genes (DEGs), respectively. The RNA-seq Apps in KBase can be combined into multiple workflows, allowing users to select their choice of read aligner and assembler for the differential gene expression analysis (see Figures 1 and 2). Note, however, that Ballgown does not work for prokaryotes because it depends on introns.
Figures 1 and 2 (above): The original and new Tuxedo RNA-seq analysis suites in KBase have modular Apps for building flexible prokaryotic (Figure 1) and eukaryotic (Figure 2) analysis workflows.
Figure 3 (above): KBase RNA-seq analysis workflow using original Tuxedo suite (Bowtie2/TopHat2, Cufflinks, Cuffdiff) and new Tuxedo suite (HISAT2, StringTie, Ballgown).
As of June 2017, KBase has 63 plant genomes comprising 41 different plant species imported from Phytozome, as well as 225 fungal genomes and 26,852 microbial genomes imported from NCBI RefSeq. KBase’s plant reference data includes multiple versions of some genomes to allow users to compare different genome assemblies and annotations and choose the version they want to analyze. KBase also allows users to upload their own genomes for analysis. Sequence files in GFF, FASTA, or GenBank format can be uploaded from the web (via an FTP or HTTP URL) or from a user’s computer. Please see the Data Upload/Download Guide for more information.
KBase provides multiple ways to upload NGS reads and perform quality control (QC). Users can upload single-end or paired-end read files to their KBase account from their computer or from an online site with a publicly available URL (FTP, HTTP, Dropbox, or Box). QC of NGS data is extremely important for meaningful downstream analysis. Currently, KBase provides FastQC for quality checks, and CutAdapt and Trimmomatic for adapter removal and filtering of poor-quality reads. Running these Apps on reads data helps to improve the accuracy of subsequent analyses.
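To make the idea of removing poor-quality bases concrete, here is a minimal Python sketch of trailing-quality trimming on a single read. It is an illustration only and is not the algorithm used by Trimmomatic, CutAdapt, or any KBase App. The function names, the Phred+33 encoding, and the threshold of 20 are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_trailing(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end while their quality is below min_quality.

    Hypothetical helper for illustration; real trimmers offer more
    strategies (sliding windows, adapter matching, length filters).
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_trailing("ACGTACGT", "IIIIII##")
```

In this example the two low-quality bases at the end of the read are removed, leaving a six-base read.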
KBase has incorporated three different alignment algorithms for mapping short reads to the reference genome: Align Reads using Bowtie2, Align Reads using TopHat2, and Align Reads using HISAT2. TopHat2 and HISAT2 are spliced aligners that can identify known and novel exon-exon splice junctions in eukaryotes, whereas Bowtie2 performs only unspliced alignment and is preferred for prokaryotic genomes. The alignment output object generated by these aligners can be downloaded for analysis outside of KBase (e.g., for estimation of mapping quality).
KBase currently has two Apps that assemble transcripts for each dataset separately and estimate gene-level abundance: Assemble Transcripts using Cufflinks and Assemble Transcripts using StringTie. Both Cufflinks and StringTie provide downloadable normalized full expression matrices in FPKM (fragments per kilobase of exon model per million mapped reads) and TPM (transcripts per million) units. The RNA-seq expression object generated by StringTie also provides additional read-count data and a gene-count matrix that are used by Ballgown for detecting differential gene expression.
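The two normalization units can be illustrated with a short Python sketch that follows the standard definitions of FPKM and TPM. It is not the code used by Cufflinks or StringTie. The gene names, counts, and lengths are made-up values, and the total number of mapped fragments is assumed to equal the sum of the per-gene counts.

```python
def fpkm(counts, lengths):
    """FPKM = fragments / (gene length in kb) / (total mapped fragments in millions)."""
    total = sum(counts.values())  # assumption: all mapped fragments fall in these genes
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def tpm(counts, lengths):
    """TPM = per-base fragment rate of a gene, rescaled so all genes sum to one million."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rates.values())
    return {g: rates[g] / scale * 1e6 for g in counts}


# Hypothetical fragment counts and gene lengths (in bases).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values always sum to one million within a sample, which makes proportions easier to compare across samples than FPKM values, whose sum varies from sample to sample.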
Differential expression analysis is the most typical application of RNA-seq. It can be used to identify differential gene expression signatures between conditions; to identify differences between tissues, conditions, and genetic backgrounds; and to identify molecular markers. KBase provides several differential gene expression analysis tools. Create Differential Expression Matrix using Cuffdiff and Create Differential Expression Matrix using Ballgown take the genes and expression levels from Cufflinks and StringTie, respectively, and apply rigorous statistical methods (q-value and fold change) to determine which genes are differentially expressed between two or more experimental conditions. These Apps generate a differential gene expression matrix based on user-specified threshold cutoff parameters and also generate static plots that help visualize the results. In addition, the Interactive Volcano Plot visualization App allows users to explore q-value and fold change cutoffs, which helps fine-tune the threshold values supplied as input parameters to the differential expression analysis Apps.
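The way q-value and fold change cutoffs select genes can be sketched in a few lines of Python. This is a simplified illustration, not the logic of Cuffdiff, Ballgown, or the KBase Apps. The function name, the example rows, and the default thresholds (absolute log2 fold change of at least 1, q-value of at most 0.05) are assumptions for the example.

```python
import math


def is_differentially_expressed(fold_change, q_value,
                                min_abs_log2fc=1.0, max_q=0.05):
    """Return True if a gene passes both the fold change and q-value cutoffs.

    Hypothetical helper; thresholds are illustrative, not KBase defaults.
    """
    log2fc = math.log2(fold_change)
    return abs(log2fc) >= min_abs_log2fc and q_value <= max_q


# Hypothetical results: (gene id, fold change between conditions, q-value).
results = [
    ("gene1", 4.0, 0.001),   # strongly up, significant
    ("gene2", 1.2, 0.0001),  # significant, but the change is small
    ("gene3", 0.2, 0.03),    # strongly down, significant
    ("gene4", 8.0, 0.2),     # large change, not significant
]

selected = [gene for gene, fc, q in results
            if is_differentially_expressed(fc, q)]
# selected == ["gene1", "gene3"]
```

Requiring both conditions corresponds to keeping only the upper-left and upper-right regions of a volcano plot, where genes show both a large change and statistical support.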
The expression matrix or differential expression matrix generated by the Create Differential Expression Matrix using Cuffdiff and Create Differential Expression Matrix using Ballgown Apps can be used in downstream analysis to find patterns of gene expression. Clustering Apps such as Cluster Expression Data – Hierarchical, Cluster Expression Data – K-Means, and Cluster Expression Data – WGCNA group the expression data using different clustering algorithms. The clusters generated by these Apps can be viewed as a heat map using the Interactive View HeatMap App.
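As a rough illustration of how clustering groups genes with similar expression profiles, the following Python sketch runs a basic k-means on a tiny, made-up expression matrix. It is not the implementation behind the KBase clustering Apps. The gene names, expression values, and the choice of fixed starting centroids (used here to keep the result deterministic) are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, centroids, max_iterations=10):
    """Basic k-means: assign genes to the nearest centroid, then update centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for gene, profile in profiles.items():
            distances = [squared_distance(profile, c) for c in centroids]
            clusters[distances.index(min(distances))].append(gene)
        updated = []
        for members, old in zip(clusters, centroids):
            if not members:
                updated.append(old)  # keep an empty cluster's centroid in place
                continue
            updated.append([
                sum(profiles[g][i] for g in members) / len(members)
                for i in range(len(old))
            ])
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return clusters


# Hypothetical expression values across four samples.
profiles = {
    "g1": [1.0, 1.2, 5.0, 5.1],
    "g2": [0.9, 1.1, 4.8, 5.3],
    "g3": [5.0, 5.2, 1.0, 0.9],
    "g4": [4.9, 5.1, 1.1, 1.0],
}
start = [list(profiles["g1"]), list(profiles["g3"])]
clusters = kmeans(profiles, start)
# clusters == [["g1", "g2"], ["g3", "g4"]]
```

Genes g1 and g2 rise across the samples while g3 and g4 fall, so they end up in separate clusters. In a heat map, each cluster would appear as a block of rows with a similar color pattern.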
Differential gene identifiers
Filter expression matrix