The KBase RNA-seq Service provides a number of data analysis tools (Apps) based on the original and new Tuxedo suites of RNA-seq tools. The original Tuxedo suite consists of Bowtie2 and TopHat2 to align the reads, Cufflinks to assemble the transcripts, Cuffdiff to identify the differentially expressed genes, and CummeRbund to visualize the differentially expressed genes as 2D plots and heatmaps [1,2,3]. The new Tuxedo suite uses HISAT2 instead of Bowtie2/TopHat2 to align reads and StringTie instead of Cufflinks to assemble the transcripts [4,5]. Both HISAT2 and StringTie are a significant improvement over their predecessors: they are fast and efficient, with low memory footprints [4,5]. Additionally, the RNA-seq Apps in KBase are modular, so you can choose which read aligner and which assembler to use for the differential gene expression analysis.
You can copy these tutorials and re-run any of the steps (perhaps changing parameters or using your own data) in your KBase account. You will need a KBase account in order to view and copy the Narratives.
The figure below shows how the RNA-seq tools from the Original and New Tuxedo suites can be chained together in a workflow in KBase to study differential gene expression. Each step in this workflow is then described in detail.
The Differential Gene Expression Workflow using the RNA-seq Apps in KBase
Step 1: Import RNA-seq Data into the Narrative
The RNA-seq pipeline starts with importing high-throughput reads obtained from Illumina or SOLiD sequencing platforms, together with the corresponding reference genome, into the Narrative interface.
1.1 Short reads: The reads must be a set of single-end or paired-end reads in FASTA or FASTQ format. Use the Bulk Uploader to import the short reads into your KBase account. If you don’t have your own reads, you can try out the RNA-seq tools by selecting a set of example reads (trimmed down in size) from the Public tab in the Data Browser and adding them to your Narrative (see http://kbase.us/narrative-guide/add-data-to-your-narrative/).
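Before uploading your own reads, it can help to confirm that a FASTQ file is well formed. The following is a minimal sketch in Python, not part of KBase: it assumes the common four-line FASTQ layout (header, sequence, separator, qualities) with no wrapped sequence lines and no trailing blank lines, and the function name and example reads are hypothetical.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, checking the 4-line layout.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {count + 1}")
        if len(seq) != len(qual):
            raise ValueError(
                f"sequence/quality length mismatch in record {count + 1}"
            )
        count += 1
    return count


# Hypothetical two-read example; a real file would be opened with open(path).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
n_records = count_fastq_records(example)
print(n_records)
```

For paired-end data, running the same check on the forward and reverse files and comparing the two counts is a quick way to catch a truncated upload.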
1.2 Genomes: Import the appropriate reference genome from the Public tab in the Data Browser. If you intend to use the Bowtie2 or TopHat2 aligner, you must first run the Build Bowtie2 Index App to index the genome. The HISAT2 aligner needs only the reference genome itself (no separate genome indexing step).
Step 2: Create RNA-seq Sample Set
This App allows you to associate the experiment metadata with the input sequence reads and generate the RNA-seq Sample Set object for a set of samples, which is required by the next step.
Step 3: Align Reads to the Reference Genome using Bowtie2/TopHat2/HISAT2
KBase provides three different Apps that can be used to align the reads in the RNA-seq Sample Set to a prokaryotic or eukaryotic genome. Depending on your experiment, you can use one or more of these Apps (Bowtie2/TopHat2/HISAT2) to align the reads to the reference genome and then compare the alignment results. The Bowtie2 and TopHat2 Apps need a Bowtie2-indexed genome to generate the read alignments, whereas HISAT2 uses only the reference genome. HISAT2 is faster and more sensitive than Bowtie2/TopHat2 and also uses less memory.
NOTE: Although each of these Apps is one of the sequential steps in the KBase RNA-seq Pipeline, it can also be run as a standalone analysis tool for one or more RNA-seq samples.
Step 4: Assemble Transcripts with Cufflinks/StringTie
KBase provides two different Apps that can be used to assemble the alignments into a parsimonious set of transcripts. The RNASeqAlignmentSet obtained from any one of the Bowtie2/TopHat2/HISAT2 Apps can be used as input to either the Cufflinks or the StringTie App to generate GTF and FPKM files. These files are subsequently wrapped as an RNAseqExpression object in KBase for each individual sample and as an RNASeqExpressionSet object for the whole SampleSet. These Apps also generate fully normalized FPKM/TPM ExpressionMatrix objects that can be downloaded or used as input to downstream analysis tools.
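To make the FPKM and TPM units concrete, here is a minimal Python sketch of the textbook definitions. It is an illustration only: Cufflinks and StringTie estimate abundances with their own statistical models (for example when assigning fragments to overlapping transcripts), so their values will not match this simple calculation exactly. The gene names, fragment counts and lengths below are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    total = sum(counts.values())
    # count / (length_in_kb * total_in_millions) == count * 1e9 / (length_bp * total)
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalized rates rescaled to sum to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}


# Hypothetical fragment counts and transcript lengths (in base pairs).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 1500, "geneC": 2000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
for gene in counts:
    print(gene, round(fpkm_values[gene], 1), round(tpm_values[gene], 1))
```

Note that TPM values always sum to one million within a sample, which makes proportions comparable across samples, whereas the sum of FPKM values varies from sample to sample.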
NOTE: Due to the modular nature of these Apps, KBase provides four different options to run this step. Based on your interest, you can choose any one of the following options:
Step 5: Identify Differential Expression using Cuffdiff
This App uses the RNASeqExpressionSet data object obtained from either the Cufflinks or StringTie App to calculate gene and transcript expression levels across two or more conditions and identify significant changes in expression. Cuffdiff calculates the FPKM value of each transcript, primary transcript and gene in each sample and produces a number of output files, which are zipped together and saved as an RNASeqDifferentialExpression data object.
NOTE: Steps 6-9 below take Cuffdiff output as input and generate plots and/or expression matrices.
Step 6: View CummeRbund Plots
This App takes Cuffdiff output as input and generates a number of plots for the exploration, analysis and visualization of high-throughput RNA-seq data.
Step 7: View Interactive Volcano Plot
This App generates an interactive Volcano Plot (2D scatter plot) to show the list of differentially expressed genes based on the fold change and p value.
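The two axes of a volcano plot can be sketched as follows. This Python example is an illustration under stated assumptions, not the App's implementation: the pseudocount used to avoid division by zero, the function name and the example values are all hypothetical, and the App itself may take the fold change directly from the Cuffdiff output.

```python
import math


def volcano_point(mean_a, mean_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount is an assumption made here so that genes with zero
    expression in one condition do not cause a division by zero.
    p_value must be greater than 0.
    """
    log2_fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


# Hypothetical gene: expression 7 in condition A, 31 in condition B, p = 0.001.
x, y = volcano_point(7.0, 31.0, 0.001)
print(x, y)
```

Genes far to the left or right have large fold changes, and genes high on the plot have small p-values, so the most interesting candidates sit in the upper-left and upper-right corners.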
Step 8: Create Expression Matrix from Cuffdiff
This App creates an expression matrix based on the data obtained from the Cuffdiff App. The advanced options let you select the matrix transformation used for normalization and filter the resulting matrix by alpha cutoff, fold change, and number of genes.
Step 9: View differentially expressed genes from Cuffdiff in HeatMap
This App compares a pair of conditions in RNA-seq expression data to identify differentially expressed genes and display them in an interactive heatmap. It takes the output of the Cuffdiff differential expression analysis as input and creates a heatmap of differentially expressed genes that can be filtered by alpha cutoff, fold change, and number of genes.
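The three filters mentioned above (alpha cutoff, fold change, and number of genes) can be illustrated with a short Python sketch. This is a hedged example rather than the App's actual logic: the record fields, the default thresholds, the use of an adjusted p-value (q-value) for the alpha cutoff, and the ranking by absolute fold change are all assumptions made for illustration.

```python
def select_genes(results, alpha=0.05, min_abs_log2_fc=1.0, max_genes=None):
    """Keep significant genes with a large enough fold change.

    results: list of dicts with hypothetical keys "gene", "log2_fc", "q_value".
    """
    kept = [
        r for r in results
        if r["q_value"] <= alpha and abs(r["log2_fc"]) >= min_abs_log2_fc
    ]
    # Rank by absolute fold change, largest first (an assumed ordering).
    kept.sort(key=lambda r: abs(r["log2_fc"]), reverse=True)
    return kept if max_genes is None else kept[:max_genes]


# Hypothetical differential expression results.
results = [
    {"gene": "g1", "log2_fc": 3.2, "q_value": 0.001},
    {"gene": "g2", "log2_fc": -2.5, "q_value": 0.01},
    {"gene": "g3", "log2_fc": 0.4, "q_value": 0.0005},  # fold change too small
    {"gene": "g4", "log2_fc": 1.8, "q_value": 0.2},     # not significant
    {"gene": "g5", "log2_fc": -1.2, "q_value": 0.04},
]

top = select_genes(results, max_genes=2)
print([r["gene"] for r in top])
```

Tightening the alpha cutoff or raising the fold-change threshold shrinks the gene list, while the gene-number limit simply caps how many of the top-ranked genes are shown.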
The expression matrix generated by the RNA-seq workflow can be used in downstream analysis by other Apps in KBase. For example, you can analyze patterns of gene expression by grouping expression data with clustering algorithms such as Hierarchical, K-means and WGCNA. The matrix can also be used in the metabolic modeling Apps in KBase to compare reaction flux with gene expression and identify the pathways where expression and flux agree or conflict.
[1] Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105-1111. http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract
[2] Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7(3):562-578. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/
[3] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14:R36. http://www.genomebiology.com/2013/14/4/R36/abstract
[4] Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology. http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html
[5] Pertea M, Kim D, Pertea G, Leek JT, Salzberg SL (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie, and Ballgown. Nature Protocols 11:1650–1667. http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html