In KBase, reads from FASTQ and SRA files can be imported to create reads library data objects. The objects will either be a SingleEndLibrary or a PairedEndLibrary. The tools in KBase can then be used to assemble reads into an “Assembly” data object or to align reads to an “Assembly”. After uploading and importing reads data, you may want to refer to the documentation about Assembly and Annotation. Reads can also be used in RNA-seq and expression analysis.
Single-end and paired-end reads can be uploaded in FASTQ or SRA format. For FASTQ files, please ensure that your filename ends with the .fastq, .fnq, or .fq file extension. SRA files should have an extension of .sra. The uploader also accepts compressed files in these formats: .zip, .gz, .bz2, .tar.gz, .tar.bz2.
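Before uploading, it can help to check file names against the extensions listed above. The following is a minimal sketch, not part of KBase: the function name is hypothetical, the comparison is case-sensitive, and it assumes a compressed file keeps its reads extension before the compression suffix (for example reads.fastq.gz).

```python
# Extensions taken from the paragraph above; everything else is illustrative.
READ_EXTENSIONS = (".fastq", ".fnq", ".fq", ".sra")
# Longer suffixes come first so ".tar.gz" is tried before ".gz".
COMPRESSION_EXTENSIONS = (".tar.gz", ".tar.bz2", ".zip", ".gz", ".bz2")


def has_accepted_reads_extension(filename: str) -> bool:
    """Return True if the name ends in a reads extension, optionally compressed.

    Hypothetical helper: it only inspects the file name, not the contents.
    """
    name = filename
    for suffix in COMPRESSION_EXTENSIONS:
        if name.endswith(suffix):
            # Strip one compression suffix, then check what remains.
            name = name[: -len(suffix)]
            break
    return name.endswith(READ_EXTENSIONS)
```

For example, has_accepted_reads_extension("SRR228087.sra") and has_accepted_reads_extension("reads.fq.gz") both return True, while a name such as "reads.txt" returns False.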
Files can be uploaded into your KBase staging area from your local computer or directly from a publicly accessible FTP or HTTP URL.
For this example, we will assume that you have a local copy of the RNA transcripts of the sample SRR228087 from GenBank. This is a single-end library from Illumina sequencing. Instructions for obtaining a local copy of data from the GenBank SRA with the SRA Toolkit (sratoolkit) are available here and here. Methods for obtaining data from other providers will vary.
Once the file is on your computer, open the new Import tab in the Data Slideout and drag the single-end library into your Staging area.
Open the pulldown menu to the right of the filename in your staging area and select “FASTQ Reads”:
Now click the import icon (up arrow) to the right of “FASTQ Reads”. The data slideout will close and an app called “Import FASTQ/SRA File as Reads from Staging Area” will be added to your Narrative.
Notice that the name of the FASTA/FASTQ file is already filled in, as is a suggested name for the Reads object that will be created by the import (you can change that if you like). Adjust the Sequencing Technology or any of the advanced options if needed. If this had been a metagenomic sample, we would uncheck the box next to Single Genome. When ready, click the green Run button to start the import. When the import is finished, your Data Panel will update to show the new SingleEndLibrary object, and a report will appear in the import app cell.
There are two ways that KBase and GenBank SRA recognize a paired-end library. In the legacy format, a paired-end library consists of two files that typically share the same base name and differ only in a _1 or _2 suffix, for example ERR760546_1.fastq and ERR760546_2.fastq. The other recognized format is called Interleaved: a single file in which forward and reverse reads alternate, so each read pair occupies 8 lines (two 4-line FASTQ records). The example above was imported as a SingleEndLibrary object because there was a single input file and the Interleaved box was unchecked.
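To make the relationship between the two formats concrete, here is a minimal sketch that turns the lines of a 2-file library into the interleaved 8-line layout. It is an illustration only, not how KBase performs the import. The function names are hypothetical, and it assumes every FASTQ record is exactly 4 lines (no wrapped sequences) and that both files list the reads in the same order.

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records (assumes unwrapped sequences)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(forward: Iterable[str], reverse: Iterable[str]) -> Iterator[str]:
    """Yield lines alternating one forward record, then its reverse mate."""
    pairs = zip_longest(fastq_records(forward), fastq_records(reverse))
    for fwd, rev in pairs:
        if fwd is None or rev is None:
            raise ValueError("forward and reverse files have different read counts")
        # Each pair contributes 8 lines: 4 forward, then 4 reverse.
        yield from fwd
        yield from rev
```

Passing two open text files (for example the _1 and _2 files of a pair) yields the interleaved lines in order, which could then be written to a single output file.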
In this example, we will upload and import a paired-end library for ERR760546 in the 2-file legacy format. Open the new Import tab in the Data Slideout and drag the two files into your Staging area.
Open the pulldown menu to the right of the filename in your Staging Area and select “FASTQ Reads” for the first file in the pair. Then click the import icon (up arrow) to the right of “FASTQ Reads”. The data slideout will close and an app called “Import FASTQ/SRA File as Reads from Staging Area” will be added.
Notice that the name of the FASTA/FASTQ file is already filled in, as is a suggested name for the Reads object that will be created by the import (you can change that if you like).
You now need to fill in the name of the second file. In the line for “Reverse/Right FASTA/FASTQ File Path”, type in the name of the second file. There is no pulldown list or other help to get the name of the file right. Luckily, the name is usually a slight variation of the first name.
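Because the second name has to be typed by hand, one way to avoid typos is to derive it from the first name and paste the result. The sketch below is a hypothetical helper, not a KBase feature. It assumes the legacy naming convention described earlier, where the forward file ends in _1 followed by its extension.

```python
import re


def reverse_file_name(forward_name: str) -> str:
    """Derive the _2 file name from a _1 file name (hypothetical helper).

    Assumes the forward file is named <base>_1.<extension>, as in
    ERR760546_1.fastq; raises ValueError for any other pattern.
    """
    # Match "_1" immediately followed by the extension(s) at the end of the name.
    new_name, count = re.subn(r"_1(\.[^/]*)$", r"_2\1", forward_name)
    if count != 1:
        raise ValueError(f"no _1 suffix found in {forward_name!r}")
    return new_name
```

For example, reverse_file_name("ERR760546_1.fastq") returns "ERR760546_2.fastq". Files that follow a different convention (such as R1/R2) would need a different pattern.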
As with the single-end library example, you can adjust the Sequencing Technology or any of the advanced options if needed. If this had been a single interleaved paired-end file, we would have checked the box to the right of Interleaved.
When ready, click the green Run button to start the import. When the import is finished, your Data Panel will update to show the new PairedEndLibrary object, and a report will appear in the import app cell.
If you get an error because you had a typo in the name of the second file, it is easy to correct. Click the Reset button in the app cell to allow you to change fields in the app, make the correction, and click the green Run button again.
The Reads import can handle gzipped (.gz) input files. However, .zip files require special handling and .Z files are not yet supported by the importers (we are working on adding that). You can upload a zip file to your Staging Area, but you need to click the “uncompress” button to its left (the one with the diagonal arrows) to unzip it before trying to import it.
In the Staging area, beneath the box for Drag and Drop, there are other options for adding data to your staging area. You can import reads into KBase using Globus Online, or by supplying a URL for a publicly accessible FTP location, Google Drive, Dropbox, or a direct HTTP link.
There is also an icon with two arrows in a circle that will refresh the list of files that have been uploaded to your staging area.
If your reads are in a publicly accessible URL, you can bypass the Staging area and directly import reads into your Narrative using one of these three apps (which you can find in the Apps panel or the App Catalog):
(Note that these app names may change in the near future to be more consistent.)