Classify Taxonomy of Metagenomic Reads with Kaiju - v1.9.0

Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.

This App makes the tool Kaiju (fast and sensitive taxonomic classification for metagenomics) available through KBase. Kaiju was written by Peter Menzel and Anders Krogh at the Bioinformatics Centre, a part of the Section for Computational and RNA Biology at the University of Copenhagen.

From the Kaiju homepage:

Kaiju is a program for sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments.

Each sequencing read is assigned to a taxon in the NCBI taxonomy by comparing it to a reference database containing microbial and viral protein sequences. By using protein-level classification, Kaiju achieves a higher sensitivity compared with methods based on nucleotide comparison.

Kaiju can use either the set of available complete genomes from NCBI RefSeq or the microbial subset of the NCBI BLAST non-redundant protein database nr, optionally also including fungi and microbial eukaryotes.

Reads are translated into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform, which finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment.

The search can process up to millions of reads per minute using, for example, only 10 GB RAM with a reference database comprising 4821 complete microbial genomes.
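
As a rough illustration of the translate-then-match idea described above, the following Python sketch translates a read in all six reading frames, splits the translations at stop codons, and finds the longest exact match against a protein string with a naive substring scan. It is not Kaiju's implementation: Kaiju uses a backward search over a Burrows-Wheeler transform index, and the function names, the splitting step as written here, and any sequences passed in are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only are swapped)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Translate a read in three forward and three reverse-complement frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def protein_fragments(read, min_length=1):
    """Split each translated frame at stop codons ('*') into fragments.

    min_length is a hypothetical filter for this sketch, not a Kaiju option.
    """
    fragments = []
    for frame in six_frame_translations(read):
        fragments.extend(f for f in frame.split("*") if len(f) >= min_length)
    return fragments


def longest_exact_match(fragment, reference):
    """Length of the longest substring of `fragment` found in `reference`.

    Naive scan for illustration only; an index-based search is far faster.
    """
    best = 0
    for i in range(len(fragment)):
        j = i + best + 1
        # Only try to beat the current best; extend while the match holds.
        while j <= len(fragment) and fragment[i:j] in reference:
            best = j - i
            j += 1
    return best
```

For example, the reads ATGGCCAAA and ATGGCTAAG differ at two nucleotide positions, yet both translate to MAK in the first forward frame, which is the intuition behind the higher sensitivity of protein-level comparison.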

Kaiju offers several reference databases for classification, which are downloaded from the Kaiju webserver page (last updated early 2022). The databases are:

RefSeq Complete Genomes: protein sequences from completely assembled bacterial, archaeal, and viral genomes from NCBI RefSeq. Updated: 23-Mar-2022.
proGenomes: protein sequences from a representative set of genomes derived from NCBI RefSeq bacterial, archaeal, and viral genomes. Updated: 02-Mar-2021.
NCBI BLAST nr: protein sequences from nr: Bacteria, Archaea, and Viruses. Updated: 10-Mar-2022.
NCBI BLAST nr+euk: protein sequences from nr: Bacteria, Archaea, Viruses, Fungi and microbial eukaryotes. Updated: 10-Mar-2022.
Viruses: protein sequences from a representative set of viral genomes. Updated: 29-Mar-2022.
Plasmids: protein sequences from a representative set of plasmids. Updated: 10-Apr-2022.
Reference Viral DataBase (RVDB): protein sequences from the Reference Viral Database (RVDB). Updated: 07-Apr-2022.
Fungi: protein sequences from a representative set of fungal genomes. Updated: 29-Mar-2022.

Subsampling

Large datasets can take a long time to process, and there are situations where it is worth the wait. Sometimes, however, users just want to see how the App works or only need the higher taxonomic levels. At the higher taxonomic levels, the results are just as good when you run against a small fraction of the data, and the run is much faster. For this reason, the ability to randomly subsample reads was added as a preprocessing step before running the Kaiju App. It can greatly speed up the App when it is being tested or used only for high taxonomic levels. See Randomly Subsample Reads for more information on the subsampling process.
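
To make the idea of random subsampling concrete, here is a minimal Python sketch that groups FASTQ lines into four-line records and draws a fixed-size random sample with reservoir sampling. It is only an illustration under stated assumptions: it is not the code used by the Randomly Subsample Reads App, the function names and the seed parameter are hypothetical, and it assumes well-formed four-line FASTQ records.

```python
import random


def read_fastq_records(lines):
    """Group an iterable of FASTQ lines into (header, sequence, plus, quality) tuples."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in one pass (Algorithm R).

    If fewer than k records are available, all of them are returned.
    The fixed seed makes the sketch reproducible; it is an assumption here.
    """
    rng = random.Random(seed)
    sample = []
    for n, rec in enumerate(records):
        if n < k:
            sample.append(rec)
        else:
            # Replace a reservoir slot with probability k / (n + 1).
            j = rng.randint(0, n)
            if j < k:
                sample[j] = rec
    return sample
```

Reservoir sampling is used here because it needs only one pass and does not require knowing the number of reads in advance. For paired-end libraries, the two mates of a pair would need to be sampled together, which this sketch does not handle.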

Notes

The summary of taxon abundances is now provided in two forms: a table with the long tail of low-abundance taxa collapsed into one grouping, and a table file with the long tail unmerged. The unmerged file is available in the kaiju_summaries.zip downloadable archive and is identified by "-longtail" added to the summary file name. Plots are still generated from the collapsed table to limit excessive taxa display and avoid breaking the plot.
Kaiju v1.9.0 updates: the default mode is now greedy. If you wish to run "-a mem" mode this must be specified (but you cannot specify "-a greedy" on the command line). Default values for thresholds have changed for greedy mode to -e max_mismatches=3, -s min_match_score=65, and -E max_e-value=0.01.
Kaiju v1.7.2 updates: flags -i and -j are now required to run the kaiju binary; kaijuReport has been renamed to kaiju2table.
Krona Snapshots: You may not be able to take a snapshot of the Krona plot. This is a known issue with Krona for some versions of Chrome and Firefox on Windows 7 and 10. To work around it, we suggest trying a different browser.
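
The long-tail collapsing mentioned in the first note above can be sketched in Python as follows. This is a minimal sketch, not the App's code: the 1.0 percent cutoff, the label for the merged group, and the function name are all hypothetical, since the actual threshold used by the App is not stated here.

```python
def collapse_long_tail(abundances, min_percent=1.0, tail_label="other (long tail)"):
    """Merge taxa below `min_percent` into a single grouping.

    `abundances` maps taxon name to percent of classified reads.
    The cutoff and the label are assumptions made for this illustration.
    """
    kept = {taxon: pct for taxon, pct in abundances.items() if pct >= min_percent}
    tail_total = sum(pct for pct in abundances.values() if pct < min_percent)
    if tail_total > 0:
        kept[tail_label] = tail_total
    return kept
```

Keeping the unmerged table alongside the collapsed one, as the App does, preserves the low-abundance taxa for downstream analysis while the plots stay readable.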

Team members who wrapped the app for KBase: Dylan Chivian (lead), Sean Jungbluth. For questions, please use the Help Board.

Related Publications

Chivian D, et al. Metagenome-assembled genome extraction and analysis from microbiomes using KBase. Nat Protoc. 2023 Jan;18(1):208-238. doi:10.1038/s41596-022-00747-x, https://pubmed.ncbi.nlm.nih.gov/36376589
Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7: 11257. doi:10.1038/ncomms11257, http://www.ncbi.nlm.nih.gov/pubmed/27071849
Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385, http://www.ncbi.nlm.nih.gov/pubmed/21961884
Kaiju Homepage: http://kaiju.binf.ku.dk/
Kaiju DBs from: http://kaiju.binf.ku.dk/server
Github for Kaiju: https://github.com/bioinformatics-centre/kaiju
Krona homepage: https://github.com/marbl/Krona/wiki
Github for Krona: https://github.com/marbl/Krona

App Specification:

https://github.com/kbaseapps/kb_kaiju/tree/83aa257ccd7d0391e118c6f41f6410319954c376/ui/narrative/methods/run_kaiju

Module Commit: 83aa257ccd7d0391e118c6f41f6410319954c376