BFC - Bloom Filter Read Error Correction

Error correction for short Illumina reads

This is a KBase wrapper for the BFC Illumina short read error correction tool.

BFC (Bloom Filter Correction) is a fast, easy-to-use error correction (EC) tool for sequencing errors in Illumina short-read data. It uses a non-greedy algorithm, yet runs at a speed comparable to implementations based on greedy methods. A greedy algorithm applies each correction based on the local sequence context and never reverts a correction once it has been made, so it is less accurate on repeat-rich genomes, such as diploid genomes. In evaluations on real data, BFC appears to correct more errors with fewer overcorrections than existing tools. It does particularly well at suppressing systematic sequencing errors, which helps improve the base accuracy of de novo assemblies.

The BFC algorithm is a variant of the classical spectrum alignment algorithm. Given a read, it uses an exhaustive search to find an optimal k-mer path through the read. First, it finds the longest substring of the read on which every k-mer is a trusted k-mer. It then extends that substring toward both ends of the read. If a read does not contain any trusted k-mers, BFC iterates through all k-mers that are one mismatch away from the first k-mer of the read to find a trusted k-mer. In this way it finds a trusted k-mer path through the read that minimizes the algorithm's heuristic objective function, which jointly considers penalties on correction, base quality, and k-mer support. The read is marked as uncorrectable if no trusted k-mer, or more than one, is found in this way. BFC is a high-performance standalone tool for correcting sequencing errors in Illumina sequencing data.
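The first two steps described above (finding the longest stretch of trusted k-mers, and falling back to one-mismatch neighbors of the first k-mer) can be illustrated with a minimal Python sketch. This is not BFC's implementation: BFC is written in C and decides which k-mers are trusted from its own k-mer counts, whereas here `trusted` is simply a Python set supplied by the caller. All function names are hypothetical, and the objective-function search over correction, quality, and k-mer support penalties is not modeled.

```python
def kmers(read, k):
    """Return every k-mer of the read, in order of start position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def longest_trusted_stretch(read, k, trusted):
    """Find the longest substring on which every k-mer is trusted.

    Returns a half-open (start, end) interval on the read, or None
    when the read contains no trusted k-mer.
    `trusted` is an assumed, caller-supplied set of trusted k-mers.
    """
    best = None
    run_start = None
    flags = [km in trusted for km in kmers(read, k)]
    # The trailing False is a sentinel that closes a run ending at the last k-mer.
    for i, ok in enumerate(flags + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            # k-mers run_start..i-1 are trusted and cover read[run_start:i-1+k].
            span = (run_start, i - 1 + k)
            if best is None or span[1] - span[0] > best[1] - best[0]:
                best = span
            run_start = None
    return best


def one_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def rescue_first_kmer(read, k, trusted):
    """Return the single trusted neighbor of the first k-mer.

    Returns None when no neighbor, or more than one neighbor, is trusted,
    mirroring the "uncorrectable" case described in the text.
    """
    hits = [n for n in one_mismatch_neighbors(read[:k]) if n in trusted]
    return hits[0] if len(hits) == 1 else None
```

For example, with `trusted = {"ACG", "CGT", "GTA"}` and `k = 3`, `longest_trusted_stretch("ACGTAC", 3, trusted)` covers the substring `ACGTA`, because the final k-mer `TAC` is not trusted.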

BFC version: r181

The option to drop reads containing unique k-mers is selected by default. Consequently, any unpaired reads are also removed by default.
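As a rough illustration of this default, the Python sketch below drops any read that contains a k-mer seen only once in the data set, and removes the whole pair when either mate is dropped. It rests on several assumptions: exact counts from `collections.Counter` stand in for BFC's Bloom-filter-based counting, "unpaired" is interpreted as a read whose mate was dropped, and the function names are hypothetical rather than part of BFC or this wrapper.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer occurrence across all sequences (exact counts)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def has_unique_kmer(seq, k, counts):
    """True if any k-mer of the read occurs exactly once in the data set."""
    return any(counts[seq[i:i + k]] == 1 for i in range(len(seq) - k + 1))


def filter_pairs(pairs, k):
    """Keep only the pairs in which neither mate contains a unique k-mer.

    `pairs` is a list of (read1, read2) sequence tuples. Returns the kept
    pairs and the number of pairs removed. Dropping the surviving mate
    together with the filtered one is an assumption about what
    "unpaired reads are removed" means.
    """
    counts = count_kmers([seq for pair in pairs for seq in pair], k)
    kept = [
        (r1, r2)
        for r1, r2 in pairs
        if not has_unique_kmer(r1, k, counts) and not has_unique_kmer(r2, k, counts)
    ]
    return kept, len(pairs) - len(kept)
```

In this sketch a pair such as `("ACGT", "TTGA")` is removed as a whole when the k-mers of `TTGA` appear nowhere else, even though `ACGT` on its own would pass the filter.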

The BFC tool outputs the corrected reads, the number of reads filtered, the number of output reads, and the k-mer size used.

Related Publications

Li H. BFC: correcting Illumina sequencing errors. Bioinformatics. 2015 Sep 1;31(17):2885-7. https://www.ncbi.nlm.nih.gov/pubmed/25953801

App Specification:

https://github.com/psdehal/kb_bfc/tree/bd4c653197f1b899067694dc18220b6a05734273/ui/narrative/methods/run_bfc

Module Commit: bd4c653197f1b899067694dc18220b6a05734273