Assemble with ARAST

Assemble DNA reads into a set of contigs (an Assembly object) using the ARAST Assembly Service.

This App is now inactive according to the KBase Policy for App Deprecation, as it is no longer supported by the developer.

This app performs automatic genome assembly using current computational tools. One or more assemblers can be invoked so that their results can be compared. The resulting assemblies are automatically processed by a collection of analysis tools developed by both KBase and the research community. The app then suggests the best assembly to the user, judged by the smallest number of contigs and the longest average contig length.
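
The selection rule described above (fewest contigs, then longest average contig length) can be sketched as follows. This is a minimal illustration, not the ARAST implementation: the AssemblyStats class, the pick_best function, and the sample contig lengths are hypothetical, and using average length only as a tie-breaker is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssemblyStats:
    """Summary of one candidate assembly (hypothetical structure)."""
    name: str
    contig_lengths: List[int]  # must be non-empty

    @property
    def num_contigs(self) -> int:
        return len(self.contig_lengths)

    @property
    def mean_contig_length(self) -> float:
        return sum(self.contig_lengths) / len(self.contig_lengths)


def pick_best(assemblies: List[AssemblyStats]) -> AssemblyStats:
    """Prefer fewer contigs; break ties by longer average contig length."""
    return min(assemblies, key=lambda a: (a.num_contigs, -a.mean_contig_length))


# Invented example data for three assemblers.
candidates = [
    AssemblyStats("velvet", [1200, 900, 400, 300]),
    AssemblyStats("spades", [2000, 1500, 500]),
    AssemblyStats("idba", [1800, 1000, 700]),
]
best = pick_best(candidates)
print(best.name)
```

In this toy data, two assemblies tie on contig count, so the average contig length decides between them.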

Several assembly workflows or "recipes" are available. These workflows have been tuned and tested to fit certain dataset types or desired analysis criteria such as throughput or rigor. The compute engine's flexible nature also enables the rapid design and emulation of other popular protocols.

Additionally, custom workflows can be designed and executed in "pipeline" mode without having to compose complicated scripts. Workflows can be composed with combinations of quality filtering or trimming, error correction, adapter removal, assembly, scaffolding, or post-processing.
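
A custom workflow of this kind amounts to an ordered list of stages. The sketch below shows one hypothetical way to represent and check such a list in Python. The stage categories come from the paragraph above, but the ordering in STAGE_ORDER, the Step tuple, the validate_pipeline function, and the assignment of modules to stages are assumptions made for illustration and are not part of the ARAST interface.

```python
from typing import List, NamedTuple

# Stage categories named in the text, in an assumed execution order.
STAGE_ORDER = [
    "quality filtering or trimming",
    "error correction",
    "adapter removal",
    "assembly",
    "scaffolding",
    "post-processing",
]


class Step(NamedTuple):
    module: str  # module name, e.g. "velvet"
    stage: str   # one of STAGE_ORDER


def validate_pipeline(steps: List[Step]) -> bool:
    """Return True if all stages are known, in order, and include an assembly."""
    ranks = []
    for step in steps:
        if step.stage not in STAGE_ORDER:
            return False
        ranks.append(STAGE_ORDER.index(step.stage))
    in_order = all(a <= b for a, b in zip(ranks, ranks[1:]))
    has_assembly = any(s.stage == "assembly" for s in steps)
    return in_order and has_assembly


# Hypothetical pipeline: trim, correct, assemble, scaffold.
pipeline = [
    Step("trim_sort", "quality filtering or trimming"),
    Step("bhammer", "error correction"),
    Step("velvet", "assembly"),
    Step("sspace", "scaffolding"),
]
print(validate_pipeline(pipeline))
```

Representing the workflow as plain data keeps the check independent of how each tool is actually launched.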

Assembly Recipe Descriptions:

Automatic Assembly:

Provides a nice balance between "fast pipeline" and "smart pipeline"
Runs BayesHammer on reads
Assembles with Velvet[25], IDBA[20] and SPAdes[2]
Sorts assemblies by ALE score[7]

Fast Pipeline:

Assembles with A6[1], Velvet[25] and SPAdes[2] (with BayesHammer for error correction)
Results are sorted by ARAST quality score

Smart Pipeline:

Runs BayesHammer[19] on reads, KmerGenie[5] to choose hash-length for Velvet[25]
Assembles with Velvet[25], IDBA[20] and SPAdes[2]
Sorts assemblies by ALE score[7]
Merges the two best assemblies with GAM-NGS[24]

Kiki assembler[15]:

Runs Kiki assembler
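
The sort-then-merge step of the Smart Pipeline can be illustrated with a short sketch. It assumes that each assembly already has an ALE score and that a higher score is better. The scores below are invented, and rank_by_score and two_best_for_merge are hypothetical helpers, not ARAST or GAM-NGS functions.

```python
from typing import Dict, List, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Return assembly names from best to worst (higher score assumed better)."""
    return sorted(scores, key=scores.get, reverse=True)


def two_best_for_merge(scores: Dict[str, float]) -> Tuple[str, str]:
    """Pick the two top-ranked assemblies as inputs to a merger."""
    ranked = rank_by_score(scores)
    if len(ranked) < 2:
        raise ValueError("need at least two assemblies to merge")
    return ranked[0], ranked[1]


# Invented example scores for the three assemblers in the recipe.
ale_scores = {"velvet": -5.2e6, "idba": -4.1e6, "spades": -3.9e6}
print(rank_by_score(ale_scores))
print(two_best_for_merge(ale_scores))
```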

Pipeline Module Descriptions:

sspace: SSPACE pre-assembled contig scaffolder[3]
default values:
extend: False
minimum_overlap: 15
a: 0.4
m: -1
n: -1
k: -1
x: 0
trim_sort: DynamicTrim and LengthSort from SolexaQA[8]
default values:
probcutoff: 0.05
length: 25
filter_by_length: Length-based sequencing reads filter and trimmer based on seqtk[11]
default values:
min: 250
end: 200
sync: True
KmerGenie: Informed and automated k-mer size selection for genome assembly[5]

bwa: BWA aligner that maps reads to contigs[18]

velvet: Velvet de Bruijn graph-based assembler[25]
default values:
hash_length: 29
auto_insert: False

masurca: MaSuRCA assembler based on a hybrid of graph-based and overlap-based algorithms[26]
default values:
graph_k-mer_size: auto
use_linking_mates: auto
limit_jump_coverage: 60
ca_parameters: ovlMerSize=30 cgwErrorRate=0.25 ovlMemory=4GB
k-mer_count_threshold: 1
num_threads: auto
jf_size: 2000000000
do_homopolymer_trim: 0

sga_preprocess: SGA component for preprocessing reads (runs subcommand 'preprocess')[21]
default values:
quality_trim: 10
quality_filter: 20
min_length: 29
permute_ambiguous: True

bhammer: SPAdes component for quality control of sequence data[19]

idba: IDBA iterative graph-based assembler for single-cell and standard data[20]
default values:
max_k: 50
scaffold: True

prodigal: Prodigal microbial gene predictor[14]

fastqc: FastQC quality control tool for sequence data[10]

ray: Ray graph-based parallel microbial and metagenomic assembler[4]
default values:
k: 31

swap: SWAP Assembler[22]

gam_ngs: GAM-NGS genomic assemblies merger[24]

ale: ALE likelihood-based estimator of assembly quality[7]

kiki: Kiki overlap-based parallel microbial and metagenomic assembler[15]
default values:
k: 27
contig_threshold: 800

a5: A5 microbial assembly pipeline[23]

tagdust: TagDust sequencing artifacts remover[17]

sga_ec: SGA component for error correction (runs subcommands 'index' and 'correct')[21]

pacbio: PacBio non-hybrid assembly pipeline for SMRT long reads[6]

reapr: REAPR assembly error recognizer using paired-end reads[13]

a6: Modified A5 microbial assembly pipeline[1]

quast: QUAST assembly quality assessment tool (run by default)[12]

bowtie2: Bowtie2 aligner that maps reads to contigs[16]

discovar: Discovar assembly pipeline for Illumina 250+ bp reads[9]
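
The defaults listed above can be treated as a lookup table that user-supplied values override. The following sketch copies a few of the documented defaults into a Python dictionary. The names DEFAULTS and resolve_params, and the choice to reject unknown parameter names, are illustrative assumptions and do not describe ARAST behavior.

```python
from typing import Any, Dict, Optional

# A subset of the default values documented above.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "velvet": {"hash_length": 29, "auto_insert": False},
    "idba": {"max_k": 50, "scaffold": True},
    "ray": {"k": 31},
    "kiki": {"k": 27, "contig_threshold": 800},
    "trim_sort": {"probcutoff": 0.05, "length": 25},
}


def resolve_params(module: str,
                   overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge overrides onto a module's defaults, rejecting unknown names."""
    params = dict(DEFAULTS[module])  # copy so DEFAULTS is never mutated
    for key, value in (overrides or {}).items():
        if key not in params:
            raise KeyError(f"{module} has no parameter {key!r}")
        params[key] = value
    return params


print(resolve_params("velvet", {"hash_length": 31}))
print(resolve_params("kiki"))
```

Rejecting unknown names is a deliberate choice in this sketch: a mistyped parameter fails loudly instead of being silently ignored.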

Team members who developed and deployed the algorithm in KBase: Chris Bun, Fangfang Xia. For questions, contact us.

Related Publications

[1] A6 GitHub source: https://github.com/levinas/a5
[2] Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19: 455-477. doi:10.1089/cmb.2012.0021, https://www.liebertpub.com/doi/10.1089/cmb.2012.0021
[3] Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27: 578-579. doi:10.1093/bioinformatics/btq683, https://academic.oup.com/bioinformatics/article/27/4/578/197626
[4] Boisvert S, Raymond F, Godzaridis É, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biology. 2012;13: R122. doi:10.1186/gb-2012-13-12-r122, https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-12-r122
[5] Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2014;30: 31-37. doi:10.1093/bioinformatics/btt310, https://academic.oup.com/bioinformatics/article/30/1/31/235479
[6] Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods. 2013;10: 563-569. doi:10.1038/nmeth.2474, https://www.nature.com/articles/nmeth.2474
[7] Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics. 2013;29: 435-443. doi:10.1093/bioinformatics/bts723, https://academic.oup.com/bioinformatics/article/29/4/435/199222
[8] Cox MP, Peterson DA, Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010;11: 485. doi:10.1186/1471-2105-11-485, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-485
[9] Discovar source: https://software.broadinstitute.org/software/discovar/blog/
[10] FastQC source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[11] Filter by Length GitHub source: https://github.com/levinas/seqtk
[12] Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29: 1072-1075. doi:10.1093/bioinformatics/btt086, https://academic.oup.com/bioinformatics/article/29/8/1072/228832
[13] Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biology. 2013;14: R47. doi:10.1186/gb-2013-14-5-r47, https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-5-r47
[14] Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119
[15] Kiki GitHub source: https://github.com/GeneAssembly/kiki
[16] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9: 357-359. doi:10.1038/nmeth.1923, https://www.nature.com/articles/nmeth.1923
[17] Lassmann T, Hayashizaki Y, Daub CO. TagDust - a program to eliminate artifacts from next generation sequencing data. Bioinformatics. 2009;25: 2839-2840. doi:10.1093/bioinformatics/btp527, https://academic.oup.com/bioinformatics/article/25/21/2839/227883
[18] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25: 1754-1760. doi:10.1093/bioinformatics/btp324, https://academic.oup.com/bioinformatics/article/25/14/1754/225615
[19] Nikolenko SI, Korobeynikov AI, Alekseyev MA. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics. 2013;14: S7. doi:10.1186/1471-2164-14-S1-S7, https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-S1-S7
[20] Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420-1428. doi:10.1093/bioinformatics/bts174, https://academic.oup.com/bioinformatics/article/28/11/1420/266973
[21] Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22: 549-556. doi:10.1101/gr.126953.111, https://genome.cshlp.org/content/22/3/549.abstract
[22] SWAP-Assembler source: https://sourceforge.net/projects/swapassembler/
[23] Tritt A, Eisen JA, Facciotti MT, Darling AE. An Integrated Pipeline for de Novo Assembly of Microbial Genomes. PLOS ONE. 2012;7: e42304. doi:10.1371/journal.pone.0042304, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0042304
[24] Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics. 2013;14: S6. doi:10.1186/1471-2105-14-S7-S6, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S7-S6
[25] Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18: 821-829. doi:10.1101/gr.074492.107, https://genome.cshlp.org/content/18/5/821
[26] Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29: 2669-2677. doi:10.1093/bioinformatics/btt476, https://academic.oup.com/bioinformatics/article/29/21/2669/195975

App Specification:

https://github.com/kbaseapps/ARAST_SDK/tree/056582c691c4df190110b059600d2dc2a3a8b80a/ui/narrative/methods/run_arast

Module Commit: 056582c691c4df190110b059600d2dc2a3a8b80a