Generated August 14, 2020

What is Delftia?

[Figure: delftia cluster and protein]

Delftia is a genus of bacteria with a bunch of cool features! The best-studied species, Delftia acidovorans, can produce gold nanoparticles from gold ions in solution. This bacterium has been found living in biofilms with Cupriavidus metallidurans on gold nuggets (1). It is also found in soil, in sinks, and in the rhizospheres of different plants, where it promotes their growth (3).

Delftia acidovorans forms gold nuggets by producing a short nonribosomal peptide called delftibactin (1). The 16 genes responsible for the production of delftibactin are called the del cluster (delA-delP) (1). These genes were originally discovered in Delftia acidovorans SPH-1, but our research shows that the del cluster appears to be present across the genus (1).

http://2013.igem.org/Team:Heidelberg/Project/Delftibactin

The Phylogenetic Classification of Delftia

Proteobacteria

Betaproteobacteria

Burkholderiales

Comamonadaceae

Delftia

How will we find Delftia?

Our Workflow:

[Figure: Narrative Methods Figure]

Each step will be outlined before you reach it with a description of the apps, parameters and results.

We'll start with a number of raw reads, clean them up and do a little taxonomy to figure out what we might be able to assemble. Next, we'll assemble the reads using 4 different methods and pick out the best assembly. Then we'll take the contigs from that assembly, sort them into bins by genome and assess the quality of those genomes. We'll select the high quality genomes to annotate and insert into phylogenetic trees to find relatives.

Learning Objectives:

After completing this narrative, students will be able to:

  • Describe the steps of assembling metagenomic data into genomes and the importance of each step.
  • Define the following terms and processes: read trimming, contig(s), bin(s), binning, MAG, L50, and N50.
  • Interpret the results of FastQC and CheckM reports.
  • Explain why multiple assemblers are used.
  • Compare and contrast assembly statistics to determine the best assembly to generate MAGs.
  • Identify high quality bins to extract and annotate.

What samples am I searching for Delftia?

Sample 1: Metagenome generated from fracture fluid collected July 12-27, 2012 from a borehole located on the 26 level of Beatrix Gold Mine (Welkom, South Africa) 1,339 m below land surface.

Bio Sample: SAMN04419121; Sample name: Be326_2012_DNA_MF; SRA: SRS1792548 Link to more information: https://www.ncbi.nlm.nih.gov/biosample/SAMN04419121

Why did I pick this sample?

Delftia acidovorans and Cupriavidus metallidurans make up 90% of the bacteria found in biofilms on gold nuggets (1). This sample is from a gold mine, so it would be interesting to see if Delftia is also present here. Additionally, the Krona plot indicated that there were some (unassembled) sequences that resembled Delftia.

Sample 2: Hydraulically fractured gas well metagenomes fluid from sample at timepoint_82

Bio Sample: SAMN04417545; Sample name: Timepoint_82; SRA: SRS1256393 Link to more information: https://www.ncbi.nlm.nih.gov/biosample/SAMN04417545

Why did I pick this sample?

Delftia is often found in soil and water. It is also capable of using a diverse array of carbon sources. (That's why so many studies focus on its use for bioremediation.) In this case the Krona plot indicated a small portion of potential Delftia sequences.

Sample 3: Subsurface sediment microbial communities from gas well in Oklahoma, United States - OK STACK MC-FT3-sol metagenome

Bio Sample: SAMN09199659; SRA: SRS3667068; DOE Joint Genome Institute: Gp0290895 Link to more information: https://www.ncbi.nlm.nih.gov/biosample/SAMN09199659

Why did I pick this sample?

I stumbled upon this sample while I was searching through the database. The Krona plot showed a much greater proportion of genes that could potentially be from Delftia, and I couldn't skip it. Delftia is often found in soil and/or sediment.

Sample 4: Metagenome of iron plaque on rice root from As contaminated paddy soil, sample from Yanhong

Bio Sample: SAMN07211852; Sample name: YanhongMeta01; SRA: SRS2392319 Link to more information: https://www.ncbi.nlm.nih.gov/biosample/SAMN07211852

Why did I pick this sample?

I chose this sample because it brings together the soil and aquatic environments that Delftia can be found in, and because it is associated with heavy metals. The Krona plot indicated that some Delftia-associated sequences are present from several species of Delftia.

Sample 5: Peat soil microbial communities from Stordalen Mire, Sweden - IR.F.S.T-25

Bio Sample: SAMN09201211; SRA: SRS3568559; DOE Joint Genome Institute: Gp0256443 Link to more information: https://www.ncbi.nlm.nih.gov/sra/SRX4415252[accn]

Why did I pick this sample?

I picked this sample because soil is one of the common sources of Delftia. The Krona plot looked promising as well for both Delftia acidovorans and Delftia tsuruhatensis.

Step 1. Import Metagenomic Data

The data I'm using comes from publicly available datasets in NCBI's Sequence Read Archive (SRA).

The sample you need to upload should be a metagenomic sample listed as WGS with paired reads. Your sample cannot be more than 20 Gbases, or you'll need to import it through Globus (see this link for more information). You can also check out this really informative narrative for an in-depth view of how to upload data from other sources: https://narrative.kbase.us/narrative/48493
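
As a rough check against the 20 Gbase limit, the size of a library can be estimated from its read count and read length. This is only a sketch: the function names and the example numbers are made up for illustration, and the SRA record for a sample normally lists the total number of bases directly.

```python
# Rough size estimate for a read library (illustrative helper names).
IMPORT_LIMIT_BASES = 20_000_000_000  # the 20 Gbase ceiling mentioned above


def estimated_bases(read_pairs, read_length, paired=True):
    """Approximate total bases: each pair contributes two reads."""
    reads_per_fragment = 2 if paired else 1
    return read_pairs * reads_per_fragment * read_length


def needs_globus(read_pairs, read_length, paired=True):
    """True when the estimate exceeds the web-import limit."""
    return estimated_bases(read_pairs, read_length, paired) > IMPORT_LIMIT_BASES


# Hypothetical library: 60 million read pairs, 150 bp per read.
print(estimated_bases(60_000_000, 150))
print(needs_globus(60_000_000, 150))
```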

App: Import SRA File as Reads from Web

Timing: 1-5 hours depending on the size of the file and queue time. (It's often helpful to run this in the morning so it uploads and then you can set up the assembly to run overnight or over the weekend.)

View Configure:

SRA URL: Use the first link from the "Reads Access" page and paste it into this block.

Reads Object Name: This will be what your sample is called once it's been uploaded into KBase. Make sure it's something you'll be able to remember and follow through the workflow.

Sequencing Technology: Select the sequencing technology that was used to call the reads.

Single Genome: These reads are all metagenomes, so you don't need to select this box.

Results: Your reads should appear in the Data panel to the left. Before proceeding, double check that they are a PairedEndLibrary. If they are a SingleEndLibrary, you'll need to select a different sample, because your assembly won't work with a SingleEndLibrary. In the results panel itself you'll see some stats from the reads, including the number of reads, the mean quality score and the mean read length.

Import an SRA file from a web URL into your Narrative as a Reads data object.

Created objects (one import run per sample, each completed without errors):

  • Subsurface_gold_mine_reads (PairedEndLibrary, Imported Reads): 45m 20s
  • Hydraulic_fracture_well_fluid_raw_reads (PairedEndLibrary, Imported Reads): 53m 19s
  • Subsurface_gas_well_reads (PairedEndLibrary, Imported Reads): 39m 34s
  • Rice_root_iron_plaque_reads (PairedEndLibrary, Imported Reads): 46m 56s
  • Peat_soil_raw_reads (PairedEndLibrary, Imported Reads): 1h 22m 38s

Step 2. Assess the Quality of your Samples

You've uploaded your data, great! Now you have to check the quality of the reads. First assess quality with FastQC. Then, if you need to, trim the reads and assess quality again to make sure the low quality reads have been removed from the sample.

Apps:

  1. Assess Read Quality with FastQC
  2. Trim Reads with Trimmomatic

Timing: 5-20 minutes depending on queue and the number of reads

Assess Read Quality with FastQC View Configure:

  • All you need to input to assess quality is the name of your reads library.

Results: This app will give you a full report of the quality of your reads. The report will have two pages, for the forward and reverse reads in the PairedEndLibrary. Here we'll be focusing on the Per Base Sequence Quality, but the rest of the report offers a bunch of useful information about our library.

  1. Per Base Sequence Quality: An overview of the quality scores for each position in the read. The red line is the median, blue the mean, the yellow bars the interquartile range and the black lines the 10% and 90% points.
  2. Per Sequence Quality Scores: This shows the distribution of quality scores across sequences. Ideally you will only see a single peak on the far right. If you see a second peak to the center or left, you may have a subset of low quality reads that need to be trimmed out.
  3. Per Base Sequence Content: This shows the proportion of each base present at each position in your reads. In a perfectly random run, you would expect to see parallel lines across this chart, but the relative amount of each base would depend on the genome(s) sequenced.
  4. Per Sequence GC Content: In a random library, you would expect this to fall across a normal distribution with a peak over the GC content of the chromosome. An unusually shaped distribution can indicate contamination or systemic error.
  5. Per Base N Content: Whenever a base cannot be called, the sequencer labels it as N, for any nucleotide. This chart shows the percent of bases at each position that have been called as "N". It's not unusual to see some of these at the end of sequences when the quality drops, but a consistently high proportion or a peak at the start or middle of the sequence may indicate a problem in the analysis.
  6. Sequence Length Distribution: This shows the distribution of read sizes.
  7. Sequence Duplication Levels: This plot shows the relative number of sequences with each degree of duplication. It's based only on a subset of the data, but you should get a good idea of how many sequences have duplicates. If a large share of sequences falls at a duplication level of 2 or more, you may need to trim the library to remove them.
  8. Overrepresented Sequences: If any sequences represent more than 0.1% of the total sequences, they will be listed here. An abundance of any one sequence may indicate contamination or a lack of diversity in the library, or it may be biologically significant.
  9. Adapter Content: This graph shows where any adapters are found in your library.
  10. Kmer Content: If any Kmers are overrepresented in your library, they will be listed here.

For a more in depth look at understanding your FastQC report, check out the manual here: https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf
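
To make the Per Base Sequence Quality plot more concrete, here is a minimal sketch of how the mean and median quality at each read position could be computed from the quality lines of a FASTQ file. It assumes Phred+33 encoding (the usual encoding for recent Illumina data, but an assumption here), and the three quality strings are made up; it is not how FastQC itself is implemented.

```python
from statistics import mean, median


def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def per_position_quality(quality_lines):
    """Return (position, mean, median) for each 1-based read position.

    Reads may have different lengths; shorter reads simply contribute
    nothing to the later positions.
    """
    columns = {}
    for line in quality_lines:
        for position, score in enumerate(phred_scores(line), start=1):
            columns.setdefault(position, []).append(score)
    return [
        (position, mean(scores), median(scores))
        for position, scores in sorted(columns.items())
    ]


# Three made-up quality strings whose quality drops toward the 3' end.
example_quals = ["IIII#", "IIH5#", "II?#"]
for position, avg, med in per_position_quality(example_quals):
    print(position, round(avg, 1), med)
```

A drop in the mean and median toward the end of the reads, as in this toy input, is the pattern that usually motivates trimming.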

Trim Reads with Trimmomatic View Configure:

  • Read library or set: Select the library you want to trim.

  • Parameters: This section covers different trimming parameters specific to removing adapters, cropping the sequences, and the quality thresholds required to trim a read. In this workflow, I'm going to leave them all as defaults, but you can learn more about them in the App Info page and in the KBase App catalog.

  • Output library name: Make sure to specify that this library has been trimmed so you can tell the two libraries apart later.

Once you've trimmed the read library, reassess the quality of the PairedEndLibrary using FastQC.
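
One way a quality threshold can be applied, and one of the options Trimmomatic offers, is a sliding window: scan the read from the 5' end and cut it where the average quality inside the window drops below the threshold. The sketch below is a simplified illustration of that idea, not Trimmomatic's actual implementation; the window size of 4 and threshold of 15 are illustrative values, not necessarily the app defaults.

```python
def sliding_window_trim(scores, window=4, min_mean_quality=15):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean Phred
    score falls below min_mean_quality (simplified rule).
    """
    if not scores:
        return 0
    last_start = max(len(scores) - window, 0)
    for start in range(last_start + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_mean_quality:
            return start
    return len(scores)


# Made-up Phred scores for one read whose quality decays at the 3' end.
read_scores = [38, 38, 37, 36, 30, 12, 8, 5, 2, 2]
keep = sliding_window_trim(read_scores)
print("bases kept:", keep)
```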

Results: A FastQC Report that details the quality of the reads you've submitted and possibly trimmed libraries.

A quality control application for high throughput sequence data.
This app completed without errors in 22m 53s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Subsurface_gold_mine_reads_67335_2_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Subsurface_gold_mine_reads_67335_2_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 31m 59s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Hydraulic_fracture_well_fluid_raw_reads_67335_4_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Hydraulic_fracture_well_fluid_raw_reads_67335_4_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 19m 56s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Subsurface_gas_well_reads_67335_6_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Subsurface_gas_well_reads_67335_6_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 33m 12s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Rice_root_iron_plaque_reads_67335_8_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Rice_root_iron_plaque_reads_67335_8_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
A quality control application for high throughput sequence data.
This app completed without errors in 20m 11s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Peat_soil_raw_reads_67335_20_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Peat_soil_raw_reads_67335_20_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report

Q1 Looking at the results of the FastQC app, which reads set(s) would you choose to trim and re-assess the quality of? Why did you pick that/those set(s)?

Trim paired- or single-end Illumina reads with Trimmomatic.
This app completed without errors in 2h 11m 54s.
Created objects:
  • subsurface_gold_mine_reads_trimmed_paired (PairedEndLibrary): Trimmed Reads
  • subsurface_gold_mine_reads_trimmed_unpaired_fwd (SingleEndLibrary): Trimmed Unpaired Forward Reads
  • subsurface_gold_mine_reads_trimmed_unpaired_rev (SingleEndLibrary): Trimmed Unpaired Reverse Reads
A quality control application for high throughput sequence data.
This app completed without errors in 21m 48s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • subsurface_gold_mine_reads_trimmed_paired_67335_14_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • subsurface_gold_mine_reads_trimmed_paired_67335_14_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
Trim paired- or single-end Illumina reads with Trimmomatic.
This app completed without errors in 38m 12s.
Created objects:
  • Peat_soil_reads_trimmed_paired (PairedEndLibrary): Trimmed Reads
  • Peat_soil_reads_trimmed_unpaired_fwd (SingleEndLibrary): Trimmed Unpaired Forward Reads
  • Peat_soil_reads_trimmed_unpaired_rev (SingleEndLibrary): Trimmed Unpaired Reverse Reads
A quality control application for high throughput sequence data.
This app completed without errors in 20m 20s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • Peat_soil_reads_trimmed_paired_67335_33_1.fwd_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report
  • Peat_soil_reads_trimmed_paired_67335_33_1.rev_fastqc.zip - Zip file generated by fastqc that contains original images seen in the report

Step 3. Taxonomy Before Assembly

Before we assemble these libraries, it will be helpful to get an idea of what's present in our samples and at what abundance. KBase has two apps to do this, Kaiju and GOTTCHA2.

Apps:

  1. Classify Taxonomy of Metagenomic Reads with Kaiju
  2. Classify Taxonomy of Metagenomic Reads with GOTTCHA2

Timing: 20 mins-2 hours depending on queue and number of reads you're running

1. Classify Taxonomy of Metagenomic Reads with Kaiju

This app translates reads into proteins and uses those sequences to identify what's present or possibly present in the sample.

View Configure:

  • Read Library or Set: You can either run the app for each library or run it once with all read libraries.
  • Taxonomic Level: By default the app will show all levels from phylum to species.
  • Reference DB: Here you'll select the database to compare your reads against. Either RefSeq or NCBI BLAST is fine for our purposes, and you don't need to include eukaryotes, since we're just looking for a bacterium.
  • Low Abundance Filter: This value filters out taxa that occur infrequently in your sample and helps simplify your results. After all, if something is present in low abundance, you're unlikely to be able to produce an assembled genome from a small amount of reads. The default is fine, but if you want to look at what is present in low abundance you can lower it.
  • Subsample Percent, Subsample Replicates, and Subsample Seed: To save time, Kaiju only looks at a subsample of your reads. These settings let you adjust the percentage of reads in each subsample, the number of replicate subsamples, and the random seed. If the subsamples are truly random, there should be very little variation between replicates, which is generally what the individual runs below show.
  • Filter Low Complexity: This should be set to filter so the algorithm only takes unique protein sequences into account.
  • Allow Imperfect Matches? Min Match Length, Allow Imperfect Matches, Greedy Max Mismatches, Greedy Min Bit Score, Greedy Max E-Value: These settings all relate to how much you want the algorithm to tolerate mismatches in protein sequences. How specific you want it to be is up to you. In this case we don't need it to be extremely specific, since we only want a quick overview of what's present.
  • Sort Plots By: How you sort the plots is up to you, mine are sorted by total abundance, but you can also sort them alphabetically.

Results: Your results will be a series of tables showing the breakdown of your sample beginning with the phyla and ending at species. The tail includes everything that is present below the low abundance filter.
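
The low abundance filter and the "tail" can be pictured with a small sketch: convert read counts per taxon into percentages, keep the taxa at or above the cutoff, and lump everything else into the tail. The counts and the 0.5% cutoff below are made up for illustration, and this is not Kaiju's own code.

```python
def summarize_abundance(read_counts, low_abundance_percent=0.5):
    """Split taxa into (kept, tail_percent) using a percentage cutoff.

    kept is a list of (taxon, percent) sorted from most to least abundant.
    """
    total = sum(read_counts.values())
    if total == 0:
        return [], 0.0
    kept = {}
    tail = 0.0
    for taxon, count in read_counts.items():
        percent = 100.0 * count / total
        if percent >= low_abundance_percent:
            kept[taxon] = percent
        else:
            tail += percent
    ordered = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return ordered, tail


# Hypothetical phylum-level read counts.
counts = {"Proteobacteria": 900, "Firmicutes": 96, "Chloroflexi": 3, "Aquificae": 1}
kept, tail = summarize_abundance(counts)
for taxon, percent in kept:
    print(f"{taxon}: {percent:.1f}%")
print(f"tail: {tail:.1f}%")
```

Lowering the cutoff moves taxa out of the tail and into the table, which is why a lower value shows more of what is present at low abundance.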

2. Classify Taxonomy of Metagenomic Reads with GOTTCHA2

Unlike Kaiju, this app shows relative abundance based on unique nucleotide sequences from RefSeq.

View Configure:

  • Read Library/Set: Add your reads library here.
  • Reference DB: You can either select the bacterial/viral/archaeal database or fungal database. In this case, we'll be using the first option.
  • Minimum Coverage: This is the minimum percentage of the unique genome signatures identified to be considered in the abundance calculation. Decreasing it will include lower abundance species, while increasing it will remove them.
  • Minimum Reads: The minimum number of reads to be included in the abundance calculation.
  • Minimum Length: The minimum length of reads for them to be included in the abundance calculation. If your library contains many short reads, you may want to reduce this value, but be aware that reducing it also increases the chance that you will get an incorrect result.
  • Maximum Zscore: This is based on the estimated Zscore. The default is fine for our Delftia search.

Results: There are three ways you can view the results from GOTTCHA2. The first is as a table showing the classification of your reads, some statistics regarding their abundance and their relative abundance. The second is as a phylogenetic tree showing the relationships of the different taxa identified in your sample. The third layout is as a Krona plot, an interactive plot that displays relative abundance and phylogenetic relationships. Clicking on a phylum will zoom in to show the classes within it. How far you can zoom down depends on the sample and the unique sequences in it.

These first two runs of Kaiju include the first 4 libraries run as one first with NCBI BLAST as the database, then with RefSeq as the database. The following 4 examples show the libraries individually with different settings to increase the proportion of the library represented.

Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 2h 56m 47s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip
Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 2h 46m 12s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip

These runs of Kaiju are broken down by sample and use the NCBI BLAST database. They are run on two subsamples.

Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 54m 12s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip

Q2 Do the two subsamples vary much? Is this expected or unexpected and why?

Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 57m 10s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip
Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 49m 30s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip
Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 54m 46s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip
Allows users to perform taxonomic classification of shotgun metagenomic read data with Kaiju.
This app completed without errors in 1h 21m 11s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • kaiju_classifications.zip
  • kaiju_summaries.zip
  • krona_data.zip
  • stacked_bar_abundance_plots_PNG+PDF.zip

This first run of GOTTCHA2 includes ALL read libraries as one sample. The 4 runs afterwards show the samples individually.

Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 2h 3m 1s.
Summary
GOTTCHA2 run finished on d0e40104-e7db-43b8-8463-8109aa0443da.inter.fastq.gz,617c1e51-fae4-4a3f-9081-45cd1b3f3189.inter.fastq.gz,3e51f290-7b31-4053-bf46-2f6ca081f0a3.inter.fastq.gz,561134de-8627-4cd0-ac5c-151284f468ec.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.out.list
  • gottcha2.gottcha_species.sam
  • gottcha2.tsv
  • html_report
  • gottcha2.out.tab_tree
  • gottcha2.gottcha_species.log
  • gottcha2.summary.tsv
  • gottcha2.krona.html
  • gottcha2.full.tsv
  • gottcha2.lineage.tsv
Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 58m 45s.
Summary
GOTTCHA2 run finished on 617c1e51-fae4-4a3f-9081-45cd1b3f3189.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.out.tab_tree
  • gottcha2.gottcha_species.sam
  • gottcha2.gottcha_species.log
  • gottcha2.full.tsv
  • gottcha2.tsv
  • gottcha2.summary.tsv
  • gottcha2.lineage.tsv
  • gottcha2.krona.html
  • html_report
  • gottcha2.out.list

Q3: Open the Krona plot from this sample. Viruses make up what percent of the sample?

Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 41m 1s.
Summary
GOTTCHA2 run finished on 3e51f290-7b31-4053-bf46-2f6ca081f0a3.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.lineage.tsv
  • html_report
  • gottcha2.out.tab_tree
  • gottcha2.full.tsv
  • gottcha2.gottcha_species.log
  • gottcha2.out.list
  • gottcha2.summary.tsv
  • gottcha2.krona.html
  • gottcha2.tsv
  • gottcha2.gottcha_species.sam
Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 26m 35s.
Summary
GOTTCHA2 run finished on d0e40104-e7db-43b8-8463-8109aa0443da.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.tsv
  • gottcha2.gottcha_species.log
  • gottcha2.gottcha_species.sam
  • gottcha2.full.tsv
  • gottcha2.lineage.tsv
  • gottcha2.summary.tsv
  • gottcha2.out.list
  • gottcha2.out.tab_tree
  • gottcha2.krona.html
  • html_report

Q4: What is the relative abundance of Betaproteobacteria in this sample?

Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 41m 20s.
Summary
GOTTCHA2 run finished on 561134de-8627-4cd0-ac5c-151284f468ec.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.gottcha_species.sam
  • gottcha2.gottcha_species.log
  • gottcha2.out.list
  • html_report
  • gottcha2.full.tsv
  • gottcha2.summary.tsv
  • gottcha2.tsv
  • gottcha2.out.tab_tree
  • gottcha2.lineage.tsv
  • gottcha2.krona.html
Uses GOTTCHA2 to provide taxonomic classifications of shotgun metagenomic reads data.
This app completed without errors in 21m 28s.
Summary
GOTTCHA2 run finished on f00edf09-9f80-4430-9bb1-917ac99aa5ca.inter.fastq.gz against RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • gottcha2.gottcha_species.sam
  • html_report
  • gottcha2.tsv
  • gottcha2.out.tab_tree
  • gottcha2.out.list
  • gottcha2.summary.tsv
  • gottcha2.krona.html
  • gottcha2.full.tsv
  • gottcha2.gottcha_species.log
  • gottcha2.lineage.tsv

Q5: Which sample(s) look the most promising based on the taxonomy results from GOTTCHA2?

Step 4. Assembly of Metagenomic Data

Alright, you've made it this far. This step takes the longest, so you may want to set it up to run overnight or over the weekend. This step will take our read libraries and line them up into longer sequences called contigs. Later we'll sort these contigs based on what genomes they came from. We'll be using three different apps to generate 4 different sets of contigs.

Apps:

  1. Assemble Reads with MetaSPAdes
  2. Assemble Reads with MEGAHIT (we'll run this app twice)
  3. Assemble Reads with IDBA-UD

Timing: hours to days (One of my assemblies below ran for almost 4 days.)

1. Assemble Reads with MetaSPAdes

View Configure:

  • Read Library: Add your reads here. If you trimmed your library, be sure to use the trimmed version. MetaSPAdes will ONLY accept a PairedEnd Library.
  • Minimum Contig Length: The smallest contig the assembler can produce. Larger contigs have a higher chance of containing a whole gene or unique sequence, but there will be fewer of them. Smaller contigs have a lower chance of containing a whole gene and can wind up contaminating your genomes if they are binned incorrectly. However, there will be a lot more of them. The default here is 2,000 and that's what I'm sticking with below for all my assemblies.
  • K-mer Sizes: K-mers are short subsequences of a fixed length k. The assembler breaks the reads into k-mers and uses them to find the overlapping regions it can join into contigs. If you want to learn more about this process, check out this video.
  • Assembly Only (no error correction): Don't select this; for our search we want to correct any errors that may occur.
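
As a toy illustration of the k-mer idea, the sketch below breaks two made-up reads into k-mers and lists the ones they share. Shared k-mers are the kind of evidence an assembler uses to decide that two reads overlap; real assemblers do far more than this (graph construction, error handling, multiple k values), so treat it only as a picture of the concept.

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def shared_kmers(read_a, read_b, k):
    """Return the set of k-mers found in both reads."""
    return set(kmers(read_a, k)) & set(kmers(read_b, k))


# Two made-up reads where the end of read_a overlaps the start of read_b.
read_a = "ATGGCGTACG"
read_b = "CGTACGTTAA"
print(kmers(read_a, 4))
print(sorted(shared_kmers(read_a, read_b, 4)))
```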

2. Assemble Reads with MEGAHIT (run 2x)

View Configure:

  • Read Library: Add your reads here. If you have trimmed the library, use the trimmed version. MEGAHIT will only accept a PairedEnd Library.
  • Parameter Preset: You'll run this app twice, once set at "Meta-sensitive" and again set at "Meta-large."
  • Minimum Contig Length: The default here is fine. 2,000 bp is a good size for your smallest contig; at that size, hopefully they'll all contain some identifying feature(s).
  • K-mers: I left these at their defaults. If you want to learn more about K-mers and their role in assembly, check out the video above.
  • Output Assembly Name: Be sure to specify in the name which parameter preset you used to create the assembly.

3. Assemble Reads with IDBA-UD

View Configure:

  • Reads Library: Add in your reads here. If they have been trimmed, you should add in the trimmed, paired library. IDBA-UD can accept a SingleEnd Library, but you should use the PairedEnd library.
  • Minimum Contig Length: This sets the size for the smallest your contigs can be. The default is 2,000.

Results: Regardless of the assembler you use, you'll get the same kind of result: a QUAST report. This report details the major features of your assembly and important statistics, such as the length of the longest contig, the number of contigs longer than 1,000,000 bases, the N50 and the L50. In the next step we'll compare these statistics and receive a visual representation of the key parts of this report. For now, take note of which assembly contains the most base pairs.
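
N50 and L50 are easier to remember once you have computed them by hand. Sort the contigs from longest to shortest and add up their lengths until you reach half of the total assembly size: the length of the contig that gets you there is the N50, and the number of contigs it took is the L50. A minimal sketch with made-up contig lengths:

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Made-up assembly: six contigs totalling 290 bases.
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20])
print("N50:", n50, "L50:", l50)
```

In general, a larger N50 and a smaller L50 mean that more of the assembly sits in long contigs.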

Sample 1. "Subsurface Gold Mine Reads"

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 5h 0m 24s.
Created object: Subsurface_gold_mine_IDBA.contigs (Assembly, Assembled contigs)
Summary: Assembly saved to: rkrebs:narrative_1594573720077/Subsurface_gold_mine_IDBA.contigs. Assembled into 12820 contigs. Avg Length: 8513.19953198 bp.
Contig Length Distribution (# of contigs -- min to max basepairs):
  • 12576 -- 2099.0 to 51218.6 bp
  • 188 -- 51218.6 to 100338.2 bp
  • 40 -- 100338.2 to 149457.8 bp
  • 9 -- 149457.8 to 198577.4 bp
  • 2 -- 198577.4 to 247697.0 bp
  • 1 -- 247697.0 to 296816.6 bp
  • 1 -- 296816.6 to 345936.2 bp
  • 2 -- 345936.2 to 395055.8 bp
  • 0 -- 395055.8 to 444175.4 bp
  • 1 -- 444175.4 to 493295.0 bp

Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 1h 23m 3s.
Created object: subsurface_gold_mine_meta_sensitive_MEGAHIT.assembly (Assembly, Assembled contigs)
Summary: ContigSet saved to: rkrebs:narrative_1594573720077/subsurface_gold_mine_meta_sensitive_MEGAHIT.assembly. Assembled into 17096 contigs. Avg Length: 8411.282229761347 bp.
Contig Length Distribution (# of contigs -- min to max basepairs):
  • 16943 -- 2000.0 to 87200.5 bp
  • 122 -- 87200.5 to 172401.0 bp
  • 18 -- 172401.0 to 257601.5 bp
  • 8 -- 257601.5 to 342802.0 bp
  • 2 -- 342802.0 to 428002.5 bp
  • 0 -- 428002.5 to 513203.0 bp
  • 2 -- 513203.0 to 598403.5 bp
  • 0 -- 598403.5 to 683604.0 bp
  • 0 -- 683604.0 to 768804.5 bp
  • 1 -- 768804.5 to 854005.0 bp

Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 1h 28m 6s.
Created object: subsurface_gold_mine_meta_large_MEGAHIT.assembly (Assembly, Assembled contigs)
Summary: ContigSet saved to: rkrebs:narrative_1594573720077/subsurface_gold_mine_meta_large_MEGAHIT.assembly. Assembled into 16132 contigs. Avg Length: 8564.84496652616 bp.
Contig Length Distribution (# of contigs -- min to max basepairs):
  • 15980 -- 2000.0 to 87200.1 bp
  • 120 -- 87200.1 to 172400.2 bp
  • 20 -- 172400.2 to 257600.3 bp
  • 6 -- 257600.3 to 342800.4 bp
  • 3 -- 342800.4 to 428000.5 bp
  • 0 -- 428000.5 to 513200.6 bp
  • 0 -- 513200.6 to 598400.7 bp
  • 1 -- 598400.7 to 683600.8 bp
  • 1 -- 683600.8 to 768800.9 bp
  • 1 -- 768800.9 to 854001.0 bp

Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 8h 17m 9s.
Created object: subsurface_gold_mine_SPAdes.contigs (Assembly, Assembled contigs)
Summary: Assembly saved to: rkrebs:narrative_1594573720077/subsurface_gold_mine_SPAdes.contigs. Assembled into 19935 contigs. Avg Length: 7915.2622021570105 bp.
Contig Length Distribution (# of contigs -- min to max basepairs):
  • 19788 -- 2000.0 to 91549.5 bp
  • 112 -- 91549.5 to 181099.0 bp
  • 24 -- 181099.0 to 270648.5 bp
  • 4 -- 270648.5 to 360198.0 bp
  • 3 -- 360198.0 to 449747.5 bp
  • 1 -- 449747.5 to 539297.0 bp
  • 0 -- 539297.0 to 628846.5 bp
  • 0 -- 628846.5 to 718396.0 bp
  • 2 -- 718396.0 to 807945.5 bp
  • 1 -- 807945.5 to 897495.0 bp

Sample 2. "Hydraulic fracture well fluid"

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 2h 51m 54s.
Objects
Created Object Name Type Description
hydraulic_fracture_well_IDBA.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/hydraulic_fracture_well_IDBA.contigs Assembled into 2503 contigs. Avg Length: 9722.07391131 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 2403 -- 2005.0 to 41371.9 bp 76 -- 41371.9 to 80738.8 bp 11 -- 80738.8 to 120105.7 bp 5 -- 120105.7 to 159472.6 bp 5 -- 159472.6 to 198839.5 bp 2 -- 198839.5 to 238206.4 bp 0 -- 238206.4 to 277573.3 bp 0 -- 277573.3 to 316940.2 bp 0 -- 316940.2 to 356307.1 bp 1 -- 356307.1 to 395674.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 41m 51s.
Objects
Created Object Name Type Description
meta_sensitive_hydraulic_well_2_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/meta_sensitive_hydraulic_well_2_MEGAHIT.assembly Assembled into 3001 contigs. Avg Length: 9164.475508163945 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 2848 -- 2000.0 to 33992.3 bp 105 -- 33992.3 to 65984.6 bp 25 -- 65984.6 to 97976.9 bp 9 -- 97976.9 to 129969.2 bp 4 -- 129969.2 to 161961.5 bp 3 -- 161961.5 to 193953.8 bp 3 -- 193953.8 to 225946.1 bp 2 -- 225946.1 to 257938.4 bp 1 -- 257938.4 to 289930.7 bp 1 -- 289930.7 to 321923.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 48m 1s.
Objects
Created Object Name Type Description
meta_large_hydraulic_fracture_well_2_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/meta_large_hydraulic_fracture_well_2_MEGAHIT.assembly Assembled into 3058 contigs. Avg Length: 9151.73577501635 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 2926 -- 2000.0 to 37619.6 bp 88 -- 37619.6 to 73239.2 bp 26 -- 73239.2 to 108858.79999999999 bp 8 -- 108858.79999999999 to 144478.4 bp 3 -- 144478.4 to 180098.0 bp 4 -- 180098.0 to 215717.59999999998 bp 1 -- 215717.59999999998 to 251337.19999999998 bp 1 -- 251337.19999999998 to 286956.8 bp 0 -- 286956.8 to 322576.39999999997 bp 1 -- 322576.39999999997 to 358196.0 bp
Links
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 9h 26m 44s.
Objects
Created Object Name Type Description
hydraulic_fracture_well_SPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/hydraulic_fracture_well_SPAdes.contigs Assembled into 2743 contigs. Avg Length: 9518.420342690484 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 2599 -- 2000.0 to 34041.2 bp 97 -- 34041.2 to 66082.4 bp 21 -- 66082.4 to 98123.6 bp 8 -- 98123.6 to 130164.8 bp 7 -- 130164.8 to 162206.0 bp 3 -- 162206.0 to 194247.2 bp 1 -- 194247.2 to 226288.4 bp 3 -- 226288.4 to 258329.6 bp 2 -- 258329.6 to 290370.8 bp 2 -- 290370.8 to 322412.0 bp
Links

Sample 3. "Subsurface gas well"

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 8h 43m 33s.
Objects
Created Object Name Type Description
subsurface_gas_well_IDBA.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/subsurface_gas_well_IDBA.contigs Assembled into 16294 contigs. Avg Length: 7065.12796121 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 16097 -- 2000.0 to 49012.4 bp 140 -- 49012.4 to 96024.8 bp 28 -- 96024.8 to 143037.2 bp 12 -- 143037.2 to 190049.6 bp 10 -- 190049.6 to 237062.0 bp 4 -- 237062.0 to 284074.4 bp 1 -- 284074.4 to 331086.8 bp 0 -- 331086.8 to 378099.2 bp 1 -- 378099.2 to 425111.6 bp 1 -- 425111.6 to 472124.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 2h 21m 34s.
Objects
Created Object Name Type Description
subsurface_gas_well_meta_sensitive_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/subsurface_gas_well_meta_sensitive_MEGAHIT.assembly Assembled into 23848 contigs. Avg Length: 6827.34728279101 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 23730 -- 2000.0 to 82893.0 bp 98 -- 82893.0 to 163786.0 bp 13 -- 163786.0 to 244679.0 bp 5 -- 244679.0 to 325572.0 bp 0 -- 325572.0 to 406465.0 bp 0 -- 406465.0 to 487358.0 bp 0 -- 487358.0 to 568251.0 bp 0 -- 568251.0 to 649144.0 bp 1 -- 649144.0 to 730037.0 bp 1 -- 730037.0 to 810930.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 4h 11m 40s.
Objects
Created Object Name Type Description
subsurface_gas_well_reads_meta_large_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/subsurface_gas_well_reads_meta_large_MEGAHIT.assembly Assembled into 24307 contigs. Avg Length: 6503.436376352491 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 24196 -- 2000.0 to 82892.8 bp 90 -- 82892.8 to 163785.6 bp 14 -- 163785.6 to 244678.40000000002 bp 5 -- 244678.40000000002 to 325571.2 bp 0 -- 325571.2 to 406464.0 bp 0 -- 406464.0 to 487356.80000000005 bp 0 -- 487356.80000000005 to 568249.6 bp 0 -- 568249.6 to 649142.4 bp 1 -- 649142.4 to 730035.2000000001 bp 1 -- 730035.2000000001 to 810928.0 bp
Links
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 11h 11m 15s.
Objects
Created Object Name Type Description
subsurface_gas_well_SPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/subsurface_gas_well_SPAdes.contigs Assembled into 19896 contigs. Avg Length: 7077.755629272216 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 19860 -- 2000.0 to 155708.3 bp 30 -- 155708.3 to 309416.6 bp 3 -- 309416.6 to 463124.89999999997 bp 2 -- 463124.89999999997 to 616833.2 bp 0 -- 616833.2 to 770541.5 bp 0 -- 770541.5 to 924249.7999999999 bp 0 -- 924249.7999999999 to 1077958.0999999999 bp 0 -- 1077958.0999999999 to 1231666.4 bp 0 -- 1231666.4 to 1385374.7 bp 1 -- 1385374.7 to 1539083.0 bp
Links

Q6 How long did it take (including the queue time) to assemble this reads dataset with metaSPAdes?

Sample 4. "Rice Root Iron Plaque Reads"

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 1d 9h 10m 17s.
Objects
Created Object Name Type Description
rice_root_iron_plaque_IDBA.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/rice_root_iron_plaque_IDBA.contigs Assembled into 13485 contigs. Avg Length: 7405.00786059 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 13270 -- 2000.0 to 55909.2 bp 125 -- 55909.2 to 109818.4 bp 42 -- 109818.4 to 163727.6 bp 20 -- 163727.6 to 217636.8 bp 12 -- 217636.8 to 271546.0 bp 6 -- 271546.0 to 325455.2 bp 1 -- 325455.2 to 379364.4 bp 3 -- 379364.4 to 433273.6 bp 4 -- 433273.6 to 487182.8 bp 2 -- 487182.8 to 541092.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 5h 24m 0s.
Objects
Created Object Name Type Description
rice_root_iron_plaque_meta_sensitive_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/rice_root_iron_plaque_meta_sensitive_MEGAHIT.assembly Assembled into 20599 contigs. Avg Length: 7126.803194329822 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 20451 -- 2000.0 to 84726.5 bp 81 -- 84726.5 to 167453.0 bp 30 -- 167453.0 to 250179.5 bp 13 -- 250179.5 to 332906.0 bp 12 -- 332906.0 to 415632.5 bp 5 -- 415632.5 to 498359.0 bp 2 -- 498359.0 to 581085.5 bp 2 -- 581085.5 to 663812.0 bp 2 -- 663812.0 to 746538.5 bp 1 -- 746538.5 to 829265.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 4h 13m 16s.
Objects
Created Object Name Type Description
rice_root_iron_plaque_meta_large_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/rice_root_iron_plaque_meta_large_MEGAHIT.assembly Assembled into 20919 contigs. Avg Length: 6835.660834647928 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 20783 -- 2000.0 to 85355.1 bp 70 -- 85355.1 to 168710.2 bp 30 -- 168710.2 to 252065.30000000002 bp 10 -- 252065.30000000002 to 335420.4 bp 9 -- 335420.4 to 418775.5 bp 8 -- 418775.5 to 502130.60000000003 bp 3 -- 502130.60000000003 to 585485.7000000001 bp 3 -- 585485.7000000001 to 668840.8 bp 2 -- 668840.8 to 752195.9 bp 1 -- 752195.9 to 835551.0 bp
Links
Assemble metagenomic reads using the SPAdes assembler.
This app produced errors.
No output found.

Sample 5. "Peat Soil"

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.
This app completed without errors in 11h 59m 1s.
Objects
Created Object Name Type Description
Peat_soil_IDBA.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/Peat_soil_IDBA.contigs Assembled into 6303 contigs. Avg Length: 5156.55005553 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 5846 -- 2003.0 to 11337.0 bp 317 -- 11337.0 to 20671.0 bp 86 -- 20671.0 to 30005.0 bp 41 -- 30005.0 to 39339.0 bp 4 -- 39339.0 to 48673.0 bp 7 -- 48673.0 to 58007.0 bp 1 -- 58007.0 to 67341.0 bp 0 -- 67341.0 to 76675.0 bp 0 -- 76675.0 to 86009.0 bp 1 -- 86009.0 to 95343.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 10h 2m 21s.
Objects
Created Object Name Type Description
peat_soil_meta_sensitive_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/peat_soil_meta_sensitive_MEGAHIT.assembly Assembled into 11580 contigs. Avg Length: 4281.373143350605 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 11221 -- 2000.0 to 13641.1 bp 258 -- 13641.1 to 25282.2 bp 58 -- 25282.2 to 36923.3 bp 25 -- 36923.3 to 48564.4 bp 7 -- 48564.4 to 60205.5 bp 3 -- 60205.5 to 71846.6 bp 3 -- 71846.6 to 83487.7 bp 3 -- 83487.7 to 95128.8 bp 1 -- 95128.8 to 106769.90000000001 bp 1 -- 106769.90000000001 to 118411.0 bp
Links
Assemble metagenomic reads using the MEGAHIT assembler.
This app completed without errors in 10h 0m 11s.
Objects
Created Object Name Type Description
peat_soil_meta_large_MEGAHIT.assembly Assembly Assembled contigs
Summary
ContigSet saved to: rkrebs:narrative_1594573720077/peat_soil_meta_large_MEGAHIT.assembly Assembled into 10906 contigs. Avg Length: 4173.506968641115 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 10357 -- 2000.0 to 10402.1 bp 394 -- 10402.1 to 18804.2 bp 87 -- 18804.2 to 27206.300000000003 bp 39 -- 27206.300000000003 to 35608.4 bp 12 -- 35608.4 to 44010.5 bp 7 -- 44010.5 to 52412.600000000006 bp 2 -- 52412.600000000006 to 60814.700000000004 bp 3 -- 60814.700000000004 to 69216.8 bp 4 -- 69216.8 to 77618.90000000001 bp 1 -- 77618.90000000001 to 86021.0 bp
Links
Assemble metagenomic reads using the SPAdes assembler.
This app completed without errors in 3d 2h 53m 21s.
Objects
Created Object Name Type Description
peat_soil_SPAdes.contigs Assembly Assembled contigs
Summary
Assembly saved to: rkrebs:narrative_1594573720077/peat_soil_SPAdes.contigs Assembled into 8127 contigs. Avg Length: 3919.6416881998275 bp. Contig Length Distribution (# of contigs -- min to max basepairs): 8007 -- 2000.0 to 16499.3 bp 96 -- 16499.3 to 30998.6 bp 14 -- 30998.6 to 45497.899999999994 bp 5 -- 45497.899999999994 to 59997.2 bp 2 -- 59997.2 to 74496.5 bp 1 -- 74496.5 to 88995.79999999999 bp 0 -- 88995.79999999999 to 103495.09999999999 bp 1 -- 103495.09999999999 to 117994.4 bp 0 -- 117994.4 to 132493.7 bp 1 -- 132493.7 to 146993.0 bp
Links

Step 5. Compare assemblies

Success! You've created 4 different assemblies from your metagenomic data. Now you need to pick the best one for sorting all those contigs into genomes. That step, sorting the contigs into different genome bins, is called binning. Each bin will represent a single genome, but more on that later.

App: Compare Assembled Contig Distributions

Timing: 5-10 mins.

View Configure:

All you need to add here are your different assemblies from above. For example, in the first run I'll compare subsurface_gold_mine_meta_sensitive_MEGAHIT.assembly, subsurface_gold_mine_meta_large_MEGAHIT.assembly, Subsurface_gold_mine_IDBA.contigs, and subsurface_gold_mine_SPAdes.contigs to determine which is the best to use moving forward.

The Results: A report showing the different statistics from each assembly. Some key features of this report include:

  1. N50: the length of the shortest contig in the smallest set of longest contigs that together contain at least half of the bases in the assembly.
  2. L50: the smallest number of contigs whose combined length makes up at least half of the assembly.
  3. The length of the longest contig.
  4. Histograms showing the distribution of contig lengths.
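
To make the N50 and L50 definitions concrete, here is a minimal sketch that computes both from a list of contig lengths. The function name `n50_l50` and the toy lengths are hypothetical; this is not how the KBase app itself is implemented.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    # Walk through the contigs from longest to shortest.
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50, 'count' is the L50.
            return length, count


# Hypothetical toy assembly: five contigs totalling 100 bp.
toy = [40, 25, 15, 10, 10]
n50, l50 = n50_l50(toy)
print(n50, l50)
```

In the toy example the two longest contigs (40 + 25 = 65 bp) are the first to cover half of the 100 bp total, so the N50 is 25 and the L50 is 2.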

A good assembly will have:

  • A longer largest contig.
  • More contigs over 100,000 bp long.
  • More contigs over 1,000,000 bp long.
  • A high N50.
  • A low L50.

A bad assembly will have:

  • A short longest contig.
  • Few contigs over 100,000 bp long.
  • No contigs over 1,000,000 bp long.
  • A low N50.
  • A high L50.
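
One possible way to line these criteria up is to sort the candidate assemblies on several statistics at once. The sketch below is only an illustration: the `AssemblyStats` type, the `rank_assemblies` function, the tie-breaking order and all of the numbers are our own assumptions, not something the app prescribes.

```python
from typing import NamedTuple


class AssemblyStats(NamedTuple):
    name: str
    longest_contig: int     # bp
    n50: int                # bp
    l50: int                # number of contigs
    contigs_over_100k: int  # contigs >= 100,000 bp


def rank_assemblies(assemblies):
    """Sort best-first: high N50, then low L50, then more long contigs,
    then a longer longest contig (an assumed priority order)."""
    return sorted(
        assemblies,
        key=lambda a: (-a.n50, a.l50, -a.contigs_over_100k, -a.longest_contig),
    )


# Hypothetical numbers for illustration only.
candidates = [
    AssemblyStats("assembler_A", 300_000, 12_000, 900, 20),
    AssemblyStats("assembler_B", 450_000, 15_000, 700, 35),
    AssemblyStats("assembler_C", 450_000, 15_000, 750, 40),
]
for a in rank_assemblies(candidates):
    print(a.name, a.n50, a.l50)
```

In real data the criteria often disagree with each other, so treat a ranking like this as a starting point for your own judgment rather than a final answer.
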
View distributions of contig characteristics for different assemblies.
This app completed without errors in 5m 14s.
Summary
ASSEMBLY STATS for subsurface_gold_mine_meta_sensitive_MEGAHIT.assembly
Len longest contig: 854005 bp
N50 (L50): 18619 (1502) N75 (L75): 5455 (5245) N90 (L90): 2803 (10950)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 120 Num contigs >= 10000 bp: 2873 Num contigs >= 1000 bp: 17096 Num contigs >= 500 bp: 17096 Num contigs >= 1 bp: 17096
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 20040138 bp Len contigs >= 10000 bp: 90528043 bp Len contigs >= 1000 bp: 143799281 bp Len contigs >= 500 bp: 143799281 bp Len contigs >= 1 bp: 143799281 bp
ASSEMBLY STATS for subsurface_gold_mine_SPAdes.contigs
Len longest contig: 897495 bp
N50 (L50): 16418 (1778) N75 (L75): 4894 (6469) N90 (L90): 2722 (13128)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 126 Num contigs >= 10000 bp: 3063 Num contigs >= 1000 bp: 19935 Num contigs >= 500 bp: 19935 Num contigs >= 1 bp: 19935
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 22244252 bp Len contigs >= 10000 bp: 95194833 bp Len contigs >= 1000 bp: 157790752 bp Len contigs >= 500 bp: 157790752 bp Len contigs >= 1 bp: 157790752 bp
ASSEMBLY STATS for Subsurface_gold_mine_IDBA.contigs
Len longest contig: 493295 bp
N50 (L50): 14540 (1548) N75 (L75): 5750 (4634) N90 (L90): 3143 (8547)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 56 Num contigs >= 10000 bp: 2450 Num contigs >= 1000 bp: 12820 Num contigs >= 500 bp: 12820 Num contigs >= 1 bp: 12820
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 8538521 bp Len contigs >= 10000 bp: 65468457 bp Len contigs >= 1000 bp: 109139218 bp Len contigs >= 500 bp: 109139218 bp Len contigs >= 1 bp: 109139218 bp
ASSEMBLY STATS for subsurface_gold_mine_meta_large_MEGAHIT.assembly
Len longest contig: 854001 bp
N50 (L50): 19131 (1402) N75 (L75): 5629 (4898) N90 (L90): 2841 (10259)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 116 Num contigs >= 10000 bp: 2779 Num contigs >= 1000 bp: 16132 Num contigs >= 500 bp: 16132 Num contigs >= 1 bp: 16132
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 19988337 bp Len contigs >= 10000 bp: 87896317 bp Len contigs >= 1000 bp: 138168079 bp Len contigs >= 500 bp: 138168079 bp Len contigs >= 1 bp: 138168079 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

Q7a Which assembly would you use for binning contigs from this sample? Why did you pick that assembly?

Q7b Which is the worst choice for binning contigs? Why is it the worst?

View distributions of contig characteristics for different assemblies.
This app completed without errors in 4m 0s.
Summary
ASSEMBLY STATS for meta_large_hydraulic_fracture_well_2_MEGAHIT.assembly
Len longest contig: 358196 bp
N50 (L50): 20548 (272) N75 (L75): 6236 (901) N90 (L90): 3022 (1901)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 23 Num contigs >= 10000 bp: 589 Num contigs >= 1000 bp: 3058 Num contigs >= 500 bp: 3058 Num contigs >= 1 bp: 3058
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 3683273 bp Len contigs >= 10000 bp: 18547024 bp Len contigs >= 1000 bp: 27986008 bp Len contigs >= 500 bp: 27986008 bp Len contigs >= 1 bp: 27986008 bp
ASSEMBLY STATS for meta_sensitive_hydraulic_well_2_MEGAHIT.assembly
Len longest contig: 321923 bp
N50 (L50): 20609 (274) N75 (L75): 6285 (884) N90 (L90): 3030 (1865)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 22 Num contigs >= 10000 bp: 586 Num contigs >= 1000 bp: 3001 Num contigs >= 500 bp: 3001 Num contigs >= 1 bp: 3001
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 3744231 bp Len contigs >= 10000 bp: 18256390 bp Len contigs >= 1000 bp: 27502591 bp Len contigs >= 500 bp: 27502591 bp Len contigs >= 1 bp: 27502591 bp
ASSEMBLY STATS for hydraulic_fracture_well_SPAdes.contigs
Len longest contig: 322412 bp
N50 (L50): 21964 (223) N75 (L75): 6611 (776) N90 (L90): 3075 (1673)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 24 Num contigs >= 10000 bp: 553 Num contigs >= 1000 bp: 2743 Num contigs >= 500 bp: 2743 Num contigs >= 1 bp: 2743
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 4394425 bp Len contigs >= 10000 bp: 17799567 bp Len contigs >= 1000 bp: 26109027 bp Len contigs >= 500 bp: 26109027 bp Len contigs >= 1 bp: 26109027 bp
ASSEMBLY STATS for hydraulic_fracture_well_IDBA.contigs
Len longest contig: 395674 bp
N50 (L50): 19568 (253) N75 (L75): 6682 (801) N90 (L90): 3402 (1576)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 16 Num contigs >= 10000 bp: 549 Num contigs >= 1000 bp: 2503 Num contigs >= 500 bp: 2503 Num contigs >= 1 bp: 2503
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 2697737 bp Len contigs >= 10000 bp: 16193696 bp Len contigs >= 1000 bp: 24334351 bp Len contigs >= 500 bp: 24334351 bp Len contigs >= 1 bp: 24334351 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

Q8 Which assembly should I use to bin the contigs from the hydraulic fracture well fluid sample? Why?

View distributions of contig characteristics for different assemblies.
This app completed without errors in 5m 20s.
Summary
ASSEMBLY STATS for subsurface_gas_well_SPAdes.contigs
Len longest contig: 1539083 bp
N50 (L50): 11636 (2142) N75 (L75): 4310 (7435) N90 (L90): 2664 (13775)
Num contigs >= 1000000 bp: 1 Num contigs >= 100000 bp: 96 Num contigs >= 10000 bp: 2576 Num contigs >= 1000 bp: 19896 Num contigs >= 500 bp: 19896 Num contigs >= 1 bp: 19896
Len contigs >= 1000000 bp: 1539083 bp Len contigs >= 100000 bp: 17226601 bp Len contigs >= 10000 bp: 75078548 bp Len contigs >= 1000 bp: 140819026 bp Len contigs >= 500 bp: 140819026 bp Len contigs >= 1 bp: 140819026 bp
ASSEMBLY STATS for subsurface_gas_well_reads_meta_large_MEGAHIT.assembly
Len longest contig: 810928 bp
N50 (L50): 9490 (3125) N75 (L75): 3975 (9860) N90 (L90): 2597 (17353)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 82 Num contigs >= 10000 bp: 2894 Num contigs >= 1000 bp: 24307 Num contigs >= 500 bp: 24307 Num contigs >= 1 bp: 24307
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 13187275 bp Len contigs >= 10000 bp: 76789557 bp Len contigs >= 1000 bp: 158079028 bp Len contigs >= 500 bp: 158079028 bp Len contigs >= 1 bp: 158079028 bp
ASSEMBLY STATS for rice_root_iron_plaque_meta_sensitive_MEGAHIT.assembly
Len longest contig: 829265 bp
N50 (L50): 11169 (2036) N75 (L75): 4302 (7686) N90 (L90): 2698 (14240)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 119 Num contigs >= 10000 bp: 2350 Num contigs >= 1000 bp: 20599 Num contigs >= 500 bp: 20599 Num contigs >= 1 bp: 20599
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 28087430 bp Len contigs >= 10000 bp: 76722780 bp Len contigs >= 1000 bp: 146805019 bp Len contigs >= 500 bp: 146805019 bp Len contigs >= 1 bp: 146805019 bp
ASSEMBLY STATS for subsurface_gas_well_IDBA.contigs
Len longest contig: 472124 bp
N50 (L50): 9971 (2321) N75 (L75): 4561 (6715) N90 (L90): 2849 (11545)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 53 Num contigs >= 10000 bp: 2305 Num contigs >= 1000 bp: 16294 Num contigs >= 500 bp: 16294 Num contigs >= 1 bp: 16294
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 9139990 bp Len contigs >= 10000 bp: 57400959 bp Len contigs >= 1000 bp: 115119195 bp Len contigs >= 500 bp: 115119195 bp Len contigs >= 1 bp: 115119195 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

Q9a Which assembly is the best for binning the contigs from the subsurface gas well sample? Is there more than one option? If so, which one would you use and why?

Q9b Which assembly can you rule out?

View distributions of contig characteristics for different assemblies.
This app completed without errors in 4m 32s.
Summary
ASSEMBLY STATS for rice_root_iron_plaque_IDBA.contigs
Len longest contig: 541092 bp
N50 (L50): 11386 (1236) N75 (L75): 4438 (5014) N90 (L90): 2792 (9325)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 105 Num contigs >= 10000 bp: 1487 Num contigs >= 1000 bp: 13485 Num contigs >= 500 bp: 13485 Num contigs >= 1 bp: 13485
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 19861856 bp Len contigs >= 10000 bp: 52604936 bp Len contigs >= 1000 bp: 99856531 bp Len contigs >= 500 bp: 99856531 bp Len contigs >= 1 bp: 99856531 bp
ASSEMBLY STATS for rice_root_iron_plaque_SPAdes.contigs
Len longest contig: 1363136 bp
N50 (L50): 13619 (1307) N75 (L75): 4471 (5908) N90 (L90): 2734 (11623)
Num contigs >= 1000000 bp: 2 Num contigs >= 100000 bp: 134 Num contigs >= 10000 bp: 1966 Num contigs >= 1000 bp: 17265 Num contigs >= 500 bp: 17265 Num contigs >= 1 bp: 17265
Len contigs >= 1000000 bp: 2605423 bp Len contigs >= 100000 bp: 32453930 bp Len contigs >= 10000 bp: 73147667 bp Len contigs >= 1000 bp: 131025942 bp Len contigs >= 500 bp: 131025942 bp Len contigs >= 1 bp: 131025942 bp
ASSEMBLY STATS for rice_root_iron_plaque_meta_sensitive_MEGAHIT.assembly
Len longest contig: 829265 bp
N50 (L50): 11169 (2036) N75 (L75): 4302 (7686) N90 (L90): 2698 (14240)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 119 Num contigs >= 10000 bp: 2350 Num contigs >= 1000 bp: 20599 Num contigs >= 500 bp: 20599 Num contigs >= 1 bp: 20599
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 28087430 bp Len contigs >= 10000 bp: 76722780 bp Len contigs >= 1000 bp: 146805019 bp Len contigs >= 500 bp: 146805019 bp Len contigs >= 1 bp: 146805019 bp
ASSEMBLY STATS for rice_root_iron_plaque_meta_large_MEGAHIT.assembly
Len longest contig: 835551 bp
N50 (L50): 10022 (2176) N75 (L75): 4104 (8070) N90 (L90): 2636 (14664)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 108 Num contigs >= 10000 bp: 2183 Num contigs >= 1000 bp: 20919 Num contigs >= 500 bp: 20919 Num contigs >= 1 bp: 20919
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 27242431 bp Len contigs >= 10000 bp: 71573152 bp Len contigs >= 1000 bp: 142995189 bp Len contigs >= 500 bp: 142995189 bp Len contigs >= 1 bp: 142995189 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

Q10 Which assembly should I use to bin the contigs from the rice root iron plaque sample? Why?

View distributions of contig characteristics for different assemblies.
This app completed without errors in 4m 5s.
Summary
ASSEMBLY STATS for peat_soil_SPAdes.contigs
Len longest contig: 146993 bp
N50 (L50): 3929 (2094) N75 (L75): 2629 (4618) N90 (L90): 2208 (6609)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 2 Num contigs >= 10000 bp: 320 Num contigs >= 1000 bp: 8127 Num contigs >= 500 bp: 8127 Num contigs >= 1 bp: 8127
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 256179 bp Len contigs >= 10000 bp: 5864379 bp Len contigs >= 1000 bp: 31854928 bp Len contigs >= 500 bp: 31854928 bp Len contigs >= 1 bp: 31854928 bp
ASSEMBLY STATS for Peat_soil_IDBA.contigs
Len longest contig: 95343 bp
N50 (L50): 6006 (1377) N75 (L75): 3394 (3218) N90 (L90): 2529 (4889)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 0 Num contigs >= 10000 bp: 557 Num contigs >= 1000 bp: 6303 Num contigs >= 500 bp: 6303 Num contigs >= 1 bp: 6303
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 0 bp Len contigs >= 10000 bp: 9994226 bp Len contigs >= 1000 bp: 32501735 bp Len contigs >= 500 bp: 32501735 bp Len contigs >= 1 bp: 32501735 bp
ASSEMBLY STATS for peat_soil_meta_large_MEGAHIT.assembly
Len longest contig: 86021 bp
N50 (L50): 4312 (2607) N75 (L75): 2749 (5980) N90 (L90): 2232 (8750)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 0 Num contigs >= 10000 bp: 592 Num contigs >= 1000 bp: 10906 Num contigs >= 500 bp: 10906 Num contigs >= 1 bp: 10906
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 0 bp Len contigs >= 10000 bp: 10418064 bp Len contigs >= 1000 bp: 45516267 bp Len contigs >= 500 bp: 45516267 bp Len contigs >= 1 bp: 45516267 bp
ASSEMBLY STATS for peat_soil_meta_sensitive_MEGAHIT.assembly
Len longest contig: 118411 bp
N50 (L50): 4528 (2705) N75 (L75): 2796 (6254) N90 (L90): 2249 (9238)
Num contigs >= 1000000 bp: 0 Num contigs >= 100000 bp: 2 Num contigs >= 10000 bp: 639 Num contigs >= 1000 bp: 11580 Num contigs >= 500 bp: 11580 Num contigs >= 1 bp: 11580
Len contigs >= 1000000 bp: 0 bp Len contigs >= 100000 bp: 219122 bp Len contigs >= 10000 bp: 11798301 bp Len contigs >= 1000 bp: 49578301 bp Len contigs >= 500 bp: 49578301 bp Len contigs >= 1 bp: 49578301 bp
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • key_plot.png
  • key_plot.pdf
  • cumulative_len_plot.png
  • cumulative_len_plot.pdf
  • sorted_contig_lengths.png
  • sorted_contig_lengths.pdf
  • histogram_figures.zip

Q11 Which assembly should I use to bin the contigs from the peat soil sample? Why?

Step 6. Bin your Contigs

Alright, you've picked the best assembly. Now you'll sort all the contigs into bins that each represent a single genome. This step is called binning the contigs.

App: Bin Contigs using MaxBin2

Timing: 2+ hours depending on the number of contigs and bins present in your sample.

View Configure:

Assembly Object: Put your best assembly here.

Read Library: This is the library the contigs were generated from. If you needed to trim your read library, use the trimmed reads.

Probability Threshold: The confidence the algorithm must have for a contig to be placed within a bin. If a contig falls below this cutoff, it will be left unclassified. The default is 0.8.

Marker Set: MaxBin2 can bin both bacterial and archaeal genomes. In this case we're only looking at bacteria, so keep it set to the bacterial marker gene set.

Minimum contig length: Any contigs shorter than this will be ignored when binning. The default is 1000, but above we set our minimum contig length to 2000. You can either increase this to 2000 or leave it as is, since we shouldn't have any contigs shorter than 2000 bases.
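
To illustrate what a minimum contig length cutoff does, here is a minimal sketch that drops short contigs before any binning would happen. It is not MaxBin2's own code: the function name `filter_contigs` and the example contigs are hypothetical.

```python
def filter_contigs(contigs, min_length=1000):
    """Keep only contigs at least min_length bases long.

    contigs: dict mapping contig ID -> nucleotide sequence.
    """
    return {cid: seq for cid, seq in contigs.items() if len(seq) >= min_length}


# Hypothetical contigs; real ones would be read from a FASTA file.
contigs = {
    "contig_1": "A" * 2500,
    "contig_2": "G" * 1500,
    "contig_3": "C" * 800,
}
print(sorted(filter_contigs(contigs)))                   # default cutoff of 1000
print(sorted(filter_contigs(contigs, min_length=2000)))  # stricter cutoff
```

If every contig in the assembly is already at least 2000 bases long, both cutoffs keep the same set of contigs, which is why either setting works here.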

Results: The output from this app opens in a new section. The first tab lists the number of bins (and therefore the maximum number of genomes) and the number of nucleotides included in all the contigs. The second tab offers some detail about the different bins, including marker completeness, GC content, the number of contigs in each bin and their total length. To see information about the individual contigs in a bin, click the bulleted list icon for that bin or the graph beside it. However, these results tell you nothing about the quality of the bins: they could be highly contaminated or contain multiple copies of the same set of genes.
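
GC content is simply the percentage of bases in a sequence that are G or C. The sketch below shows one way to compute it for a single sequence and for a whole bin. The function names and the example sequences are hypothetical, and the viewer may calculate its value differently.

```python
def gc_content(sequence: str) -> float:
    """Return percent G+C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


def bin_gc_content(contig_sequences) -> float:
    """Length-weighted GC percent across all contigs in one bin."""
    return gc_content("".join(contig_sequences))


# Hypothetical two-contig bin.
example_bin = ["GGCCGGCCAT", "GCGCATATGC"]
print(round(bin_gc_content(example_bin), 1))
```

Joining the contigs before counting means that longer contigs contribute more to the bin's GC value than shorter ones.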

Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 1h 12m 17s.
Objects
Created Object Name Type Description
subsurface_gold_mine_meta_sensitive_bins BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • maxbin_result.zip - File(s) generated by MaxBin2 App
Output from Bin Contigs using MaxBin2 - v2.2.4
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/67335

Q12 Delftia acidovorans has a GC content of 66.7%. Based on GC content alone, do any of these bins match D. acidovorans?

Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 1h 15m 57s.
Objects
Created Object Name Type Description
hydraulic_fracture_well_fluid_meta_large_bins BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • maxbin_result.zip - File(s) generated by MaxBin2 App
Output from Bin Contigs using MaxBin2 - v2.2.4
The viewer for the output created by this App is available at the original Narrative here: https://narrative.kbase.us/narrative/67335

Q13 What is unusual about the bins from the hydraulic fracture well fluid sample?

Hint: look at the bins tab.

Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 1h 21m 6s.
Objects
Created Object Name Type Description
Subsurface_gas_well_spade_bins BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • maxbin_result.zip - File(s) generated by MaxBin2 App
Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 1h 24m 27s.
Objects
Created Object Name Type Description
Rice_root_iron_plaque_SPAdes_bins BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • maxbin_result.zip - File(s) generated by MaxBin2 App
Group assembled metagenomic contigs into lineages (Bins) using depth-of-coverage, nucleotide composition, and marker genes.
This app completed without errors in 1h 25m 21s.
Objects
Created Object Name Type Description
Peat_soil_meta_sens_contigs BinnedContigs BinnedContigs from MaxBin2
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • maxbin_result.zip - File(s) generated by MaxBin2 App

Step 7. Check Bin Quality

You should now have a bunch of different bins. Each bin represents a single genome (in theory). In this step we'll check these bins for their completeness, contamination and any duplicates using CheckM.

App: Assess Genome Quality with CheckM

Timing: Depends on the number of bins in your sample and the reference tree you pick

View Configure:

Input Assembly, Genome or BinnedContigs: Add in your set of bins from the last step here.

Reference Tree: You can either select the full tree or reduced tree to compare your bins to. The full tree takes longer, but is recommended for a better understanding of what each bin represents. However, if you're tight on time the reduced tree is fine since we'll be generating a species tree to determine close relatives of our assemblies later.

Save all Plots: Choosing Save lets you download a .zip file of the resulting genome quality plots. Choosing Don't save skips them.

Results: CheckM will give you two forms of the same report, a graphic version and a table. I think the table is easier to understand, so that's what I'll be covering here. The first column shows the bin name. They're all just numbered bins at this point, but you can rename them later if you want. The second column shows the lineage of the marker set used for that bin. Some will be more specific than others, depending on the bin, its completeness and contamination. Number of genomes is the number of reference genomes used to create the marker set, and number of markers is the number of marker genes in that set. These markers are expected to occur exactly once in a genome, so extra copies indicate contamination. The columns 0 through 5+ count how many markers were found that many times in the bin: 0 means the marker is missing, 1 is the expected single copy, and 2 or more means extra copies, which are used to calculate contamination. Be aware that contamination is an underestimate in this app.
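If it helps to see how the 0 through 5+ columns relate to the two percentages, here is a deliberately simplified Python sketch. It assumes every marker carries equal weight, which is not what CheckM actually does (CheckM groups markers into sets before scoring), so the numbers will not match the app exactly. The function name and the counts are made up for illustration.

```python
def estimate_quality(copy_counts):
    """Estimate (completeness %, contamination %) from marker copy counts.

    copy_counts maps "copies found" (0, 1, 2, ...) to the number of markers
    found that many times. Simplifying assumption: all markers weigh the same,
    and the 5+ column is treated as exactly 5 copies.
    """
    total_markers = sum(copy_counts.values())
    if total_markers == 0:
        raise ValueError("no markers in the set")
    # A marker counts toward completeness if it was found at least once.
    found = sum(n for copies, n in copy_counts.items() if copies >= 1)
    # Every copy beyond the first counts toward contamination.
    extra = sum((copies - 1) * n for copies, n in copy_counts.items() if copies >= 2)
    return 100.0 * found / total_markers, 100.0 * extra / total_markers


# Made-up bin: 10 markers missing, 85 single copy, 4 found twice, 1 found three times.
completeness, contamination = estimate_quality({0: 10, 1: 85, 2: 4, 3: 1})
print(f"completeness {completeness:.1f}%, contamination {contamination:.1f}%")
```

The point of the sketch is the direction of the effect: markers in column 0 pull completeness down, and markers in columns 2 and above push contamination up.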

The last two columns indicate the completeness and contamination of your genome as percentages. High quality genomes are over 90% complete with less than 5% contamination. However, since we're just looking to identify whether Delftia is present, I'm using any genomes over 75% complete with less than 5% contamination. If an assembly falls outside this range but looks promising, you can keep it, but be sure to note that it's a low quality assembly.

Write out a list of all the bins you want to keep; it will be useful in the next step.
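If you download the CheckM summary table, you could build that list with a few lines of Python instead of by eye. This is only a sketch: the column names "Bin Name", "Completeness" and "Contamination" are assumptions about the TSV header, so check them against your own file, and the rows below are invented.

```python
import csv
import io


def select_bins(rows, min_completeness=75.0, max_contamination=5.0):
    """Return the names of bins over the completeness cutoff and under the contamination cutoff."""
    keep = []
    for row in rows:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            keep.append(row["Bin Name"])
    return keep


# Invented stand-in for the contents of a CheckM summary TSV.
example_tsv = (
    "Bin Name\tCompleteness\tContamination\n"
    "Bin.001\t95.2\t1.1\n"
    "Bin.002\t80.0\t4.9\n"
    "Bin.003\t76.0\t7.5\n"
    "Bin.004\t60.0\t0.5\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
print(select_bins(rows))
```

Passing `min_completeness=90.0` instead would apply the stricter high quality cutoff described above.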

Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 59m 0s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM

Q14a Which bins would you select to keep working with?

Q14b Why did you pick those bins?

Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 17m 7s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM

Q15 Are any of these bins high enough quality to keep working with? If so, which one(s)?

Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 35m 5s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM
Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes. Creates a new BinnedContigs object with High Quality bins that pass user-defined thresholds for Completeness and Contamination.
This app completed without errors in 36m 40s.
Objects
Created Object Name Type Description
subsurface_gas_well_spade_CheckM_HQ_bins.BinnedContigs BinnedContigs HQ BinnedContigs subsurface_gas_well_spade_CheckM_HQ_bins.BinnedContigs
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM
Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 49m 7s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM

Q16 This one looks more promising! Which bins would you choose to extract and are there any that look like Delftia?

Runs the CheckM lineage workflow to assess the genome quality of isolates, single cells, or genome bins from metagenome assemblies through comparison to an existing database of genomes.
This app completed without errors in 15m 46s.
Links
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • CheckM_summary_table.tsv.zip - TSV Summary Table from CheckM
  • full_output.zip - Full output of CheckM
  • plots.zip - Output plots from CheckM

Step 8. Extract Assemblies

Up above you checked the quality of all your bins and picked out the ones that were over 75% complete and less than 5% contaminated. Now you're going to separate them from the contaminated and incomplete bins so you only have to work with the good ones.

App: Extract Bins as Assemblies from BinnedContigs

Timing: Depends on the number of bins you're extracting, in general ~10 minutes or so

View Configure:

Binned Contigs: Select the binned contigs set you put into CheckM above. Once you add it the data will automatically fill into the lower Parameters section.

Bin Names Available for Extraction: There is a green plus on the right side of all your bins. Click it to select the ones you want to save as assemblies. They will appear in the lower table. Once you're done, double check that they're all there and that you got the right ones.

Assembly Name Suffix: This suffix is added to the end of each extracted bin's name. Make it descriptive so you can tell your samples apart, because every extracted bin name starts with Bin.###.fasta

AssemblySet Name: This will be the name of your assembly set that contains all your extracted bins. Again, it should be named something descriptive. For example, I named the first assembly set: subsurface_gold_mine_extracted_bins.AssemblySet

If you have just one bin to extract your results will just be an assembly, because an AssemblySet needs to include 2 or more assemblies. This won't produce an error message.
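The naming rules above can be sketched in a few lines of Python. This only mimics the behaviour described here and is not the app's own code; the function name is hypothetical. It assumes the suffix is appended exactly as typed, so it is worth starting the suffix with an underscore to keep the names readable.

```python
def extracted_object_names(bin_names, suffix, set_name):
    """Return (assembly names, assembly set name or None) for a list of bins to extract."""
    # Each assembly name is the bin name with the suffix appended as-is.
    assemblies = [f"{name}{suffix}" for name in bin_names]
    # An AssemblySet is only produced when two or more bins are extracted.
    assembly_set = set_name if len(assemblies) >= 2 else None
    return assemblies, assembly_set


names, set_name = extracted_object_names(
    ["Bin.009.fasta", "Bin.011.fasta"],
    "_subsurface_gold_mine_assembly",
    "subsurface_gold_mine_extracted_bins.AssemblySet",
)
print(names, set_name)

# A single bin yields one assembly and no set.
print(extracted_object_names(["Bin.004.fasta"], "_assembly", "unused.AssemblySet"))
```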

Results: Your results will include a table of the different assemblies and a note that the job finished successfully.

Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 7m 5s.
Objects
Created Object Name Type Description
subsurface_gold_mine_extracted_bins.AssemblySet AssemblySet Assembly set of extracted assemblies
Bin.009.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.011.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.013.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.015.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.017.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.019.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.020.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.022.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.033.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.034.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.035.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.036.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Bin.040.fasta_subsurface_gold_mine_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 67335/147/1, 67335/148/1, 67335/149/1, 67335/150/1, 67335/152/1, 67335/153/1, 67335/154/1, 67335/155/1, 67335/156/1, 67335/157/1, 67335/158/1, 67335/159/1, 67335/160/1 Generated Assembly Set: 67335/161/1
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 1m 25s.
Objects
Created Object Name Type Description
Bin.004.fasta_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 67335/112/1
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 2m 28s.
Objects
Created Object Name Type Description
subsurface_gas_well_spade_extracted_bins.AssemblySet AssemblySet Assembly set of extracted assemblies
Bin.003.fasta_assembly Assembly Assembly object of extracted contigs
Bin.005.fasta_assembly Assembly Assembly object of extracted contigs
Bin.013.fasta_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 67335/95/1, 67335/96/1, 67335/97/1 Generated Assembly Set: 67335/98/1
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 5m 43s.
Objects
Created Object Name Type Description
rice_root_iron_plaque_extracted_bins.AssemblySet AssemblySet Assembly set of extracted assemblies
Bin.001.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.002.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.005.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.008.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.009.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.010.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.012.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.014.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.015.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.016.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Bin.026.fasta_rice_root_plaque_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 67335/119/1, 67335/120/1, 67335/121/1, 67335/122/1, 67335/123/1, 67335/124/1, 67335/125/1, 67335/128/1, 67335/129/1, 67335/130/1, 67335/132/1 Generated Assembly Set: 67335/133/1
Extract a bin as an Assembly from a BinnedContig dataset
This app completed without errors in 2m 60s.
Objects
Created Object Name Type Description
peat_soil_extracted_bins.AssemblySet AssemblySet Assembly set of extracted assemblies
Bin.001.fastapeat_soil_assembly Assembly Assembly object of extracted contigs
Bin.007.fastapeat_soil_assembly Assembly Assembly object of extracted contigs
Bin.009.fastapeat_soil_assembly Assembly Assembly object of extracted contigs
Summary
Job Finished Generated Assembly Reference: 67335/201/1, 67335/202/1, 67335/203/1 Generated Assembly Set: 67335/204/1

Step 9. Annotate Your Assemblies

You now have a set of mostly-whole, mostly-uncontaminated genomes from your sample. Now you'll use RAST to identify genes in these assemblies.

App: Annotate Multiple Microbial Assemblies (RAST)

Timing: Depends on the number of assemblies and their size. I'd estimate it takes about 10 mins an assembly.

View Configure:

Assemblies/AssemblySets: Here is where you add the AssemblySet you generated in the last step. You could add your assemblies individually, but it's easier to add them all as one set.

Domain and Genetic Code: Both should be set for bacteria, since D. acidovorans is a bacterium.

Call Buttons: By default most of these are checked. For something like a genome assembly, it's good to grab more features than we need in case we need them for a future study.

Results: Your results from RAST are fairly simple. You'll get an object for each assembly and one set of all assemblies annotated together. The summary will give you a short description of what was annotated in each genome and whether the annotation was a success. Check through the summary to make sure none of the assemblies failed. Sometimes one will fail, but the app will still give you a success message. If it does fail, try annotating that assembly alone, with different settings, using the app Annotate Microbial Assembly.
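With many assemblies the summary gets long, so a failed one is easy to miss. If you copy the summary text out of the app, a small Python check like the one below can compare it against the list of assemblies you submitted. It assumes each successful assembly is reported on its own line as "<assembly name> succeeded!", which is how the summaries in this narrative look; the function name is hypothetical.

```python
import re


def check_annotation_summary(summary_text, expected_names):
    """Return (set of assemblies reported as succeeded, sorted list of expected ones not reported)."""
    succeeded = set(re.findall(r"^(\S+) succeeded!", summary_text, flags=re.MULTILINE))
    missing = sorted(set(expected_names) - succeeded)
    return succeeded, missing


# Shortened, partly invented summary: the second assembly never reports success.
summary = (
    "Overall, the genes have 1296 distinct functions.\n"
    "Bin.036.fasta_subsurface_gold_mine_assembly succeeded!\n"
    "\n"
    "Overall, the genes have 749 distinct functions.\n"
)
expected = [
    "Bin.036.fasta_subsurface_gold_mine_assembly",
    "Bin.011.fasta_subsurface_gold_mine_assembly",
]
succeeded, missing = check_annotation_summary(summary, expected)
print("missing:", missing)
```

Any name that shows up in the missing list is a candidate for annotating again on its own.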

Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 41m 11s.
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 252 contigs containing 3023957 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3829 new features were called, of which 305 are non-coding.
Output genome has the following feature types:
	Coding gene                     3524 
	Non-coding crispr_array            3 
	Non-coding crispr_repeat          22 
	Non-coding crispr_spacer          19 
	Non-coding repeat                214 
	Non-coding rna                    47 
Overall, the genes have 1296 distinct functions. 
The genes include 1474 genes with a SEED annotation ontology across 814 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.036.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 40 contigs containing 1396467 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2005 new features were called, of which 202 are non-coding.
Output genome has the following feature types:
	Coding gene                     1803 
	Non-coding crispr_array            3 
	Non-coding crispr_repeat          42 
	Non-coding crispr_spacer          41 
	Non-coding repeat                 65 
	Non-coding rna                    51 
Overall, the genes have 749 distinct functions. 
The genes include 798 genes with a SEED annotation ontology across 558 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.011.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 41 contigs containing 1401304 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 1470 new features were called, of which 91 are non-coding.
Output genome has the following feature types:
	Coding gene                     1379 
	Non-coding repeat                 44 
	Non-coding rna                    47 
Overall, the genes have 432 distinct functions. 
The genes include 594 genes with a SEED annotation ontology across 313 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.013.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 45 contigs containing 3965585 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3523 new features were called, of which 125 are non-coding.
Output genome has the following feature types:
	Coding gene                     3398 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat           6 
	Non-coding crispr_spacer           5 
	Non-coding repeat                 68 
	Non-coding rna                    45 
Overall, the genes have 1237 distinct functions. 
The genes include 1910 genes with a SEED annotation ontology across 774 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.022.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 241 contigs containing 3603719 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3603 new features were called, of which 413 are non-coding.
Output genome has the following feature types:
	Coding gene                     3190 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          13 
	Non-coding crispr_spacer          12 
	Non-coding repeat                351 
	Non-coding rna                    36 
Overall, the genes have 1073 distinct functions. 
The genes include 1723 genes with a SEED annotation ontology across 675 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.033.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 90 contigs containing 1414679 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3067 new features were called, of which 87 are non-coding.
Output genome has the following feature types:
	Coding gene                     2980 
	Non-coding repeat                 51 
	Non-coding rna                    36 
Overall, the genes have 140 distinct functions. 
The genes include 277 genes with a SEED annotation ontology across 103 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.034.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 84 contigs containing 3078190 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3001 new features were called, of which 120 are non-coding.
Output genome has the following feature types:
	Coding gene                     2881 
	Non-coding repeat                 75 
	Non-coding rna                    45 
Overall, the genes have 1088 distinct functions. 
The genes include 1490 genes with a SEED annotation ontology across 733 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.035.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 102 contigs containing 4104362 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4141 new features were called, of which 290 are non-coding.
Output genome has the following feature types:
	Coding gene                     3851 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat          49 
	Non-coding crispr_spacer          48 
	Non-coding repeat                135 
	Non-coding rna                    57 
Overall, the genes have 1484 distinct functions. 
The genes include 1918 genes with a SEED annotation ontology across 875 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.009.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 6 contigs containing 574605 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 687 new features were called, of which 55 are non-coding.
Output genome has the following feature types:
	Coding gene                      632 
	Non-coding repeat                 10 
	Non-coding rna                    45 
Overall, the genes have 246 distinct functions. 
The genes include 337 genes with a SEED annotation ontology across 204 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.017.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 22 contigs containing 1075402 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 1247 new features were called, of which 56 are non-coding.
Output genome has the following feature types:
	Coding gene                     1191 
	Non-coding repeat                  7 
	Non-coding rna                    49 
Overall, the genes have 361 distinct functions. 
The genes include 520 genes with a SEED annotation ontology across 278 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.019.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 87 contigs containing 2205918 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2368 new features were called, of which 131 are non-coding.
Output genome has the following feature types:
	Coding gene                     2237 
	Non-coding crispr_array            1 
	Non-coding crispr_repeat           5 
	Non-coding crispr_spacer           4 
	Non-coding repeat                 87 
	Non-coding rna                    34 
Overall, the genes have 1438 distinct functions. 
The genes include 1224 genes with a SEED annotation ontology across 851 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.020.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 339 contigs containing 3344248 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3842 new features were called, of which 171 are non-coding.
Output genome has the following feature types:
	Coding gene                     3671 
	Non-coding crispr_array            2 
	Non-coding crispr_repeat          41 
	Non-coding crispr_spacer          39 
	Non-coding repeat                 48 
	Non-coding rna                    41 
Overall, the genes have 1737 distinct functions. 
The genes include 1757 genes with a SEED annotation ontology across 1014 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.040.fasta_subsurface_gold_mine_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 85 contigs containing 2402219 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2888 new features were called, of which 310 are non-coding.
Output genome has the following feature types:
	Coding gene                     2578 
	Non-coding crispr_array            2 
	Non-coding crispr_repeat          65 
	Non-coding crispr_spacer          63 
	Non-coding repeat                130 
	Non-coding rna                    50 
Overall, the genes have 1212 distinct functions. 
The genes include 1265 genes with a SEED annotation ontology across 788 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.015.fasta_subsurface_gold_mine_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • annotation_report.subsurface_gold_mine_extracted_bins_annotated - Microbial Annotation Report

Q17 Did all assembled genomes annotate correctly?

Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 5m 55s.
Objects
Created Object Name Type Description
Bin.004.fasta_assembly.RAST Genome Annotated genome
hydraulic_fracture_well_Bin.004_annotated_assembly GenomeSet Genome Set
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 75 contigs containing 3624967 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4069 new features were called, of which 680 are non-coding.
Output genome has the following feature types:
	Coding gene                     3389 
	Non-coding crispr_array            3 
	Non-coding crispr_repeat         247 
	Non-coding crispr_spacer         244 
	Non-coding repeat                129 
	Non-coding rna                    57 
Overall, the genes have 2271 distinct functions. 
The genes include 1778 genes with a SEED annotation ontology across 1273 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.004.fasta_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • annotation_report.hydraulic_fracture_well_Bin.004_annotated_assembly - Microbial Annotation Report
Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 20m 25s.
Objects
Created Object Name                             Type        Description
Bin.003.fasta_assembly.RAST                     Genome      Annotated genome
Bin.013.fasta_assembly.RAST                     Genome      Annotated genome
Bin.005.fasta_assembly.RAST                     Genome      Annotated genome
subsurface_gas_well_extracted_bins_annotated    GenomeSet   Genome Set
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 85 contigs containing 4601080 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4522 new features were called, of which 158 are non-coding.
Output genome has the following feature types:
	Coding gene                     4364 
	Non-coding repeat                104 
	Non-coding rna                    54 
Overall, the genes have 2996 distinct functions. 
The genes include 1606 genes with a SEED annotation ontology across 1174 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.003.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 191 contigs containing 4467743 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4441 new features were called, of which 90 are non-coding.
Output genome has the following feature types:
	Coding gene                     4351 
	Non-coding repeat                 49 
	Non-coding rna                    41 
Overall, the genes have 2662 distinct functions. 
The genes include 1851 genes with a SEED annotation ontology across 1186 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.013.fasta_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 127 contigs containing 4858083 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4781 new features were called, of which 120 are non-coding.
Output genome has the following feature types:
	Coding gene                     4661 
	Non-coding repeat                 46 
	Non-coding rna                    74 
Overall, the genes have 2211 distinct functions. 
The genes include 1980 genes with a SEED annotation ontology across 1116 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.005.fasta_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • annotation_report.subsurface_gas_well_extracted_bins_annotated - Microbial Annotation Report
Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 56m 25s.
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 328 contigs containing 4422408 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4368 new features were called, of which 84 are non-coding.
Output genome has the following feature types:
	Coding gene                     4284 
	Non-coding repeat                 20 
	Non-coding rna                    64 
Overall, the genes have 2995 distinct functions. 
The genes include 1967 genes with a SEED annotation ontology across 1474 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.012.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 21 contigs containing 3884316 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3412 new features were called, of which 50 are non-coding.
Output genome has the following feature types:
	Coding gene                     3362 
	Non-coding repeat                 10 
	Non-coding rna                    40 
Overall, the genes have 1625 distinct functions. 
The genes include 1522 genes with a SEED annotation ontology across 854 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.010.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 34 contigs containing 3177223 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3105 new features were called, of which 113 are non-coding.
Output genome has the following feature types:
	Coding gene                     2992 
	Non-coding repeat                 56 
	Non-coding rna                    57 
Overall, the genes have 1576 distinct functions. 
The genes include 1395 genes with a SEED annotation ontology across 790 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.009.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 64 contigs containing 6397280 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 6032 new features were called, of which 81 are non-coding.
Output genome has the following feature types:
	Coding gene                     5951 
	Non-coding repeat                 31 
	Non-coding rna                    50 
Overall, the genes have 3129 distinct functions. 
The genes include 2588 genes with a SEED annotation ontology across 1458 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.008.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 123 contigs containing 5324791 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 5069 new features were called, of which 70 are non-coding.
Output genome has the following feature types:
	Coding gene                     4999 
	Non-coding repeat                 29 
	Non-coding rna                    41 
Overall, the genes have 2933 distinct functions. 
The genes include 2304 genes with a SEED annotation ontology across 1385 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.015.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 62 contigs containing 6676535 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 6156 new features were called, of which 117 are non-coding.
Output genome has the following feature types:
	Coding gene                     6039 
	Non-coding repeat                 51 
	Non-coding rna                    66 
Overall, the genes have 2270 distinct functions. 
The genes include 2399 genes with a SEED annotation ontology across 996 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.014.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 27 contigs containing 3919630 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3807 new features were called, of which 70 are non-coding.
Output genome has the following feature types:
	Coding gene                     3737 
	Non-coding repeat                 19 
	Non-coding rna                    51 
Overall, the genes have 2059 distinct functions. 
The genes include 1896 genes with a SEED annotation ontology across 1163 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.001.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 560 contigs containing 3298417 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3683 new features were called, of which 27 are non-coding.
Output genome has the following feature types:
	Coding gene                     3656 
	Non-coding repeat                  2 
	Non-coding rna                    25 
Overall, the genes have 2252 distinct functions. 
The genes include 1440 genes with a SEED annotation ontology across 955 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.026.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 19 contigs containing 3296343 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3242 new features were called, of which 52 are non-coding.
Output genome has the following feature types:
	Coding gene                     3190 
	Non-coding repeat                  8 
	Non-coding rna                    44 
Overall, the genes have 2006 distinct functions. 
The genes include 1677 genes with a SEED annotation ontology across 1085 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.005.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 216 contigs containing 4630571 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4695 new features were called, of which 71 are non-coding.
Output genome has the following feature types:
	Coding gene                     4624 
	Non-coding repeat                 19 
	Non-coding rna                    52 
Overall, the genes have 3392 distinct functions. 
The genes include 2041 genes with a SEED annotation ontology across 1681 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.002.fasta_rice_root_plaque_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 131 contigs containing 4160543 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 4100 new features were called, of which 79 are non-coding.
Output genome has the following feature types:
	Coding gene                     4021 
	Non-coding repeat                 35 
	Non-coding rna                    44 
Overall, the genes have 2272 distinct functions. 
The genes include 2144 genes with a SEED annotation ontology across 1142 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.016.fasta_rice_root_plaque_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • annotation_report.rice_root_iron_plaque_annotated_extracted_bins - Microbial Annotation Report
Annotate bacterial or archaeal assemblies and/or assembly sets using RASTtk.
This app completed without errors in 18m 50s.
Objects
Created Object Name                     Type        Description
Bin.001.fastapeat_soil_assembly.RAST    Genome      Annotated genome
Bin.007.fastapeat_soil_assembly.RAST    Genome      Annotated genome
Bin.009.fastapeat_soil_assembly.RAST    Genome      Annotated genome
Peat_soil_extracted_bins_annotated      GenomeSet   Genome Set
Summary
The RAST algorithm was applied to annotating a genome sequence comprised of 173 contigs containing 1950668 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2844 new features were called, of which 218 are non-coding.
Output genome has the following feature types:
	Coding gene                     2626 
	Non-coding repeat                180 
	Non-coding rna                    38 
Overall, the genes have 774 distinct functions. 
The genes include 905 genes with a SEED annotation ontology across 468 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.001.fastapeat_soil_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 186 contigs containing 2021755 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 2229 new features were called, of which 70 are non-coding.
Output genome has the following feature types:
	Coding gene                     2159 
	Non-coding repeat                 26 
	Non-coding rna                    44 
Overall, the genes have 877 distinct functions. 
The genes include 1117 genes with a SEED annotation ontology across 626 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.007.fastapeat_soil_assembly succeeded!

The RAST algorithm was applied to annotating a genome sequence comprised of 492 contigs containing 2734491 nucleotides. 
No initial gene calls were provided.
Standard features were called using: glimmer3; prodigal.
A scan was conducted for the following additional feature types: rRNA; tRNA; selenoproteins; pyrrolysoproteins; repeat regions; crispr.
The genome features were functionally annotated using the following algorithm(s): Kmers V2; Kmers V1; protein similarity.
In addition to the remaining original 0 coding features and 0 non-coding features, 3214 new features were called, of which 138 are non-coding.
Output genome has the following feature types:
	Coding gene                     3076 
	Non-coding repeat                108 
	Non-coding rna                    30 
Overall, the genes have 1139 distinct functions. 
The genes include 1365 genes with a SEED annotation ontology across 736 distinct SEED functions.
The number of distinct functions can exceed the number of genes because some genes have multiple functions.
Bin.009.fastapeat_soil_assembly succeeded!

Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • annotation_report.Peat_soil_extracted_bins_annotated - Microbial Annotation Report
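
Each RAST summary above reports a total number of new features, how many of those are non-coding, and a breakdown by feature type. If you copy a summary out of a report, a quick way to sanity-check it is to confirm that the numbers add up. The sketch below is only an illustration: the function names are hypothetical (they are not part of RASTtk or KBase), and it assumes the summary text has the same wording as the reports shown in this Narrative. The example text is taken from the Bin.004 summary above.

```python
import re

def parse_rast_summary(text):
    """Pull the feature totals and the per-type counts out of one RAST summary block."""
    called = re.search(
        r"(\d+) new features were called, of which (\d+) are non-coding", text
    )
    total, noncoding = int(called.group(1)), int(called.group(2))
    # Feature-type rows look like "Non-coding rna                    57"
    rows = re.findall(
        r"^\s*((?:Coding|Non-coding) \S+)\s+(\d+)\s*$", text, flags=re.M
    )
    counts = {name: int(n) for name, n in rows}
    return total, noncoding, counts

def counts_are_consistent(total, noncoding, counts):
    """True if coding + non-coding equals the total and the non-coding rows sum correctly."""
    coding = counts.get("Coding gene", 0)
    listed_noncoding = sum(
        n for name, n in counts.items() if name.startswith("Non-coding")
    )
    return coding + noncoding == total and listed_noncoding == noncoding

# Text copied from the Bin.004 summary in this Narrative
summary = """
In addition to the remaining original 0 coding features and 0 non-coding features, 4069 new features were called, of which 680 are non-coding.
Output genome has the following feature types:
	Coding gene                     3389
	Non-coding crispr_array            3
	Non-coding crispr_repeat         247
	Non-coding crispr_spacer         244
	Non-coding repeat                129
	Non-coding rna                    57
"""

total, noncoding, counts = parse_rast_summary(summary)
print(total, noncoding, counts_are_consistent(total, noncoding, counts))
```

For Bin.004, the 3389 coding genes plus the 680 non-coding features (3 + 247 + 244 + 129 + 57) give the 4069 features reported, so the check passes.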

Step 10. Identify Your Assemblies

Now that you've annotated your assemblies, it's time to figure out what they are!

App: GTDB-Tk classify

Timing: about 30 minutes, depending on the number of assemblies and the queue time

View Configure:

Assembly Input: Add in your annotated assembly set from above.

Minimum Alignment Percent: This filters out genomes that have an insufficient percentage of amino acids (AAs) in the multiple sequence alignment (MSA) generated by the app. The default is 10; to increase specificity, raise this percentage. In my runs below I kept the default.
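
To make the Minimum Alignment Percent idea concrete, here is a minimal sketch of that kind of filter. It is not GTDB-Tk's own code: the function names, the `min_percent` parameter, and the toy alignment rows are all made up for illustration, and it assumes gaps in the alignment are written as `-` or `.`.

```python
def percent_aligned(row):
    """Percent of alignment columns where this genome has an amino acid rather than a gap."""
    if not row:
        return 0.0
    residues = sum(1 for ch in row if ch not in "-.")
    return 100.0 * residues / len(row)

def filter_by_min_percent(msa, min_percent=10.0):
    """Keep only the genomes whose aligned row meets the minimum percentage."""
    return {
        name: row for name, row in msa.items()
        if percent_aligned(row) >= min_percent
    }

# Toy alignment rows (invented, 10 columns each)
msa = {
    "example_bin_A": "MKT-LLA-GG",  # 8 of 10 columns filled
    "example_bin_B": "---------G",  # 1 of 10 columns filled
    "example_bin_C": "M--------G",  # 2 of 10 columns filled
}

for threshold in (10.0, 50.0):
    kept = filter_by_min_percent(msa, min_percent=threshold)
    print(threshold, sorted(kept))
```

With the default of 10, all three toy genomes are kept. Raising the threshold to 50 keeps only the genome that fills most of the alignment, which is what raising the percentage does in the app.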

Results: The results of this app are spread across four tabs. The first tab is a table for bacteria and the second shows the same table for archaea. In each table, the first column gives the bin and the second gives the classification of that bin based on GTDB reference genomes, which are drawn from GenBank and RefSeq. The middle columns explain how the classification was determined. The right-most column is also important because it notes any concerns about your genome, for example high levels of contamination.
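
If you export the classification column, you can check for Delftia without reading every row by eye. The sketch below assumes the classification is a semicolon-separated lineage with rank prefixes (`d__`, `p__`, ... `g__`, `s__`). The bin names and lineage strings are invented for illustration and are not results from this Narrative; the function names are hypothetical as well.

```python
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_lineage(lineage):
    """Split a rank-prefixed lineage string into a {rank: name} dictionary."""
    ranks = {}
    for field in lineage.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks

def is_genus(lineage, genus):
    """True if the lineage is assigned to the given genus."""
    return parse_lineage(lineage).get("genus", "") == genus

# Invented example rows (not taken from the runs in this Narrative)
classifications = {
    "example_bin_1": "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
                     "o__Burkholderiales;f__Burkholderiaceae;g__Delftia;s__",
    "example_bin_2": "d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
                     "f__Bacillaceae;g__Bacillus;s__",
}

for bin_name, lineage in classifications.items():
    print(bin_name, parse_lineage(lineage).get("genus"), is_genus(lineage, "Delftia"))
```

An empty `s__` field, as in both example rows, means the bin was placed in a genus but not assigned to a named species.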

Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver 1.1.0
This app completed without errors in 49m 10s.

Q18 Were any of the assembled genomes from Delftia? If not, what did you find instead?

Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver 1.1.0
This app completed without errors in 40m 2s.

Q19 Was the assembled genome from Delftia? If not, what was it instead?

Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver 1.1.0
This app completed without errors in 1h 3m 24s.
Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver 1.1.0
This app completed without errors in 50m 11s.
Obtain objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB) ver 1.1.0
This app completed without errors in 41m 11s.

Step 11. Find Relatives

Lastly, I'll identify close relatives of my bins to establish some additional phylogenetic context for them.

App: Insert Set of Genomes into SpeciesTree OR Insert Genome into SpeciesTree

  • If you only have one decent quality genome, use the Insert Genome into SpeciesTree app.

Timing: 5-10 minutes

View Configure:

Genome Set: Use the annotated genome set from RAST that contains all your annotated bins from a sample.

Neighbor Public Genome Count: This is the number of additional genomes that will be added to the phylogenetic tree.

Copy Public Genomes to Workspace: Checking this box will add all the new genomes from the species tree into your data panel on the left. If you're stopping here you don't need to do this, but some analyses you would perform after this step might require you to save these genomes.

Output Tree: Name the tree that will be produced.

Output GenomeSet: Name the new GenomeSet. This matters more if you're saving all the public genomes.

Results: This app will generate a tree showing your assembled genomes highlighted in blue and additional genomes in white.
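
The app also writes the tree out in Newick format (the `.newick` files listed with each run below). If you download one, you can list the genomes on the tree with a few lines of code. This is a minimal sketch, assuming plain unquoted leaf names; the function name and the example tree string are invented for illustration, and a dedicated phylogenetics library would be a safer choice for real analyses.

```python
import re

def leaf_names(newick):
    """Return the leaf (tip) labels of a Newick tree string, in order of appearance.

    A leaf label directly follows '(' or ','. Labels that follow ')' are
    internal-node labels or support values, so they are skipped.
    """
    found = re.findall(r"[(,]\s*([^():,;]+)", newick)
    return [name.strip() for name in found if name.strip()]

# Invented example tree: one bin of mine plus two public genomes
tree = "((my_bin_004:0.02,Public_genome_A:0.03)0.98:0.10,Public_genome_B:0.15);"

names = leaf_names(tree)
print(names)
print([n for n in names if n.startswith("my_bin")])
```

Filtering the leaf list by a naming pattern, as in the last line, is one way to separate your own bins from the public genomes that the app added as neighbors.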

Add a user-provided GenomeSet to a KBase SpeciesTree.
This app produced errors.
No output found.
v1 - KBaseTrees.Tree-1.0
The viewer for the data in this Cell is available at the original Narrative here: https://narrative.kbase.us/narrative/67335
Add one or more Genomes to a KBase SpeciesTree.
This app completed without errors in 3m 15s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • hydraulic_fracture_fluid_bin004_tree.newick
  • hydraulic_fracture_fluid_bin004_tree-labels.newick
  • hydraulic_fracture_fluid_bin004_tree.png
  • hydraulic_fracture_fluid_bin004_tree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 4m 35s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • subsurface_gas_well_genomes_tree.newick
  • subsurface_gas_well_genomes_tree-labels.newick
  • subsurface_gas_well_genomes_tree.png
  • subsurface_gas_well_genomes_tree.pdf
Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 5m 53s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • rice_root_iron_plaque_bin_tree.newick
  • rice_root_iron_plaque_bin_tree-labels.newick
  • rice_root_iron_plaque_bin_tree.png
  • rice_root_iron_plaque_bin_tree.pdf

Q20 Did any of the genomes not have any close relatives indicated on the phylogenetic tree? Why do you think this happened?

Add a user-provided GenomeSet to a KBase SpeciesTree.
This app completed without errors in 3m 55s.
Files
These are only available in the live Narrative: https://narrative.kbase.us/narrative/67335
  • peat_soil_bins_tree.newick
  • peat_soil_bins_tree-labels.newick
  • peat_soil_bins_tree.png
  • peat_soil_bins_tree.pdf

Further Steps

Unfortunately, I didn't find any Delftia in these samples. If I had, I could have run further analyses, including metabolic modeling, multiple sequence alignments (MSAs) of any genes of interest, and further domain annotation.

If you do find Delftia in your sample and are curious, the apps for these analyses are inserted below.

Generate a draft metabolic model based on an annotated genome.
This app is new, and hasn't been started.
No output found.
Build a Multiple Sequence Alignment (MSA) for nucleotide sequences using MUSCLE.
This app is new, and hasn't been started.
No output found.
Build a Multiple Sequence Alignment (MSA) for protein sequences using MUSCLE.
This app is new, and hasn't been started.
No output found.
Annotate a Genome object with protein domains from widely used domain libraries.
This app is new, and hasn't been started.
No output found.

References

  1. Johnston, C., Wyatt, M., Li, X. et al. Gold biomineralization by a metallophore from a gold-associated microbe. Nat Chem Biol 9, 241–243 (2013). https://doi.org/10.1038/nchembio.1179
  2. http://2013.igem.org/Team:Heidelberg/Project/Delftibactin
  3. Perry, B.J. et al. Complete Genome Sequence of Delftia acidovorans RAY209, a Plant Growth-Promoting Rhizobacterium for Canola and Soybean. Genome Announc 5(44), e01224-17 (2017). https://doi.org/10.1128/genomeA.01224-17
  4. Image created by Lauren Ramilo.

Apps

  1. Annotate Domains in a Genome
    • Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402. doi:10.1093/nar/25.17.3389
    • Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421
    • Eddy SR. Accelerated Profile HMM Searches. PLOS Computational Biology. 2011;7: e1002195. doi:10.1371/journal.pcbi.1002195
    • Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44: D279–D285. doi:10.1093/nar/gkv1344
    • Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res. 2013;41: D387–D395. doi:10.1093/nar/gks1234
    • Letunic I, Bork P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 2018;46: D493–D496. doi:10.1093/nar/gkx922
    • Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res. 2015;43: D257–260. doi:10.1093/nar/gku949
    • Marchler-Bauer A, Bo Y, Han L, He J, Lanczycki CJ, Lu S, et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017;45: D200–D203. doi:10.1093/nar/gkw1129
    • Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, et al. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007;35: D260–264. doi:10.1093/nar/gkl1043
    • Tatusov RL, Koonin EV, Lipman DJ. A Genomic Perspective on Protein Families. Science. 1997;278: 631–637. doi:10.1126/science.278.5338.631
  2. Annotate Multiple Microbial Assemblies
    • [1] Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9: 75. doi:10.1186/1471-2164-9-75
    • [2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206–D214. doi:10.1093/nar/gkt1226
    • [3] Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, et al. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep. 2015;5. doi:10.1038/srep08365
    • [4] Kent WJ. BLAT: The BLAST-Like Alignment Tool. Genome Res. 2002;12: 656–664. doi:10.1101/gr.229202
    • [5] Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. doi:10.1186/1471-2105-10-421
    • [6] Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25: 955–964.
    • [7] Cobucci-Ponzano B, Rossi M, Moracci M. Translational recoding in archaea. Extremophiles. 2012;16: 793–803. doi:10.1007/s00792-012-0482-8
    • [8] Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 2006;34: D32–D36. doi:10.1093/nar/gkj014
    • [9] van Belkum A, Sluijuter M, de Groot R, Verbrugh H, Hermans PW. Novel BOX repeat PCR assay for high-resolution typing of Streptococcus pneumoniae strains. J Clin Microbiol. 1996;34: 1176–1179.
    • [10] Croucher NJ, Vernikos GS, Parkhill J, Bentley SD. Identification, variation and transcription of pneumococcal repeat sequences. BMC Genomics. 2011;12: 120. doi:10.1186/1471-2164-12-120
    • [11] Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119
    • [12] Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23: 673–679. doi:10.1093/bioinformatics/btm009
    • [13] Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012;40: e126. doi:10.1093/nar/gks406
  3. Assemble Reads with IDBA-UD - v1.1.3
    • Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420–1428. doi:10.1093/bioinformatics/bts174
  4. Assemble Reads with MEGAHIT v1.2.9
    • Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31: 1674–1676. doi:10.1093/bioinformatics/btv033
  5. Assemble Reads with metaSPAdes - v3.13.0
    • Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27: 824–834. doi:10.1101/gr.213959.116
  6. Assess Genome Quality with CheckM - v1.0.18
    • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114
    • CheckM source:
    • Additional info:
  7. Assess Read Quality with FastQC - v0.11.5
    • FastQC source: Bioinformatics Group at the Babraham Institute, UK.
  8. Bin Contigs using MaxBin2 - v2.2.4
    • Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32: 605–607. doi:10.1093/bioinformatics/btv638
    • Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26
    • Maxbin2 source:
    • Maxbin source:
  9. Build Metabolic Model
    • [1] Henry CS, DeJongh M, Best AA, Frybarger PM, Linsay B, Stevens RL. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol. 2010;28: 977–982. doi:10.1038/nbt.1672
    • [2] Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42: D206–D214. doi:10.1093/nar/gkt1226
    • [3] Latendresse M. Efficiently gap-filling reaction networks. BMC Bioinformatics. 2014;15: 225. doi:10.1186/1471-2105-15-225
    • [4] Dreyfuss JM, Zucker JD, Hood HM, Ocasio LR, Sachs MS, Galagan JE. Reconstruction and Validation of a Genome-Scale Metabolic Model for the Filamentous Fungus Neurospora crassa Using FARM. PLOS Computational Biology. 2013;9: e1003126. doi:10.1371/journal.pcbi.1003126
    • [5] Mahadevan R, Schilling CH. The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng. 2003;5: 264–276.
  10. Classify Taxonomy of Metagenomic Reads with GOTTCHA2 - v2.1.6
    • Freitas TAK, Li P-E, Scholz MB, Chain PSG. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Nucleic Acids Res. 2015. doi:10.1093/nar/gkv180
    • GOTTCHA2 DBs from:
    • Krona homepage:
    • Github for Krona:
  11. Classify Taxonomy of Metagenomic Reads with Kaiju - v1.7.2
    • Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7: 11257. doi:10.1038/ncomms11257
    • Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi:10.1186/1471-2105-12-385
    • Kaiju Homepage:
    • Kaiju DBs from:
    • Github for Kaiju:
    • Krona homepage:
    • Github for Krona:
  12. Compare Assembled Contig Distributions - v1.1.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  13. Extract Bins as Assemblies from BinnedContigs - v1.0.2
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  14. Filter Bins by Quality with CheckM - v1.0.18
    • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114
    • CheckM source:
    • Additional info:
  15. GTDB-Tk classify
    • Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks, GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database, Bioinformatics, Volume 36, Issue 6, 15 March 2020, Pages 1925–1927. DOI: https://doi.org/10.1093/bioinformatics/btz848
    • Parks, D., Chuvochina, M., Waite, D. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018). DOI: https://doi.org/10.1038/nbt.4229
    • Parks DH, Chuvochina M, Chaumeil PA, Rinke C, Mussig AJ, Hugenholtz P. A complete domain-to-species taxonomy for Bacteria and Archaea [published online ahead of print, 2020 Apr 27]. Nat Biotechnol. 2020;10.1038/s41587-020-0501-8. DOI:10.1038/s41587-020-0501-8
    • Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11:538. Published 2010 Oct 30. doi:10.1186/1471-2105-11-538
    • Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9(1):5114. Published 2018 Nov 30. DOI:10.1038/s41467-018-07641-9
    • Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. Published 2010 Mar 8. DOI:10.1186/1471-2105-11-119
    • Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3):e9490. Published 2010 Mar 10. DOI:10.1371/journal.pone.0009490 link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/
    • Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195. DOI:10.1371/journal.pcbi.1002195
  16. Import SRA File as Reads From Web - v1.0.7
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  17. Insert Genome Into SpeciesTree - v2.2.0
    • Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One. 2010;5. doi:10.1371/journal.pone.0009490
  18. Insert Set of Genomes Into SpeciesTree - v2.2.0
    • Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One. 2010;5. doi:10.1371/journal.pone.0009490
  19. MUSCLE Multiple Sequence Alignment (DNA) - v3.8.425
    • Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32: 1792–1797. doi:10.1093/nar/gkh340
    • MUSCLE 3.8.425 Source:
  20. MUSCLE Multiple Sequence Alignment (Protein) - v3.8.425
    • Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32: 1792–1797. doi:10.1093/nar/gkh340
    • MUSCLE 3.8.425 Source:
  21. Trim Reads with Trimmomatic - v0.36
    • Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170