Generated December 3, 2020

Welcome to Metagenome Analysis 101

Student

Authors: Elisha Wood-Charlson, Jon Benskin, Carlos Goller, Ellen Dow

Audience

  • High School Students
  • Undergraduate Students
  • Graduate Students
  • Biology, Bioinformatics, Genetics, Genomics, Proteomics, CSS, etc

Learning goals

  • Evaluate read quality based on FastQC reports
  • Perform read trimming with Trimmomatic
  • Explain in your own words how adapter contamination can
    • a) affect read quality
    • b) be addressed with Trimmomatic

Biological Topics and Concepts

  • Gene sequencing
  • Data File types
  • Quality Control of raw files

Activity Description

This Narrative imports data for the Metagenome analysis modules and performs initial quality checks on that data.

Data Source

Where are these data from? Scientists dive deep to explore mysterious 'blue hole' on the Florida seabed. Link to article

Data Source: 106m depth, Collected May 2019 Patin, Nastassia; Stewart, Frank; Hall, Emily; Dietrich, Zoe; Beckler, Jordon (2020): Blue Hole Shotgun Metagenome: May 2019, 106 M. figshare. Dataset. https://doi.org/10.6084/m9.figshare.12644048.v1

Metagenome Modules

  1. Data Input
  2. Assembly
  3. Read-based Taxonomy
  4. Binning
  5. Bin-based Taxonomy
  6. Taxonomy and Evolution

Background

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-2120. doi:10.1093/bioinformatics/btu170

Videos:

Import a Paired-End Library into your Narrative as a Reads object.
This app completed without errors in 11m 25s.
Summary
Import Finished
Imported Reads: 1
Reads Name: BH_106m_052019
Reads Info:
  • qual_min: 2.0
  • qual_mean: 36.8494
  • qual_max: 38.0
  • qual_stdev: 3.0131
  • sequencing_tech: Illumina
  • number_of_duplicates: 68398
  • read_count: 7459854
  • read_length_mean: 301.0
  • read_length_stdev: 0.0
  • total_bases: 2245416054
  • single_genome: 0
  • gc_content: 0.46970300000000004
  • phred_type: 33

Step 1) Upload FASTQ file of raw read(s) and inspect the data

Click on the reads file in the Data Panel (top left) to add the data object to your Narrative (this panel). Inspect the data object by reviewing the Overview and Stats tabs.

Questions to answer:

Q1) How many reads are in your metagenome?

Q2) What is your mean read length?

Q3) What is the total sequencing size in gigabase pairs?

Q4) What is the GC percentage of the imported data?
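The arithmetic behind Q3 and Q4 can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_reads` is made up for this example, and the input values are copied (GC content rounded) from the import summary for BH_106m_052019. Check them against the Overview and Stats tabs of your own reads object.

```python
def summarize_reads(read_count, mean_read_length, gc_fraction):
    """Derive total sequencing size and GC percentage from basic read stats."""
    total_bases = read_count * mean_read_length
    return {
        "total_bases": total_bases,
        "gigabases": total_bases / 1e9,  # 1 Gbp = 1,000,000,000 bp
        "gc_percent": gc_fraction * 100,  # fraction (0-1) to percent
    }


# Values taken from the import summary; replace them with your own.
stats = summarize_reads(read_count=7459854, mean_read_length=301, gc_fraction=0.469703)
print(f"{stats['gigabases']:.2f} Gbp, {stats['gc_percent']:.2f}% GC")
```

Because every read in this data set has the same length (read_length_stdev is 0.0), multiplying read count by mean read length reproduces the reported total_bases exactly.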

Optional step if reads are not joined

If paired-end reads are imported as R1 and R2 (individual data sets), run FastQ-Join to combine them into a single paired-end read data set before continuing.

Step 2) Assess Read Quality with FastQC

FastQC runs a quality check on the sequence reads so that you can identify low-quality reads. In the App Panel (bottom left), search for FastQC. Click the App name to add the App to the Narrative below this cell. Select the reads file from the drop-down menu, then click Run to start the analysis.
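To see what a per-base quality report summarizes, here is a minimal Python sketch that decodes FASTQ quality strings and averages the score at each position. It assumes Phred+33 encoding (the import summary reports phred_type 33). The helper names and the three quality strings are invented for illustration, and this is not how FastQC itself is implemented.

```python
def phred33_scores(quality_line):
    """Convert a FASTQ quality string (Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def mean_quality_per_position(quality_lines):
    """Average the Phred score at each base position across reads."""
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred33_scores(line)):
            if i == len(totals):  # first time we see this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Three made-up quality strings: 'I' = Q40, '5' = Q20, '#' = Q2.
example = ["IIII5#", "III55#", "IIII##"]
means = mean_quality_per_position(example)
for position, mean_q in enumerate(means, start=1):
    print(position, round(mean_q, 1))
```

In this toy input the mean quality stays high for the first three positions and falls toward the 3' end, which is the same pattern you will look for in the FastQC plots.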

KBase Tips

Starting an analysis will save your Narrative workspace and send the commands to KBase's compute resources to run the job. If you need to close the Narrative at this point, or after any analysis begins, the job continues to run. Results will update in the App cell below automatically, once the job is complete.

Note: If you edit the Narrative and do NOT run an analysis, you must manually save the Narrative workspace by clicking the save icon in the menu at the top.

Questions to answer:

Note: Each read pair has a separate report page. Use the Page 1 and Page 2 buttons to review both reports.

Q5) At what base position does the average quality start to drop for the forward reads? For the reverse reads? (Scroll through the FastQC panel on the right.)

Q6) Based on these results, how should you trim your data? Which data would you expect to be changed after trimming?


Step 3) Trim reads with Trimmomatic

Trimmomatic removes low-quality reads as well as adapter sequences. Find the App in the App Panel and add it to the Narrative as in Step 2. Most of the advanced parameters can be left at their default values, but how do you know which adapter to select?
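One way to picture quality trimming is a sliding window that moves along the read from the 5' end and cuts the read where the average quality in the window drops below a threshold. The sketch below is a simplified illustration in the spirit of Trimmomatic's SLIDINGWINDOW step, not its exact implementation. The function name is hypothetical, and the window size of 4 and threshold of 15 are illustrative values that may differ from the settings in your App cell.

```python
def sliding_window_trim(scores, window=4, min_mean=15):
    """Return the number of leading bases to keep.

    Scan windows from the 5' end. At the first window whose mean
    quality is below min_mean, cut the read at the start of that window.
    Reads shorter than the window are kept unchanged.
    """
    for start in range(0, len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_mean:
            return start
    return len(scores)


# Made-up Phred scores for one read whose quality decays toward the 3' end.
read_scores = [38, 38, 37, 36, 30, 20, 12, 8, 5, 2]
keep = sliding_window_trim(read_scores)
print(read_scores[:keep])
```

Averaging over a window, rather than cutting at the first low-quality base, keeps a read from being truncated by a single poor base call in an otherwise good region.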

Advanced extension: Using the article here,

  • What is the theory behind trimming? Why is it important? Why is it difficult?

Step 4) Rerun FastQC

Questions to answer:

Q7) What percentage of your total paired reads (forward and reverse) survived Trimmomatic?

Q8) What is the mean read length after trimming? How has it changed from before trimming?

Hint: Data Panel objects have additional details that are displayed when you hover over the object and click on "..."

Q9) What has changed in your FastQC output since running Trimmomatic? (Scroll through the FastQC panel on the right.)
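For Q7, the survival percentage is the number of read pairs kept divided by the number of read pairs that went in. A minimal Python sketch follows; the function name and the counts are hypothetical placeholders, so substitute the numbers reported for your own Trimmomatic run.

```python
def percent_surviving(input_pairs, surviving_pairs):
    """Percentage of input read pairs that remain after trimming."""
    if input_pairs <= 0:
        raise ValueError("input_pairs must be a positive number")
    return 100.0 * surviving_pairs / input_pairs


# Hypothetical counts for illustration only; use the values from your own run.
example = percent_surviving(input_pairs=1000000, surviving_pairs=750000)
print(f"{example:.1f}% of read pairs survived")
```

The same before-and-after comparison works for Q8: subtract the mean read length after trimming from the mean read length before trimming to see how much was removed on average.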


Optional: Additional understanding of read quality - Adapters

Why did many of the reverse reads not survive Trimmomatic? Review the original FastQC results. Try running Trimmomatic without selecting an adapter for trimming, and compare the results. Remember that each of these samples went through numerous preparation steps in the laboratory prior to sequencing, and samples and protocols do not always produce great data. It is important to do proper QC before running any data analyses to ensure that quality data go into your analysis pipeline.
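Adapter sequence appears in a read when the DNA fragment is shorter than the read length, so the sequencer reads past the end of the fragment and into the adapter. The Python sketch below shows the basic idea of clipping it. It uses exact matching only, whereas Trimmomatic tolerates mismatches and has additional modes, so treat this as an illustration rather than a description of the App. The function name is hypothetical, and the adapter prefix shown is the start shared by common Illumina TruSeq adapters, which may not be the adapter used for this data set.

```python
# Illustrative adapter prefix; the correct adapter depends on the library prep kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def clip_adapter(read, adapter=ADAPTER_PREFIX, min_overlap=5):
    """Remove the adapter, and anything after it, from the 3' end of a read."""
    # Case 1: the whole adapter prefix occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter.
    for overlap in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


insert = "ACGTACGTAC"  # made-up fragment sequence
print(clip_adapter(insert + ADAPTER_PREFIX + "TTTT"))
print(clip_adapter(insert + ADAPTER_PREFIX[:7]))
```

A read that is mostly adapter becomes very short after clipping and may then be discarded by a minimum-length filter, which is one way reads fail to survive trimming.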


Apps

  1. Assess Read Quality with FastQC - v0.11.5
    • FastQC source: Bioinformatics Group at the Babraham Institute, UK.
  2. Import Paired-End Reads from Web - v1.0.12
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi: 10.1038/nbt.4163
  3. Trim Reads with Trimmomatic - v0.36
    • Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114-2120. doi:10.1093/bioinformatics/btu170