Generated December 3, 2020

KBase in the Classroom: Genome Exploration


Module 1: Genome Assembly

Authors: Steven Biller and Ellen Dow

Topics in Biology Course Applications for KBase

Synopsis: This module introduces students to the process of genome assembly by reconstructing the genome of a human pathogen.


  • Undergraduate Students
  • Graduate Students

Learning goals

At the end of this module, you should be able to:

  • Explain how the 'shotgun' genome sequencing approach works
  • Describe the basic process of how individual DNA sequences are generated
  • Understand how to evaluate the output of a genome assembly run

Biological Topics and Concepts

  • genome sequencing

Activity Description

This Narrative steps students through the process of assembling a bacterial genome from short-read data. The overall goal of this exercise is to learn to use genomic tools in order to develop a hypothesis concerning the potential genomic determinants of antibiotic resistance in a strain of Methicillin-resistant Staphylococcus aureus (MRSA).


  • Sequence
  • Contig
  • Assembly
  • Closed genome vs draft genome

Genome Modules

  1. Genome Assembly
  2. Genome Annotation
  3. Genome Analysis
  4. Comparative Genomics


v1.1 (27 Oct 2020): Revised comprehension questions
v1.0 (5 Oct 2020): Student Version
v0.9 (22 Sep 2020): Minor edits in preparation for release
v0.5 (26 Aug 2020): Completed initial set of student questions
v0.1 (5 Aug 2020): Drafting Fundamentals of Genome Exploration Module

Know your organism: Staphylococcus aureus

S. aureus ("staph") is a common component of the human microbiome. These Gram-positive bacterial cells are frequently found on your skin and in your upper respiratory tract, where in most instances they are benign. Some strains of S. aureus, however, can cause infected skin lesions and can also lead to infections in the lungs, bloodstream, joints, and elsewhere. Of particular concern is the emergence in recent years of Methicillin-resistant S. aureus (MRSA). These microbes are resistant to beta-lactam antibiotics such as penicillin, oxacillin and methicillin, making these infections very difficult to treat. MRSA strains were first identified in the 1960s, but have become increasingly prevalent, particularly in hospital settings, due to the overprescription of antibiotics. The CDC estimates that there were ~120,000 S. aureus bloodstream infections in 2017, with nearly 20,000 associated deaths.

Staphylococcus_aureus_VISA_2.jpg (Image courtesy of CDC/ Matthew J. Arduino, DRPH; CDC PHL 1157)

'Shotgun' sequencing


A common approach for sequencing a microbial genome is to take many copies of the organism's DNA, randomly break them up into shorter pieces, and then sequence each individual fragment. This 'shotgun sequencing' approach has been used to sequence many strains of MRSA, which has helped scientists track and mitigate the spread of the organism. Here, we are going to use shotgun sequencing data to identify antimicrobial resistance genes!
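To make the idea concrete, here is a minimal Python sketch of shotgun sampling. The toy genome, read count, and read length are made-up values (real genomes are millions of bases long), and the function names are our own for illustration; they are not part of KBase or of any sequencer's software.

```python
import random

def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample n_reads random substrings of length read_len from genome."""
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    reads = []
    for _ in range(n_reads):
        # Pick a random start position that leaves room for a full read.
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

def mean_coverage(genome_len, n_reads, read_len):
    """Average number of reads covering each base."""
    return n_reads * read_len / genome_len

# Hypothetical 44 bp "genome", used only for illustration.
toy_genome = "ATGCGTACGTTAGCCGATAGCTTAACGGATCCGTAGCTAGGCTA"
reads = shotgun_reads(toy_genome, n_reads=20, read_len=10)
coverage = mean_coverage(len(toy_genome), n_reads=20, read_len=10)
```

With 20 reads of 10 bp drawn from a 44 bp sequence, each base is covered about 4.5 times on average. Because the fragments are random, some positions will be covered more often than others.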

Each individual DNA sequence read obtained from shotgun sequencing is short - typically on the order of 40-300bp when using Illumina sequencing technology. Thus, any one fragment is of limited value for answering our question about the genetic basis of antibiotic resistance. Instead, we need to recreate the original genome by piecing all of these fragments together. You can think about this like putting together a giant jigsaw puzzle. With a jigsaw puzzle you look for overlapping regions of similar shapes, colors, etc. to help you figure out how to put the pieces together. Similarly, to reassemble the genome a computer algorithm looks through the sequencing reads - often millions of them! - and looks for short regions of overlap. However, unlike with a jigsaw puzzle, we do not know ahead of time what the 'correct' picture is supposed to look like, nor if we even have all of the pieces!
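The 'look for overlaps' idea can be sketched in a few lines of Python. This is only a toy illustration under simple assumptions (error-free reads, both from the same strand); the function names and the example reads are hypothetical, and real assemblers use far more efficient methods.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that matches a prefix of `right`.

    Returns 0 if the best overlap is shorter than min_overlap.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

# Two made-up reads that share the four bases "GTAC".
merged = merge_reads("ATGCGTAC", "GTACGGAT")
```

Here the two 8 bp reads overlap by 4 bases, so they merge into one 12 bp sequence. Two reads with no shared end, such as "AAAA" and "CCCC", cannot be joined and stay as separate pieces.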


Sequencing an individual fragment (Illumina technology)

This genome was sequenced using paired reads. This means that from each individual random fragment of DNA from the genome, the sequencer reads the first 100 nucleotides from each end. Though we don't know the sequence in the middle of this fragment, we do know that the two paired sequences were located close to one another in the genome - information that improves our assembly.
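As a rough sketch of what a read pair looks like, the Python below takes the ends of one fragment. It assumes the usual paired-end convention that the second read comes from the opposite strand, so it appears as a reverse complement. The fragment and the short read length are toy values chosen so the result is easy to check by eye; the function names are our own.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_len=100):
    """Return (forward, reverse) reads taken from the two ends of a fragment."""
    forward = fragment[:read_len]
    # The reverse read starts at the far end, on the opposite strand.
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse

# Hypothetical 18 bp fragment, read 4 bases in from each end.
forward, reverse = paired_end_reads("ATGCCCGGGAAATTTCAG", read_len=4)
```

The middle of the fragment ("CCGGGAAATT" here) is never read, but the assembler still knows that the two reads sit roughly one fragment length apart.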


A detailed animation reviewing the process of Illumina sequencing can be found here

Importing the sequence reads into KBase

Public Data

We will be working with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data was generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project - a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.

Importing Data

The data import has already been carried out using the "Import SRA File as Reads From Web" tool.

Import an SRA file from a web URL into your Narrative as a Reads data object.
This app completed without errors in 23m 3s.
Created Object Name    Type                Description
SRR088898              PairedEndLibrary    Imported Reads

Are the reads any good?

During the sequencing process, a variety of sources of error can make the instrument less confident that it has called the correct base at a given position. The instrument records this confidence for every base as a 'quality score'. The general expected pattern is that sequence quality is good at the beginning of a read and gradually declines over its length. Low quality regions of a sequence are thus more likely to contain errors.

To examine the quality of our reads, let's use a tool called FastQC. Take a look below at the output from this tool. It shows the overall distribution of quality scores, on a scale from 0-40 (40 is best, 0 is worst). The 'green' range is considered generally acceptable quality scores, while sequences in the yellow to red region have a much higher probability of containing errors.
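If you are curious what the numbers mean, quality scores of this kind are usually Phred scores, where a score Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below assumes the common 'Phred+33' text encoding used in FASTQ files; the function names and the example quality string are ours, not FastQC's.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding: each character's ASCII code minus 33.
    """
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Estimated chance that a base with Phred score q was called wrongly."""
    return 10 ** (-q / 10)

# Made-up quality string: 'I' decodes to 40, '5' to 20, '!' to 0.
scores = phred_scores("I5!")
chance_wrong = [error_probability(q) for q in scores]
```

A score of 40 means roughly a 1 in 10,000 chance that the base is wrong, a score of 20 means about 1 in 100, and a score of 0 means the base call carries no information at all.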

Note that we imported paired sequence reads. You can look at the results from the forward and reverse sequences independently by using the "Page 1" (Forward) and "Page 2" (Reverse) buttons in the report below. For now, focus on the "Per base sequence quality" plot.

Question to answer in class:

Q1) How does the quality profile of the forward reads compare to that of the reverse reads?

A quality control application for high throughput sequence data.

Removing (trimming) away low-quality regions of sequence

With genome assemblies, errors in the input data reads can decrease the quality of the final assembly.

Remember that the distributions you examined above were averages across millions of reads - even if the population had overall good quality, individual reads can still have errors. To get rid of the low quality regions, we will use a tool called Trimmomatic. This will go through every individual read from the sequencer, and if there are stretches of a sequence read that fall below a threshold quality score, the tool will simply remove them from the dataset.

Note that this is not going to randomly delete every single bad base - the assembly algorithm knows that some errors occur. This step is instead focused on removing larger regions of reads, or even entire reads, that are not good enough to use.
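One way to picture this kind of trimming is a sliding window that moves along the read and cuts it where the average quality drops too low. The Python below is a simplified sketch of that idea, not Trimmomatic's actual code, and the window size, quality threshold, and minimum length are illustrative values rather than the settings used in this Narrative.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut a read at the first window whose mean quality is below min_avg.

    seq is the base string and quals is a list of per-base quality scores.
    """
    for start in range(len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            # Keep everything before the low-quality window.
            return seq[:start], quals[:start]
    return seq, quals

def keep_read(seq, min_len=36):
    """Discard reads that are too short to be useful after trimming."""
    return len(seq) >= min_len

# Made-up 10 bp read whose quality collapses toward the end.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGTAC", [38, 38, 37, 36, 35, 30, 12, 8, 5, 2]
)
```

In this example the first five bases survive and the poor-quality tail is removed. Notice that a single bad base surrounded by good ones would not trigger a cut, because the window average would stay high.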

Question to answer in class:

Q2) Of the total reads we started with, how many are now left after trimming?

Note: Remember that our sequences originally came in pairs! Thus, when the software describes 'read pairs' or instances where there are 'both surviving', you need to multiply that value by 2 to get the actual number of individual sequence reads involved.

Trim paired- or single-end Illumina reads with Trimmomatic.

What did the trimming do?

Let's check and see what the trimming step did to our input reads by running the FastQC tool again.

A quality control application for high throughput sequence data.

Sequences, assemble!

Now that all of our raw reads have been quality filtered, we're ready to put them together into a genome! First, we will use an algorithm called IDBA-UD to assemble the genome of our Staphylococcus aureus strain. Note that assembly algorithms can sometimes take a while to run, so this has already been run for you.

Most modern assembly algorithms that work with short-read sequence data are based on the idea of a de Bruijn graph. In essence, this approach takes each individual sequence read and breaks it up into all possible subsequences of a given length, called k-mers (k is typically an odd number). A 'graph' of the relationships between all k-mers in the original reads is created, which allows the algorithm to rapidly identify overlaps between subsequences and to reconstruct the original sequence. Note that this is a gross oversimplification of assembly - when working with real data, assemblers must handle all sorts of different errors, repeated sequences, and other challenges in the data.

debruijn_revised-01.png
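Here is a toy Python sketch of the de Bruijn idea. It is not how IDBA-UD or SPAdes are implemented: it assumes error-free reads from one strand with no repeats, so the graph is a single unbranched path. The function names and the two example reads are hypothetical.

```python
from collections import defaultdict

def kmers(read, k):
    """All subsequences of length k in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph):
    """Spell out the sequence along a simple path with no branches or cycles."""
    targets = {t for ts in graph.values() for t in ts}
    starts = [node for node in graph if node not in targets]
    if not starts:
        return ""  # no clear starting node (for example, a cycle)
    node = starts[0]
    seq = node
    seen = {node}
    while node in graph and len(graph[node]) == 1:
        node = next(iter(graph[node]))
        if node in seen:
            break  # stop rather than loop forever on a repeat
        seen.add(node)
        seq += node[-1]  # each step adds one new base
    return seq

# Two made-up overlapping reads, broken into 3-mers.
graph = de_bruijn_graph(["ATGGC", "TGGCA"], k=3)
assembled = walk_unbranched(graph)
```

The two 5 bp reads share most of their k-mers, so the graph collapses them into one path that spells a single 6 bp sequence. With real data, repeats and sequencing errors create branches in the graph, which is one reason assemblies end up in multiple pieces.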

Assemble paired-end reads from single-cell or metagenomic sequencing technologies using the IDBA-UD assembler.

Hmm... is that the best we can do?

The assembler pieced together as much of the DNA as possible into contiguous stretches of sequence (contigs). To figure out how well an assembly went, we first consider some descriptive statistics of what we got back. Take a look at the output from the QUAST tool, which describes some of the key results we consider when evaluating an assembly. There are a lot of numbers here, but let's focus on a few key ones:

  • Total # contigs: i.e., how many separate pieces of DNA the assembler was able to put together.
  • How long was the longest contig? Bigger is better!
  • What is the total length of the entire combined genome?
  • N50. A somewhat non-intuitive but commonly used statistic which reflects how much of your genome is in long contigs. This value is somewhat analogous to a median: if you order all of the contigs from longest to shortest and add up their lengths one by one, N50 is the length of the contig at which the running total reaches the midpoint (50%) of the total assembly size. Bigger is better!
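Since N50 is the least intuitive of these numbers, here is a short Python sketch of how it can be calculated. The contig lengths are made up for illustration and the function name is ours; QUAST reports this value for you.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, counted from the
    longest contig down, first reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs at all

# Hypothetical assembly with five contigs, 300 bp in total.
example_n50 = n50([100, 80, 60, 40, 20])
```

In this example the two longest contigs (100 + 80 = 180 bp) together pass the halfway mark of 150 bp, so the N50 is 80. If the same 300 bp were in a single contig, the N50 would be 300.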

But, IDBA-UD isn't the only tool out there to assemble genomes. Other assemblers use different approaches to handle the many complications that can arise with trying to piece all of these sequence reads together, make different assumptions about the nature of your input data, or are better optimized for specific types of assembly challenges. Let's try another one called SPAdes. Again, for time reasons this has already been done for you, using the same set of trimmed input data files as above.

Assemble reads using the SPAdes assembler.

How good are these assemblies? Which one is better? Is either of them 'right'?

Most bacterial genomes are found in the form of one circular chromosome (sometimes along with one or more plasmids). But shotgun sequencing doesn't always - and in fact usually doesn't - yield a single sequence. Instead, a genome could be reassembled into multiple contigs. It's as if you have different sections of your jigsaw puzzle put together, but still aren't sure how the sections fit together in relation to each other. A genome that is completely put together (i.e., each chromosome's entire sequence, from start to finish, is in one continuous piece) is known as a 'complete' or 'closed' genome; if each chromosome's sequence is made up of multiple smaller segments, where the relationship between the segments is not completely resolved, we refer to it as a 'draft' genome.

Questions to answer in class

Q3) If a bacterium had one circular chromosome and two plasmids, how many contigs would you expect to come out of the assembly if it was closed?

Q4) How many contigs - contiguous stretches of DNA - are in each of the two assemblies? Does this mean that the S. aureus strain has that many chromosomes in the cell?

Q5) Looking at the QUAST metrics from the two assemblies, which one do you think is the 'better' assembly, and why?

Note the name of the assembly object you've chosen as the winner - you'll need that in Part 2.

Comprehension questions

CQ1) A labmate provides you with a 'draft' quality genome sequence of a new Staphylococcus aureus strain. After thoroughly searching through the genome, your colleague is unable to find any antibiotic resistance genes/mutations, and therefore concludes that this strain must be Methicillin-sensitive. Given what you now know, do you agree or disagree with this conclusion, and why?

What's next?

Now you have a reasonable assembly of the DNA sequence of the genome. To help you figure out what it means, next we'll move on to Module 2: Annotation!


  1. Assemble Reads with IDBA-UD - v1.1.3
    • Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28: 1420-1428. doi:10.1093/bioinformatics/bts174
  2. Assemble Reads with SPAdes - v3.13.0
    • Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. 2012;19: 455-477. doi:10.1089/cmb.2012.0021
  3. Assess Read Quality with FastQC - v0.11.5
    • FastQC source: Bioinformatics Group at the Babraham Institute, UK.
  4. Import SRA File as Reads From Web - v1.0.7
    • Arkin AP, Cottingham RW, Henry CS, Harris NL, Stevens RL, Maslov S, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnology. 2018;36: 566. doi:10.1038/nbt.4163
  5. Trim Reads with Trimmomatic - v0.36
    • Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114-2120. doi:10.1093/bioinformatics/btu170