Authors: Steven Biller and Ellen Dow
Synopsis: This module introduces students to the process of genome assembly by reconstructing the genome of a human pathogen.
At the end of this module, you should be able to:
This Narrative steps students through the process of assembling a bacterial genome from short-read data. The overall goal of this exercise is to learn to use genomic tools in order to develop a hypothesis concerning the potential genomic determinants of antibiotic resistance in a strain of Methicillin-resistant Staphylococcus aureus (MRSA).
v1.1 (27 Oct 2020): Revised comprehension questions
v1.0 (5 Oct 2020): Student Version
v0.9 (22 Sep 2020): Minor edits in preparation for release
v0.5 (26 Aug 2020): Completed initial set of student questions
v0.1 (5 Aug 2020): Drafting Fundamentals of Genome Exploration Module
S. aureus ("staph") is a common component of the human microbiome. These Gram-positive bacterial cells are frequently found on your skin and in your upper respiratory tract, where in most instances they are benign. Some strains of S. aureus, however, can cause infected skin lesions and can also lead to infections in the lungs, bloodstream, joints, and elsewhere. Of particular concern is the emergence in recent years of Methicillin-resistant S. aureus (MRSA). These microbes are resistant to beta-lactam antibiotics such as penicillin, oxacillin and methicillin, making these infections very difficult to treat. MRSA strains were first identified in the 1960s, but have become increasingly prevalent - particularly in hospital settings - due to the overprescription of antibiotics. The CDC estimates that there were ~120,000 S. aureus bloodstream infections in 2017 - with nearly 20,000 associated deaths.
(Image courtesy of CDC/ Matthew J. Arduino, DRPH; CDC PHL 1157)
A common approach for sequencing a microbial genome is to take many copies of the organism's DNA, randomly break them up into shorter pieces, and then sequence each individual fragment. This 'shotgun sequencing' approach has been used to sequence many strains of MRSA, which has helped scientists track and mitigate the spread of the organism. Here, we are going to use shotgun sequencing data to identify antimicrobial resistance genes!
Each individual DNA sequence read obtained from shotgun sequencing is short - typically on the order of 40-300 bp when using Illumina sequencing technology. Thus, any one fragment is of limited value for answering our question about the genetic basis of antibiotic resistance. Instead, we need to recreate the original genome by piecing all of these fragments together. You can think about this like putting together a giant jigsaw puzzle. With a jigsaw puzzle, you look for overlapping regions of similar shapes, colors, etc. to help you figure out how to put the pieces together. Similarly, to reassemble the genome, a computer algorithm searches through the sequencing reads - often millions of them! - for short regions of overlap. However, unlike with a jigsaw puzzle, we do not know ahead of time what the 'correct' picture is supposed to look like, nor if we even have all of the pieces!
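To make the idea of "regions of overlap" concrete, here is a minimal Python sketch that finds the longest stretch where the end of one read matches the start of another and then merges the two. The function names, the example reads, and the `min_overlap` cutoff are made up for illustration; real assemblers use far more efficient methods and must also cope with sequencing errors.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def merge(a, b, min_overlap=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(a, b, min_overlap)
    return a + b[size:] if size else None


# Two made-up reads that share the bases "GCGT"
read_1 = "ATGGCGT"
read_2 = "GCGTACC"
print(overlap_length(read_1, read_2))  # 4
print(merge(read_1, read_2))           # ATGGCGTACC
```

Requiring a minimum overlap is a design choice: very short matches occur by chance all over a genome, so they are poor evidence that two reads truly belong together.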
This genome was sequenced using paired reads. This means that for each individual random fragment of DNA from the genome, the sequencer reads the first 100 nucleotides from each end. Though we don't know the sequence in the middle of this fragment, we do know that the two paired sequences were located close to one another in the genome - information that improves our assembly.
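As a rough illustration of what a read pair is, the Python sketch below takes a short made-up fragment and returns the read from each end. It assumes the usual convention that the second read comes from the opposite strand and is therefore the reverse complement of the fragment's far end. The function names and the tiny `read_length` are for demonstration only.

```python
# Lookup table for complementing DNA bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(fragment, read_length=100):
    """Return (forward, reverse) reads taken from the two ends of a fragment."""
    forward = fragment[:read_length]
    # The second read is sequenced from the opposite strand, starting at the far end.
    reverse = reverse_complement(fragment[-read_length:])
    return forward, reverse


fragment = "ATGGCGTACCTTGAGCATTCGA"  # made-up 22-base fragment
fwd, rev = paired_reads(fragment, read_length=5)
print(fwd)  # ATGGC
print(rev)  # TCGAA
```

The bases between the two reads are never sequenced, but an assembler can still use the fact that both reads came from the same fragment.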
A detailed animation reviewing the process of Illumina sequencing can be found here
We will be working with publicly available data for a strain named Staphylococcus aureus MRSA177. The raw data were generated by the Genome Center at Washington University School of Medicine in St. Louis as part of the Human Microbiome Project - a large initiative to better understand human-associated microbes. The data files were imported from the NCBI Sequence Read Archive (SRA), which is a primary US repository for DNA sequence data. These data are from Accession SRX036759.
The data import has already been carried out using the "Import SRA File as Reads From Web" tool.
During the sequencing process, a variety of sources of error can make the instrument less confident that it has called the correct base at a given position. The instrument records this confidence for each base as a 'quality score'. The general expected pattern is that sequence quality is good at the beginning of a read and gradually declines over its length. Low-quality regions of a sequence are thus more likely to contain errors.
To examine the quality of our reads, let's use a tool called FastQC. Take a look below at the output from this tool. It shows the overall distribution of quality scores, on a scale from 0-40 (40 is best, 0 is worst). The 'green' range is considered generally acceptable quality scores, while sequences in the yellow to red region have a much higher probability of containing errors.
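Quality scores on this scale are Phred scores, which relate to the estimated probability P that a base call is wrong by Q = -10 * log10(P). The short Python sketch below converts scores to error probabilities and decodes a quality string. The `offset=33` encoding (used by modern Illumina FASTQ files) is an assumption about how the data are stored, and the example string is made up.

```python
def error_probability(q):
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]


# A score of 40 means roughly a 1-in-10,000 chance of error; 20 means about 1 in 100.
for q in (40, 30, 20, 10):
    print(q, error_probability(q))

# Made-up quality string for a three-base read
print(decode_quality("I5!"))  # [40, 20, 0]
```

Because the scale is logarithmic, a drop of 10 points means the base is ten times more likely to be wrong.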
Note that we imported paired sequence reads. You can look at the results from the forward and reverse sequences independently by using the "Page 1" (Forward) and "Page 2" (Reverse) buttons in the report below. For now, focus on the "Per base sequence quality" plot.
Q1) How does the quality profile of the forward reads compare to that of the reverse reads?
With genome assemblies, errors in the input reads can decrease the quality of the final assembly.
Remember that the distributions you examined above were averages across millions of reads - even if the population had overall good quality, individual reads can still have errors. To get rid of the low-quality regions, we will use a tool called Trimmomatic. It goes through every individual read from the sequencer, and if stretches of a read fall below a threshold quality score, the tool removes them from the dataset.
Note that this step does not delete every single low-quality base - the assembly algorithm expects that some errors occur. It is instead focused on removing larger regions of reads, or even entire reads, that are not good enough to use.
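The general idea of quality trimming can be sketched in a few lines of Python. This is a simplified illustration loosely modeled on a sliding-window approach, not Trimmomatic's actual implementation. The window size, quality threshold, minimum length, and example read are all assumed values chosen for demonstration.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut a read at the start of the first window whose mean quality is below min_avg."""
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg:
            # Keep only the bases before the low-quality window.
            return seq[:start], quals[:start]
    return seq, quals  # no low-quality window found; keep the whole read


def keep_read(seq, min_len=36):
    """Discard reads that became too short to be useful after trimming."""
    return len(seq) >= min_len


# Made-up read whose quality drops off toward the end
seq = "ACGTACGTAC"
quals = [38, 38, 37, 36, 30, 12, 10, 8, 5, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq)             # ACGTA
print(keep_read(trimmed_seq))  # False: shorter than the 36-base minimum
```

Averaging over a window is what lets a single poor base survive inside an otherwise good region, while a sustained drop in quality gets cut.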
Q2) Of the total reads we started with, how many are now left after trimming?
Note: Remember that our sequences originally came in pairs! Thus, when the software describes 'read pairs' or instances where there are 'both surviving', you need to multiply that value by 2 to get the actual number of individual sequence reads involved.
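The arithmetic can be sketched as follows. All of the numbers are hypothetical and are not the values from this dataset. The sketch also assumes the trimming report lists reads whose mate was discarded (forward-only and reverse-only survivors), which count once each rather than twice.

```python
def surviving_reads(both_surviving, forward_only=0, reverse_only=0):
    """Total individual reads left after trimming paired data.

    Each surviving pair contributes two reads; a read whose mate was
    dropped contributes one.
    """
    return 2 * both_surviving + forward_only + reverse_only


# Hypothetical report values, for illustration only
input_pairs = 1_000_000
total_input_reads = 2 * input_pairs
remaining = surviving_reads(both_surviving=900_000, forward_only=50_000, reverse_only=10_000)

print(remaining)                              # 1860000
print(f"{remaining / total_input_reads:.1%}")  # fraction of reads kept
```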
Let's check and see what the trimming step did to our input reads by running the FastQC tool again.
Now that all of our raw reads have been quality filtered, we're ready to put them together into a genome! First, we will use an algorithm called IDBA-UD to assemble the genome of our Staphylococcus aureus strain. Note that assembly algorithms can sometimes take a while to run, so this has already been run for you.
Most modern assembly algorithms that work with short-read sequence data are based on the idea of a de Bruijn graph. In essence, this approach takes each individual sequence read and breaks it up into all possible subsequences of a given length (k, typically an odd number); these subsequences are called k-mers. A 'graph' of the relationships between all k-mers in the original reads is created, which allows the algorithm to rapidly identify overlaps between subsequences and to reconstruct the original sequence. Note that this is a gross oversimplification of assembly - when working with real data, assemblers must handle all sorts of different errors, repeated sequences, and other challenges in the data.
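A toy version of this idea can be written in Python. The sketch below is an illustration under strong assumptions: error-free reads, no repeats, and a known starting point. It is not how IDBA-UD or any production assembler is implemented, and the reads, function names, and the even k=4 (chosen only to keep the example small) are made up.

```python
from collections import defaultdict


def kmers(read, k):
    """Break a read into all overlapping subsequences of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer links its first k-1 bases to its last k-1 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from `start` until a branch, dead end, or loop."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:
            break
        contig += next_node[-1]  # each step adds one new base
        seen.add(next_node)
        node = next_node
    return contig


# Three made-up overlapping reads from the sequence ATGGCGTGCA
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCA
```

The walk stops wherever a node has more than one outgoing edge, which is one reason repeated sequences tend to break an assembly into separate contigs.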
During assembly, the algorithm pieced together as much of the DNA as possible into contiguous stretches (contigs). To figure out how well an assembly went, we first consider some descriptive statistics of what we got back. Take a look at the output from the QUAST tool, which describes some of the key results we consider when evaluating an assembly. There are a lot of numbers here, but let's focus on a few key ones:
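Metrics commonly reported by QUAST include the number of contigs, the total assembly length, the largest contig, and N50 (the length of the contig at which the longest contigs, added up in order, first cover at least half of the total assembly). Whether these are the exact metrics highlighted in this module is an assumption. The Python sketch below shows how such statistics can be computed from a list of contig lengths, using made-up numbers.

```python
def n50(lengths):
    """Length of the contig at which the longest contigs first cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def summarize(lengths):
    """Collect a few basic descriptive statistics for an assembly."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Made-up contig lengths, for illustration only
contig_lengths = [100, 80, 50, 30, 20]
print(summarize(contig_lengths))
# {'contigs': 5, 'total_length': 280, 'largest_contig': 100, 'n50': 80}
```

In general, fewer contigs and a larger N50 suggest a more contiguous assembly, though neither number says anything about whether the contigs are correct.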
But, IDBA-UD isn't the only tool out there to assemble genomes. Other assemblers use different approaches to handle the many complications that can arise with trying to piece all of these sequence reads together, make different assumptions about the nature of your input data, or are better optimized for specific types of assembly challenges. Let's try another one called SPAdes. Again, for time reasons this has already been done for you, using the same set of trimmed input data files as above.
Most bacterial genomes are found in the form of one circular chromosome (sometimes along with one or more plasmids). But shotgun sequencing doesn't always - and in fact usually doesn't - yield a single sequence. Instead, a genome could be reassembled into multiple contigs. It's as if you have put together different sections of your jigsaw puzzle, but still aren't sure how those sections fit together in relation to each other. A genome that is completely put together (i.e., each chromosome's entire sequence, from start to finish, is in one continuous piece) is known as a 'complete' or 'closed' genome; if each chromosome's sequence is made up of multiple smaller segments, where the relationship between the segments is not completely resolved, we refer to it as a 'draft' genome.
Q3) If a bacterium had one circular chromosome and two plasmids, how many contigs would you expect to come out of the assembly if it was closed?
Q4) How many contigs - contiguous stretches of DNA - are in each of the two assemblies? Does this mean that the S. aureus strain has that many chromosomes in the cell?
Q5) Looking at the QUAST metrics from the two assemblies, which one do you think is the 'better' assembly, and why?
Note the name of the assembly object you've chosen as the winner - you'll need that in Part 2.
CQ1) A labmate provides you with a 'draft' quality genome sequence of a new Staphylococcus aureus strain. After thoroughly searching through the genome, your colleague is unable to find any antibiotic resistance genes/mutations, and therefore concludes that this strain must be Methicillin-sensitive. Given what you now know, do you agree or disagree with this conclusion, and why?
Now you have a reasonable assembly of the DNA sequence of the genome. To help you figure out what it means, next we'll move on to Module 2: Annotation!