A quality control application for high throughput sequence data.
FastQC is a quality control application that runs a series of quality control checks on raw sequence data in FASTQ format generated by high-throughput sequencing platforms such as Illumina and ABI SOLiD. It outputs a comprehensive multi-page report in HTML format on the composition and quality of the reads, with one page for each set of reads (e.g. Single End, Paired End: forward, Paired End: reverse). The report can be viewed inside the Narrative or opened as a new web page, which can also be downloaded.
The HTML report includes results from the modules run by FastQC and gives a quick assessment of each result, flagged as normal (green checkmark), slightly abnormal (orange triangle), or very unusual (red cross). The modules included in the report are as follows:
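To make the input format concrete, the following is a minimal Python sketch of the kind of summary that can be computed from FASTQ records. It is not FastQC's implementation. It assumes four-line records and Phred+33 quality encoding, and the function name `basic_stats` and the sample reads are hypothetical.

```python
import io

def basic_stats(handle, phred_offset=33):
    """Summarize four-line FASTQ records read from a text handle.

    Assumes Phred+33 quality encoding unless another offset is given.
    """
    total = bases = gc = qual_sum = 0
    min_len, max_len = None, 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        total += 1
        bases += len(seq)
        gc += sum(1 for b in seq.upper() if b in "GC")
        qual_sum += sum(ord(c) - phred_offset for c in qual)
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
    return {
        "total_sequences": total,
        "min_length": min_len,
        "max_length": max_len,
        "percent_gc": 100.0 * gc / bases if bases else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
    }

# Two made-up reads: 'I' encodes quality 40 and '!' encodes quality 0.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\n!!!!!!\n"
stats = basic_stats(io.StringIO(example))
print(stats)
```

For a real file, the same function could be called on an open text handle, for example `basic_stats(open(path))`, where `path` points to an uncompressed FASTQ file.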
- Basic Statistics: provides introductory compositional statistics such as filename, file type, encoding, total sequences, number of sequences flagged as poor quality, sequence length and %GC for the analyzed read.
- Per Base Sequence Quality: displays an overview of the range of quality values across all bases at each position in the FASTQ file. The graph is vertically partitioned into three quality ranges: good (green), reasonable (orange), poor (red).
- Per Sequence Quality Scores: displays the quality score distribution over all the sequences, which allows users to see if a subset of the sequences have universally low-quality values.
- Per Base Sequence Content: displays the proportion of each DNA base (A, T, C, G) called at a given position in all the sequences.
- Per Sequence GC Content: displays the distribution of GC content, measured across the whole length of each sequence, and compares it to a modeled normal distribution of GC content.
- Per Base N Content: displays the percentage of base calls at each position for which an N was called. An N at a given position indicates that the sequencer could not make a conventional base call with sufficient confidence.
- Sequence Length Distribution: displays the distribution of fragment sizes across all sequences and is highly dependent on the sequencing platform.
- Sequence Duplication Levels: counts the degree of duplication for every sequence in a library and creates a plot showing the proportion of sequences with different degrees of duplication (in blue) and de-duplicated sequences (in red).
- Overrepresented Sequences: lists all of the sequences that make up more than 0.1% of the total. An overrepresented sequence implies that it is highly biologically significant, that the library is contaminated, or that the library is not as diverse as expected.
- Adapter Content: displays the proportion of sequences which have an adapter sequence at a given position. This is informative in deciding whether the sequences contain a significant amount of adapter and should be trimmed.
- Kmer Content: displays relative K-mer (k=7) enrichment over the read length for the top six K-mers. More specifically, it measures the number of each 7-mer at each position in the library and then uses a binomial test to look for significant deviations from an even coverage at all positions and reports the 7-mers with positionally biased enrichment.
- FastQC source: Bioinformatics Group at the Babraham Institute, UK, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
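Several of the modules listed above reduce to simple counting over the reads. As an illustration only, the Python sketch below computes a per-position N percentage (as in Per Base N Content) and flags sequences that exceed a fraction of the total (as in Overrepresented Sequences, which uses 0.1%). The function names and sample reads are hypothetical. The sketch holds every read in memory and is not how FastQC itself is implemented.

```python
from collections import Counter

def per_base_n_content(seqs):
    """Return the percentage of N calls at each read position."""
    max_len = max((len(s) for s in seqs), default=0)
    n_counts = [0] * max_len
    totals = [0] * max_len  # reads that are long enough to cover position i
    for s in seqs:
        for i, base in enumerate(s.upper()):
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [100.0 * n / t for n, t in zip(n_counts, totals)]

def overrepresented(seqs, threshold=0.001):
    """Return {sequence: fraction} for sequences above the given fraction.

    The default threshold of 0.001 corresponds to 0.1% of all reads.
    """
    total = len(seqs)
    counts = Counter(seqs)
    return {s: c / total for s, c in counts.items() if c / total > threshold}

# Four made-up reads; a high threshold is used so the toy data is selective.
reads = ["ACGN", "ACGT", "NCGT", "ACGT"]
n_content = per_base_n_content(reads)
frequent = overrepresented(reads, threshold=0.4)
print(n_content)
print(frequent)
```

With real data the default threshold would normally be kept, and the reads would come from a FASTQ parser rather than a literal list.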
Module Commit: b7ea7b38246ac731faa2c89ea9193b662bcaf5cf