Generate a functional profile of sequence read libraries with Fama
SUMMARY
A functional profile captures the genetic potential of a microbial community for biological processes of interest, just as a taxonomic profile captures microbial diversity. This app is based on Fama, a computational tool for functional profiling of microbiomes and taxonomic profiling of functional genes.
Fama examines the genetic potential of a microbial community and the taxonomic composition of functional genes by directly mapping individual sequence reads to a curated reference set of proteins. The tool was developed for research projects focused on a specific metabolic process in microbial communities containing uncultured and phylogenetically distinct microbes with little similarity to known genomes. For such organisms, amino acid sequence comparison has an advantage over nucleotide sequence comparison for characterizing genes.
Fama runs a similarity search for translated read sequences using the fast aligner DIAMOND and customized databases of reference proteins. After the similarity search, all hits found by DIAMOND are filtered by amino acid identity (AAI, %) using family-specific thresholds. Top hits that pass the filter are counted for functional and taxonomic assignment.
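The filtering and counting step described above can be illustrated with a minimal Python sketch. It is not Fama's actual implementation: the threshold values, the protein-to-family mapping and the function names are hypothetical, and the input is assumed to be the 12-column BLAST-style tabular layout that DIAMOND can produce (read ID, reference ID and percent identity in the first three columns, bit score in the last).

```python
from collections import Counter

# Hypothetical family-specific identity thresholds (percent); not Fama's real values.
THRESHOLDS = {"NarG": 50.0, "NirK": 60.0}
DEFAULT_THRESHOLD = 50.0

# Hypothetical mapping from reference protein ID to functional family.
PROTEIN_FAMILY = {"ref1": "NarG", "ref2": "NirK", "ref3": "NirK"}


def parse_hits(lines):
    """Parse 12-column BLAST-style tabular lines into small dictionaries."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        yield {
            "read": fields[0],
            "protein": fields[1],
            "identity": float(fields[2]),
            "bitscore": float(fields[11]),
        }


def count_top_hits(lines):
    """Filter hits by family-specific identity, keep the best hit per read, count by family."""
    best = {}  # read ID -> (family, bit score) of the best passing hit
    for hit in parse_hits(lines):
        family = PROTEIN_FAMILY.get(hit["protein"])
        if family is None:
            continue  # reference protein without a known family
        if hit["identity"] < THRESHOLDS.get(family, DEFAULT_THRESHOLD):
            continue  # hit fails the family-specific threshold
        current = best.get(hit["read"])
        if current is None or hit["bitscore"] > current[1]:
            best[hit["read"]] = (family, hit["bitscore"])
    return Counter(family for family, _ in best.values())


EXAMPLE = [
    "read1\tref1\t72.0\t40\t11\t0\t1\t120\t10\t49\t1e-10\t80.0",
    "read1\tref2\t55.0\t40\t18\t0\t1\t120\t10\t49\t1e-05\t90.0",  # below the NirK threshold
    "read2\tref3\t65.0\t40\t14\t0\t1\t120\t5\t44\t1e-08\t70.0",
    "read3\tref2\t40.0\t40\t24\t0\t1\t120\t5\t44\t1e-02\t30.0",  # below the NirK threshold
]
counts = count_top_hits(EXAMPLE)
```

In this toy input, read1 is counted for NarG even though its NirK hit has a higher bit score, because that hit does not pass the NirK identity threshold. The sketch filters before ranking to match the order of steps in the description above.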
For comparison between functions and between samples, raw read counts are normalized by library size, target gene size and predicted average genome size in the sample. For normalization by average genome size, Fama employs the MicrobeCensus tool. The normalization metric for single-read libraries is ERPKG (number of reads per kb of effective gene length per genome equivalent):
ERPKG = (reads mapped to gene) / (effective gene length in kb) / (genome equivalents),
where
effective gene length = (actual gene length) + (read length) - 2 * (minimal alignment length) + 1,
genome equivalents = (number of reads in library) / (average genome size)
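As a worked example, the ERPKG formula can be written as a short Python sketch. The function names and the numbers are hypothetical, and the sketch assumes that all lengths are given in the same unit (base pairs here) and that the average genome size comes from a separate estimate such as the one MicrobeCensus provides. It follows the formula as stated above and is not taken from the Fama source code.

```python
def effective_gene_length(gene_length, read_length, min_alignment_length):
    """Effective gene length, in the same unit as the inputs (bp assumed)."""
    return gene_length + read_length - 2 * min_alignment_length + 1


def genome_equivalents(reads_in_library, average_genome_size):
    """Genome equivalents as defined in the text: library size / average genome size."""
    return reads_in_library / average_genome_size


def erpkg(reads_mapped, gene_length, read_length, min_alignment_length,
          reads_in_library, average_genome_size):
    """Reads per kb of effective gene length per genome equivalent."""
    length_kb = effective_gene_length(
        gene_length, read_length, min_alignment_length) / 1000.0
    return reads_mapped / length_kb / genome_equivalents(
        reads_in_library, average_genome_size)


# Hypothetical numbers for illustration only.
eff_len = effective_gene_length(1500, 150, 45)  # 1500 + 150 - 90 + 1 = 1561
score = erpkg(
    reads_mapped=50,
    gene_length=1500,
    read_length=150,
    min_alignment_length=45,
    reads_in_library=2_000_000,
    average_genome_size=4_000_000,
)
print(f"effective length: {eff_len} bp, ERPKG: {score:.2f}")
```

With these numbers the effective gene length is 1561 bp (1.561 kb) and the library corresponds to 0.5 genome equivalents, so the score is 50 / 1.561 / 0.5.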
The normalization metric for paired-end read libraries is EFPKG (number of fragments per kb of effective gene length per genome equivalent), which is calculated similarly to ERPKG, but with fragment count instead of read count and with some minor differences in the calculation of effective gene length.
If normalized scores cannot be calculated (for example, because the number of reads is too small), only read counts and fragment counts are reported for single-read and paired-end libraries, respectively.
INPUT
Read Profiling requires unassembled short reads as input. Multiple read libraries can be analyzed in a single run. For better results, adapters should be trimmed and low-quality sequences should be filtered out. All input read libraries must be of the same type, i.e. either single-read or paired-end libraries.
REFERENCE DATA
Datasets of reference proteins were prepared by searching the SEED database for functional roles of interest, followed by additional consistency checks. Those checks include identification and removal of incomplete proteins and redundant sequences. The reference datasets therefore include proteins from SEED genomes, with the exception of the RP-L6 dataset, which contains proteins from metagenome-assembled genomes. A complete list of functional families can be found here.
Reference data v.1.4 includes three reference datasets:
- nitrogen cycle enzymes dataset for functional and taxonomic profiling of nitrate/nitrite/ammonia metabolic genes
- 30 families of universal single-copy marker proteins from complete bacterial and archaeal genomes for taxonomic profiling
- ribosomal protein L6 sequences from genomes of cultivated bacteria and metagenome-assembled genomes for fast taxonomic profiling of uncultured organisms
OUTPUT
The output of the Fama Read Profiling app includes a report in HTML format, an interactive profile plot for each sample, a functional profile, a filtered read library and a link to a zip archive with Excel spreadsheets and interactive plots.
The HTML report contains a "Run Info" tab with a summary of results and three tabs for each read library: "Functional profile", "Functional groups" and "Taxonomy profile". The "Functional profile" tab displays the normalized score, raw read count and average amino acid identity % for each function. The "Functional groups" tab displays normalized scores, raw read counts and average amino acid identity % for functions combined into more general functional groups. The "Taxonomy profile" tab displays normalized scores by function and by taxon.
Interactive Krona plots are generated for each sample. A Krona file contains taxonomic profiles displayed as hierarchical circular plots, one plot for each function. The score of each taxon is represented by the angle of its sector, and amino acid identity % is represented by color.
The functional profile generated by the app contains scores for each function in each of the input samples. To generate a heatmap-like visualization of the functional profile, use the view_FamaFunctionalProfile function.
The filtered read library generated by the app contains the reads from all libraries that mapped to at least one function. This read library can be used for assembly of functional genes of interest when low abundance of individual genomes or a very large library size precludes conventional metagenome assembly.
The output zip archive contains Excel spreadsheets with a combined functional profile for all samples, combined function/taxonomy profiles for all samples, and a detailed function/taxonomy profile for each sample (reporting normalized score, raw read count and average AAI % for each taxon). In addition, the archive contains interactive Krona plots for all samples.
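The idea behind the filtered read library can be sketched in Python as follows. This is an illustration only, not the app's code: the function name is hypothetical, the input is assumed to be plain four-line FASTQ records, and the set of mapped read IDs is assumed to come from an earlier mapping step.

```python
def filter_fastq(lines, mapped_ids):
    """Yield four-line FASTQ records whose read ID is in mapped_ids."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # The read ID is the first token of the header line, without the leading "@".
            read_id = record[0][1:].split()[0]
            if read_id in mapped_ids:
                yield record
            record = []


# Hypothetical two-read library; only read2 mapped to a function.
fastq_lines = [
    "@read1 sample=A", "ACGTACGT", "+", "IIIIIIII",
    "@read2 sample=A", "TTGACCAA", "+", "IIIIIIII",
]
kept = list(filter_fastq(fastq_lines, {"read2"}))
```

The same generator works on an open file handle, since it only iterates over lines.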
Additional resources
Team members who implemented the App in KBase: Alexey Kazakov. For questions, please contact us.
Related Publications
- Kazakov A, Novichkov P. Fama: a computational tool for comparative analysis of shotgun metagenomic data. Great Lakes Bioinformatics conference (poster presentation). 2019. https://iseq.lbl.gov/mydocs/fama_glbio2019_poster.pdf
- Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12: 59-60. doi: 10.1038/nmeth.3176. Publication about third-party program used by Fama. https://pubmed.ncbi.nlm.nih.gov/25402007/
- Nayfach S, Pollard KS. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biology. 2015;16: 51. doi: 10.1186/s13059-015-0611-7. Publication about third-party program used by Fama. https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/25853934/
- Ondov B, Bergman NH et al. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12: 385. doi: 10.1186/1471-2105-12-385. Publication about third-party program used by Fama. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3190407/
App Specification:
https://github.com/aekazakov/FamaProfiling/tree/d9db15ea217e3be2aab65c356564a6d345b4f410/ui/narrative/methods/run_FamaReadProfilingModule
Commit: d9db15ea217e3be2aab65c356564a6d345b4f410