Improved Analysis of Metagenomes through the application of Read-Sized Profile HMMs to Marker Gene Subsequences

This project is funded by an R21 grant from the NIH National Institute of Allergy and Infectious Diseases.

The study of the human microbiome, with its multitudes of host-associated organisms, holds great promise for increasing our understanding of human health and disease. Because metagenomic sequence data are fragmented and unlinked from genome-of-origin information, the particular challenge of metagenomics is to provide reliable functional annotation and taxonomic assignment. Here we address these issues by leveraging existing profile hidden Markov models (HMMs) of functionally characterized gene families. Instead of relying on fragment matches to full-length genes or gene models, we will determine which segments of gene models are capable of high-quality annotations of function and origin, and focus on those. By this approach, the portions of the gene models that have low sequence conservation or variable insertion/gap length (tending towards low recall), or that are composed of sequence shared among multiple gene families and functions (tending towards low precision), are systematically eliminated, increasing overall signal-to-noise. The high-quality segments of the models (“mini” HMMs) will be our analytical tools. Using these methods we hope to provide a robust approach that frees metagenomics from the limitations of assembly-first strategies, and thereby provide access to information about the numerous low-abundance species in complex biological samples. We will use bacterial single-copy genes as taxonomic markers, and will produce a database of these genes from high-quality genomes. We expect to identify ~80 suitable marker genes, determined for several thousand genomes. For each of these genes, we will produce a corresponding reference phylogenetic tree. In the course of producing these resources, the existing models (TIGRFAMs and Pfam HMMs) will be updated based on the current set of reference genomes and a consistent, state-of-the-art construction process. These resources, and any software we produce, will be made available through our public website.
With these methods and resources, we will obtain taxonomic profiles, investigate genes of interest, and devise methods for linking those genes to the taxa in the profile. We will use real and synthetic metagenomes to validate the methods and to establish statistical confidence metrics for our results.

Specific Aims
To fulfill their promise, microbiome studies must not only produce accurate lists of the organisms present in a sample, but also attribute specific functions to those taxa. For instance, knowing which organisms are pro-inflammatory or drug-resistant may be critical in understanding and treating microbiome-associated disease. Current methods for metagenomic data analysis rely on information from all the reads in the data, even those that carry ambiguous taxonomic or functional information, thereby leading to errors in the assignment of function and taxonomic origin. Such errors are particularly relevant when the assembly of large genomic segments is not possible, as is the case for the low-abundance taxa that comprise the bulk of microbiome samples. Here we propose that profile HMMs can be applied to gene fragments to systematically identify discrete read-sized markers that provide robustly accurate taxonomic profiles and quantitative measures of biological functions. We will develop the following resources and methods to enable this new approach:

Aim 1: Develop a single copy gene-based, read-by-read method for the accurate and quantitative determination of bacterial taxa present in metagenome samples.
We will first develop an up-to-date database of single-copy bacterial genes and of the high-quality sequenced genomes that contain all such genes. Using this reference and protein sequence HMM methods, we will discover short marker regions and corresponding “mini” HMMs (mHMMs) that have superior recall and precision in the detection of single-copy genes from read-sized sequence fragment data. Based on the mHMM performance, we will identify regions (PhyloBlocks) that have the additional quality of being able to discriminate most bacterial genera in phylogenetic placement procedures. For each marker gene, a reference phylogenetic tree will be calculated from the concatenated regions. Existing tree-placement methods will be used with these alignments to make phylogenetic placements of mHMM-derived metagenomic sequences and to determine confidence values for those assignments. Finally, we will use synthetic and natural metagenome datasets to test, calibrate and validate these methods.
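
As an illustration only, the selection of high-recall, high-precision model segments described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's method: the `WindowStats` record, the `select_segments` function, the threshold values (0.95 precision, 0.90 recall) and the counts in the example are all hypothetical. It assumes that read-sized fragments of known origin have already been scored against candidate windows of a full-length model, so that each window has counts of true positives, false positives and false negatives.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Benchmark counts for one candidate window of a full-length model (hypothetical)."""
    start: int  # first model position covered by the window (inclusive)
    end: int    # last model position covered by the window (inclusive)
    tp: int     # fragments from the target family that the window detects
    fp: int     # fragments from other families that the window detects
    fn: int     # fragments from the target family that the window misses


def precision(w: WindowStats) -> float:
    called = w.tp + w.fp
    return w.tp / called if called else 0.0


def recall(w: WindowStats) -> float:
    actual = w.tp + w.fn
    return w.tp / actual if actual else 0.0


def select_segments(windows, min_precision=0.95, min_recall=0.90):
    """Keep windows passing both thresholds, then merge overlapping or
    adjacent windows into (start, end) segments. Thresholds are assumed values."""
    passing = sorted(
        (w for w in windows
         if precision(w) >= min_precision and recall(w) >= min_recall),
        key=lambda w: w.start,
    )
    merged = []
    for w in passing:
        if merged and w.start <= merged[-1][1] + 1:
            # Overlaps or abuts the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], w.end))
        else:
            merged.append((w.start, w.end))
    return merged


# Made-up counts for five overlapping 30-position windows.
windows = [
    WindowStats(1, 30, tp=95, fp=1, fn=5),    # passes both thresholds
    WindowStats(16, 45, tp=92, fp=2, fn=8),   # passes both thresholds
    WindowStats(31, 60, tp=60, fp=0, fn=40),  # low recall (e.g. poorly conserved)
    WindowStats(46, 75, tp=97, fp=30, fn=3),  # low precision (e.g. shared domain)
    WindowStats(61, 90, tp=98, fp=0, fn=2),   # passes both thresholds
]
segments = select_segments(windows)
print(segments)
```

Each retained segment would then be a candidate region from which a mini HMM could be built; how the windows are generated and benchmarked is outside the scope of this sketch.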

Aim 2: Apply the methods of Aim 1 to taxonomically restricted genes of interest from metagenome samples, and develop methods for the association of those genes with the taxa identified through the methods developed in Aim 1.
Genes of interest in human microbiome studies (for instance, toxins, antibiotic resistance genes, inflammatory factors or oxidative stress-response genes) will not be shared by all bacterial taxa and, unlike universal genes, may have been subject to lateral gene transfer. Consequently, genes may be discovered that do not appear to correspond closely to taxa in the set of known genes of interest, due to sparsity in the reference trees, or to gene loss events and the resulting incomplete knowledge of the gene’s true taxonomic profile. Conversely, due to lateral gene transfer, a gene of interest may appear to be closely related to a member of the gene-of-interest profile, yet actually reside within the genome of an entirely distinct organism. We will investigate methods for constraining the taxonomic placement of reads from genes of interest based on the global taxonomic profile of the sample (from single-copy marker genes), and the inferred abundances of the organisms encoding the genes of interest.
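
One simple way to picture such a constraint, offered only as a hedged illustration since the methods themselves remain to be investigated, is to treat the sample's taxonomic profile as a prior over the candidate taxa for a read. In the sketch below, the function name `constrain_placement`, the `pseudocount` parameter and all numeric values are hypothetical. The per-taxon placement weights are assumed to come from an existing tree-placement tool, and the sample profile from the single-copy marker analysis of Aim 1. The pseudocount leaves some weight for taxa absent from the profile, so that laterally transferred genes are not excluded outright.

```python
def constrain_placement(placement_weights, sample_profile, pseudocount=0.01):
    """Reweight per-taxon placement weights for one read by the sample's
    taxonomic profile and renormalize (a Bayes-style sketch, assumed form).

    placement_weights: taxon -> support from phylogenetic placement alone
    sample_profile:    taxon -> relative abundance from single-copy markers
    pseudocount:       prior mass given to every taxon, including those
                       not detected in the profile (assumed value)
    """
    scores = {
        taxon: weight * (sample_profile.get(taxon, 0.0) + pseudocount)
        for taxon, weight in placement_weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all placement weights are zero")
    return {taxon: score / total for taxon, score in scores.items()}


# Made-up example: placement alone slightly favors Escherichia, but only
# Klebsiella among the candidates was detected in the sample profile.
placement_weights = {"Escherichia": 0.5, "Klebsiella": 0.4, "Salmonella": 0.1}
sample_profile = {"Klebsiella": 0.30, "Bacteroides": 0.70}

constrained = constrain_placement(placement_weights, sample_profile)
best_taxon = max(constrained, key=constrained.get)
print(best_taxon, constrained)
```

In this toy case the constrained assignment shifts to Klebsiella. A fuller treatment would also need to account for the inferred abundances of the organisms encoding the gene, which this sketch does not attempt.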