Assembly and Analysis Software for Exploring the Human Microbiome
The main challenge in metagenomic assembly arises from the heterogeneous nature of metagenomic data. Most environments contain an uneven representation of the member species. Furthermore, the organisms in the environment frequently belong to clusters of closely related strains whose genomes are largely similar but differ due to mobile genetic elements and point mutations. These characteristics make it virtually impossible to construct a single assembly of each organism present in a sample. Instead, many organisms will be under-sampled and assembled in a highly fragmented form, while groups of closely related organisms will end up assembled together into a polymorphic structure that can be modeled as a graph.
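To illustrate how closely related strains produce a polymorphic graph, here is a minimal sketch using a de Bruijn graph, one common assembly-graph formulation (the project text does not say which graph model is used). The two strain sequences, the value of k, and the function names are all made up for illustration.

```python
from collections import defaultdict

def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branch_points(graph):
    """Nodes with more than one successor: places where the strains diverge."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)

# Two hypothetical strains that differ by a single point mutation (C vs. G).
strain_a = "ATCGGACTTAGC"
strain_b = "ATCGGAGTTAGC"

graph = build_de_bruijn([strain_a, strain_b], k=4)
print(branch_points(graph))   # the node just before the variant position
print(sorted(graph["GGA"]))   # the two alternative continuations
```

The shared prefix and suffix of the two strains collapse into single paths, while the variant position opens a "bubble" with one branch per strain. Real data adds sequencing errors, repeats, and uneven coverage on top of this picture.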
We are currently exploring several approaches for analyzing and visualizing metagenomic assembly graphs, including procedures for graph simplification, for detection of genomic polymorphisms (work related to our research on the analysis of genomic variation from assembly information), and new approaches for repeat identification and resolution.
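As a rough sketch of what graph simplification can look like, the code below merges maximal non-branching paths of a small de Bruijn-style graph into single sequences (often called unitigs). This is a generic textbook step, not necessarily the procedure used in this project; the toy graph and the name `compact_paths` are assumptions made for the example.

```python
from collections import defaultdict

def compact_paths(graph):
    """Merge maximal non-branching paths into single sequences.

    `graph` maps each (k-1)-mer to the set of its successors.
    Isolated cycles made only of simple nodes are ignored in this sketch.
    """
    indeg = defaultdict(int)
    nodes = set(graph)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
            nodes.add(s)

    def is_simple(n):
        # Exactly one way in and one way out: safe to merge through.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    unitigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths only start at branching or terminal nodes
        for succ in sorted(graph.get(node, ())):
            path, cur = node + succ[-1], succ
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                path += cur[-1]
            unitigs.append(path)
    return sorted(unitigs)

# Toy graph with one bubble: two strains diverge after GGA and rejoin at TTA.
toy_graph = {
    "ATC": {"TCG"}, "TCG": {"CGG"}, "CGG": {"GGA"},
    "GGA": {"GAC", "GAG"},
    "GAC": {"ACT"}, "ACT": {"CTT"}, "CTT": {"TTA"},
    "GAG": {"AGT"}, "AGT": {"GTT"}, "GTT": {"TTA"},
    "TTA": {"TAG"}, "TAG": {"AGC"},
}

print(compact_paths(toy_graph))
```

After compaction the twelve-edge graph reduces to four sequences: a shared prefix, two alternative branches, and a shared suffix. Two branches that start and end at the same nodes are the signature a polymorphism detector would look for.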
Gene finding in metagenomic datasets is complicated by the fragmented nature of metagenomic assemblies, which leaves many genes truncated at contig boundaries, and by the fact that many organisms are only poorly sampled, so that their low-coverage sequence carries high error rates that can introduce frameshifts. We are working on extensions of the Glimmer gene finder to accommodate these characteristics of metagenomic data.
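The effect of fragmentation on gene finding can be seen in a minimal open-reading-frame scan. The sketch below is not Glimmer and uses no statistical model; it only shows, under simplifying assumptions (forward strand only, ATG starts, standard stop codons, a hypothetical `min_len` threshold), how a gene that runs off the end of a fragment can be reported as incomplete rather than discarded.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(fragment, min_len=30):
    """Return (start, end, frame, is_complete) tuples for forward-strand ORFs.

    An ORF that reaches the end of the fragment without a stop codon is
    kept and flagged with is_complete=False.
    """
    orfs = []
    n = len(fragment)
    for frame in range(3):
        start = None
        for i in range(frame, n - 2, 3):
            codon = fragment[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame, True))
                start = None
        if start is not None:
            # No stop codon before the fragment ended: a truncated gene.
            end = frame + ((n - frame) // 3) * 3
            if end - start >= min_len:
                orfs.append((start, end, frame, False))
    return orfs

# Made-up sequences: a complete gene, and a fragment that ends mid-gene.
complete_gene = "ATG" + "GCT" * 10 + "TAA"
truncated_fragment = "CC" + "ATG" + "GCT" * 10

print(find_orfs(complete_gene))
print(find_orfs(truncated_fragment))
```

A fuller treatment would also handle genes whose start lies before the fragment, the reverse strand, and frameshifts caused by sequencing errors, which a fixed-frame scan like this one cannot follow.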
We have developed a metagenomic binning program specifically targeted at short DNA fragments (such as reads). This program, called Phymm, uses the Interpolated Markov Model framework from Glimmer to accurately classify reads as short as 100 bp. We are currently exploring whether binning reads prior to assembly can improve the quality of metagenomic analysis.
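The idea behind classifying reads with interpolated Markov models can be sketched as follows: train one model per candidate group, score a read under each, and assign it to the best-scoring model. The code below is a simplified illustration and not Phymm's implementation. In particular, the count-based interpolation weight, the `pseudo` parameter, the class name `SimpleIMM`, and the toy training sequences are assumptions chosen for brevity.

```python
import math
from collections import defaultdict

class SimpleIMM:
    """Toy interpolated Markov model over DNA, blending orders 0..max_order."""

    def __init__(self, max_order=3, pseudo=5.0):
        self.max_order = max_order
        self.pseudo = pseudo  # larger values trust sparse contexts less
        # counts[k][context][base] = times `base` followed a length-k context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        p = 0.25  # uniform fallback before any context is consulted
        for k in range(min(len(context), self.max_order) + 1):
            seen = self.counts[k].get(context[len(context) - k:])
            if not seen:
                break  # longer contexts were never observed in training
            total = sum(seen.values())
            lam = total / (total + self.pseudo)
            # Blend this order's estimate with the lower-order estimate.
            p = lam * (seen.get(base, 0) / total) + (1 - lam) * p
        return p

    def log_likelihood(self, read):
        return sum(math.log(self.prob(read[max(0, i - self.max_order):i], b))
                   for i, b in enumerate(read))

def classify(read, models):
    """Return the name of the model under which the read is most likely."""
    return max(models, key=lambda name: models[name].log_likelihood(read))

# Two made-up "genomes" with very different composition.
models = {"at_rich": SimpleIMM(), "gc_rich": SimpleIMM()}
models["at_rich"].train(["ATATTAATATTTAAATATAT" * 3])
models["gc_rich"].train(["GCGCCGGCGCCCGGGCGCGC" * 3])

print(classify("ATTATAAT", models))
print(classify("GCCGGCGC", models))
```

The interpolation lets the model fall back on shorter contexts when a long context was rarely seen in training, which matters for short reads where only a few hundred bases of evidence are available.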
This is an NSF project.