The analysis of these vast amounts of data is complicated by the fact that reconstructing large genomic segments from metagenomic reads is a formidable computational challenge. Even for single organisms, the assembly of genome sequences from sequencing reads is a complex task, primarily due to ambiguities in the reconstruction that are caused by genomic repeats. In metagenomic data, additional challenges arise from the non-uniform representation of genomes in a sample as well as from the genomic variants between the sequences of closely related organisms. Despite advances in metagenomic assembly algorithms in recent years, the computational difficulty of the assembly process remains high and the quality of the resulting assemblies remains fairly low.
As a result, many analyses of metagenomic data are performed directly on unassembled reads; however, the much shorter genomic context available in individual reads leads to lower accuracy.
Reference-guided, comparative assembly approaches have previously been used to assist the assembly of short reads when a closely related reference genome was available. Comparative assembly works as follows: short sequencing reads are aligned to a reference genome of a closely related species, and their reconstruction into contigs is then inferred from their relative locations in the reference genome. This process overcomes, in part, the challenge posed by repeats, because the entire read (not just the segment that overlaps adjacent reads) provides information about its location in the genome.
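The comparative assembly procedure described above can be illustrated with a minimal sketch. It is not the MetaCompass implementation: the function names `place_read` and `comparative_assembly`, the `max_mismatches` parameter, and the brute-force Hamming-distance placement are all assumptions made for illustration, and a real pipeline would use an indexed read aligner. The sketch places each read on the reference, takes a majority vote per reference column, and breaks contigs wherever coverage drops to zero.

```python
from collections import Counter


def place_read(read, reference, max_mismatches=1):
    """Return the leftmost offset where `read` matches `reference` with the
    fewest mismatches, or None if no offset has <= max_mismatches.
    Brute-force and ungapped: a toy stand-in for a real read aligner."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def comparative_assembly(reads, reference, max_mismatches=1):
    """Lay reads out by their reference coordinates and call a consensus.
    Contigs are maximal runs of consecutive reference columns that are
    covered by at least one placed read."""
    columns = {}  # reference column -> Counter of bases observed in reads
    for read in reads:
        pos = place_read(read, reference, max_mismatches)
        if pos is None:
            continue  # read cannot be placed on this reference
        for i, base in enumerate(read):
            columns.setdefault(pos + i, Counter())[base] += 1

    contigs, current, prev = [], [], None
    for col in sorted(columns):
        if prev is not None and col != prev + 1:
            # Coverage gap: close the current contig and start a new one.
            contigs.append("".join(current))
            current = []
        # Consensus base comes from the reads, not from the reference.
        current.append(columns[col].most_common(1)[0][0])
        prev = col
    if current:
        contigs.append("".join(current))
    return contigs


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATAGGCTCA"
    reads = ["ACGTTGCA", "TGCAAGGC", "CGATTGGC", "GGGGGGGG"]
    print(comparative_assembly(reads, reference))
```

In this toy input the first two reads overlap and merge into one contig, the third read carries a single-base difference from the reference that is preserved in its contig, and the fourth read cannot be placed and is ignored. Because each read is positioned using its full length, the layout never needs to resolve read-to-read overlaps, which is the property that helps with repeats.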
Currently, tens of thousands of bacterial genomes have been sequenced, and the number is expected to grow rapidly in the near future. These sequenced genomes provide a great resource for performing comparative assembly of metagenomic sequences. However, they have yet to be used for this purpose, largely because of the tremendous computational cost of aligning the reads from a metagenomic project to the entire reference collection of bacterial genomes.
MetaCompass is the first software package for the reference-assisted assembly of metagenomic data. We rely on an indexing strategy to quickly construct sample-specific reference collections, and show that this approach effectively complements de novo assembly methods.