The past decade has seen explosive growth in the collection and use of metagenomic data—genetic material (DNA or RNA) extracted directly from environmental or clinical samples such as soil, water or the human gut. While the surge is fueling breakthroughs in disease tracking, antimicrobial resistance analysis and enzyme discovery, it is also creating significant data-management challenges.
Researchers at the University of Maryland are working to address those challenges with support from two grants from the National Institutes of Health totaling $5.1 million. The team is developing open-source software to assemble complex metagenomic datasets and is conducting the first systematic analysis of the ways in which data integrity in databases affects the accuracy of analysis pipelines.
“The overall goal is to create new analytic pipelines that can more effectively mine data to improve human health and combat disease,” says Mihai Pop, professor of computer science and principal investigator on both awards.
Metagenomic analysis increasingly relies on advanced technologies, including artificial intelligence, which have significantly improved taxonomic profiling (identifying “who is there”) within microbial communities, and functional profiling (determining “what they are doing”).
“But AI is not foolproof, and correctly profiling genetic material is imperative so clinicians can make better diagnoses and offer more personalized therapeutic treatments,” says Pop, who has an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS) and serves as co-director of the University of Maryland Center of Excellence in Microbiome Sciences.
Advancing Metagenomic Assembly
One of the awards—a $2.7 million grant from the National Institute of Allergy and Infectious Diseases—will support development of new software to advance metagenomic assembly, the process of stitching sequencing reads into contiguous DNA sequences and metagenome-assembled genomes.
Interest in reconstructing entire genomes from metagenomic data is growing because nearly complete genome sequences provide rich biological insight, Pop notes. However, existing tools often struggle to integrate data from different sequencing technologies or multiple metagenomic samples.
Frequently, multi-sample analyses require co-assembly of all sequencing reads, a computationally intensive approach that does not scale well. Integrating different sequencing technologies typically involves assembling data with one platform and then mapping additional data onto that assembly.
Pop’s team—including a new postdoctoral researcher and a recently hired bioinformatics engineer—will develop novel assembly reconciliation techniques that integrate data from multiple assemblies without requiring co-assembly and that treat all individual assemblies as equivalent.
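The core idea of assembly reconciliation, merging contigs from separate assemblies on equal footing rather than re-assembling all reads together, can be illustrated with a toy sketch. This is a hypothetical simplification, not the team's software: the function names are invented, the merge uses exact suffix-prefix overlaps, and a real tool must cope with sequencing errors, repeats, and graph-scale data.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of contig a that exactly matches a
    prefix of contig b, if it is at least min_len; otherwise 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def reconcile(assembly1, assembly2, min_len=20):
    """Toy reconciliation of two assemblies, treated as equals:
    greedily join contigs that share a long exact overlap; contigs
    with no partner pass through unchanged."""
    contigs = sorted(assembly1 + assembly2, key=len, reverse=True)
    merged = []
    while contigs:
        current = contigs.pop(0)
        extended = True
        while extended:
            extended = False
            for i, other in enumerate(contigs):
                k = overlap(current, other, min_len)
                if k:  # current's tail matches other's head
                    current = current + other[k:]
                    contigs.pop(i)
                    extended = True
                    break
                k = overlap(other, current, min_len)
                if k:  # other's tail matches current's head
                    current = other + current[k:]
                    contigs.pop(i)
                    extended = True
                    break
        merged.append(current)
    return merged
```

For example, two assemblies that each recovered a different half of the same region, sharing an 8-base overlap, would be joined into a single longer contig, while unrelated contigs from either assembly are kept as-is.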
Technical support for the project comes from UMIACS, whose powerful computing clusters will be used to analyze assembly graphs containing hundreds of thousands of nodes and edges, combining fragmented datasets from multiple samples into more complete genomic reconstructions.
Safeguarding Public Biological Databases
The second grant, $2.4 million from the National Library of Medicine, is supporting a rigorous analysis of public biological databases, focusing on how data interact with the algorithms used in bioinformatics research.
Scientists and clinicians worldwide deposit large volumes of biological data and research results into public databases. These shared repositories—maintained by government agencies, global nonprofits and international partners—contain extensive information on protein sequences, molecular structures, DNA sequences and more.
With the volume of shared genomic data increasing exponentially in recent years—driven in part by new technologies—the risk of erroneous or corrupted submissions has grown. Such imperfections can skew research results and may have serious consequences if flawed data are used in clinical diagnoses or therapeutic decision-making.
Working with colleagues in UMIACS, including Tudor Dumitras, associate professor of electrical and computer engineering, and Brantley Hall, assistant professor of cell biology and molecular genetics, the researchers aim to identify:
- Which algorithmic paradigms are most vulnerable to errors
- Which data elements exert outsized influence on bioinformatics outputs
- How interactions between input data and reference databases affect the resilience of complex analytical pipelines
The team will also develop countermeasures to address adversarial machine-learning attacks on public databases, which Dumitras describes as a growing concern.
Building a Forensic Toolkit for Bioinformatics
As part of the second project, the UMD team plans to create tools with broad utility, including:
- Test harnesses to assess the accuracy of taxonomic and functional annotation tools
- Software to analyze biological database structure across different feature spaces (e.g., k-mer–based versus alignment-based)
- Tools to iteratively modify database labels or sequences and measure resulting accuracy changes
- Provenance-graph software to track how errors and vulnerabilities propagate through heterogeneous pipelines
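The perturbation idea behind these tools can be sketched in miniature. The example below pairs a toy k-mer-based classifier with a hypothetical `label_swap_impact` routine that mislabels part of a reference database and measures the accuracy change; the function names, classifier, and data are all illustrative assumptions, vastly simpler than real annotation tools and repositories.

```python
import random
from collections import Counter

def kmers(seq, k=4):
    """Multiset of overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def classify(query, reference, k=4):
    """Assign the label of the reference sequence sharing the most k-mers."""
    return max(reference,
               key=lambda label: sum((kmers(query, k) & kmers(reference[label], k)).values()))

def accuracy(queries, reference, k=4):
    """Fraction of (sequence, true_label) pairs classified correctly."""
    return sum(classify(q, reference, k) == lab for q, lab in queries) / len(queries)

def label_swap_impact(queries, reference, frac, k=4, seed=0):
    """Rotate labels among a fraction of reference entries (simulating
    erroneous submissions) and report accuracy before and after."""
    rng = random.Random(seed)
    labels = list(reference)
    chosen = rng.sample(labels, max(2, int(frac * len(labels))))
    corrupted = dict(reference)
    for old, new in zip(chosen, chosen[1:] + chosen[:1]):
        corrupted[old] = reference[new]  # entry now carries the wrong label
    return accuracy(queries, reference, k), accuracy(queries, corrupted, k)
```

Even this toy shows the intended measurement: a classifier that is perfect against a clean reference can be driven to zero accuracy when the labels it depends on are swapped, which is the kind of sensitivity the planned tools would quantify for real pipelines.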
Together, these resources will form a forensic and debugging toolkit enabling scientists to apply the University of Maryland’s techniques to their own bioinformatics workflows, Pop says.
“To err is human, but errors are also unavoidable in databases and computational systems,” he adds. “As humans, we have learned to anticipate and mitigate the effects of the mistakes we may make. We hope that our research program will lead to biomedical analytics systems able to draw correct conclusions from imperfect data, helping to advance human and environmental health.”
—Story by UMIACS communications group