Date |
Presenter(s) |
Institution or Advisor/Lab |
Topic & Abstract |
Location or Zoom Only |
---|
3/29/23 (Wed.) |
Christina Boucher |
University of Florida, Computer Science |
Building scalable indexes that can be efficiently queried
Abstract: Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. We later showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching but the r-index itself cannot support efficiently popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding --- but they did not say how to find those thresholds. We present another novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation can rapidly find MEMs between reads and large sequence collections of highly repetitive sequences. Compared to existing methods, ours used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, our method was less than one thousandth the size of competing indexes for large collections of human chromosomes. |
Iribe #4105 (11am-12pm) |
3/30/23 |
Yuelin Liu |
Sahinalp & Pop Labs |
Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models
Abstract: Recent studies exploring the impact of methylation in tumor evolution suggest that while the methylation status of many of the CpG sites are preserved across distinct lineages, others are altered as the cancer progresses. Since changes in methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor’s single-cell methylation lineage tree and jointly identifying lineage-informative CpG sites which harbor changes in methylation status that are retained along the lineage. We apply Sgootr on the single-cell bisulfite-treated whole genome sequencing data of multiregionally-sampled tumor cells from 9 metastatic colorectal cancer patients made available by Bian et al., as well as multiregionally-sampled single-cell reduced-representation bisulfite sequencing data from a glioblastoma patient made available by Chaligne et al.. We demonstrate that the tumor lineages constructed reveal a simple model underlying colorectal tumor progression and metastatic seeding. A comparison of Sgootr against alternative approaches shows that Sgootr can construct lineage trees with fewer migration events and more in concordance with the sequential-progression model of tumor evolution, in time a fraction of that used in prior studies. Interestingly, lineage-informative CpG sites identified by Sgootr are in inter-CpG island (CGI) regions, as opposed to CGI’s, which have been the main regions of interest in genomic methylation-related analyses. Sgootr is implemented as a Snakemake workflow, available at https://github.com/algo-cancer/Sgootr. |
CSIC #3117 |
3/30/23 |
Noor Pratap Singh |
Patro Lab |
TreeTerminus - Creating transcript trees using inferential replicate counts
Abstract: The accuracy and robustness of many types of analyses performed using RNA-seq data are directly impacted by the quality of the transcript and gene abundance estimates inferred from this data. However, a certain degree of uncertainty is always associated with the transcript abundance estimates. This uncertainty may make many downstream analyses, such as differential testing, difficult for certain transcripts. Conversely, gene-level analysis, though less ambiguous, is often too coarse-grained. To circumvent this problem, methods have proposed grouping transcripts together into distinct inferential units that should be used as a base unit for analysis. However, these methods don’t take downstream analysis into account. We introduce TreeTerminus, a data-driven approach for grouping transcripts into a tree structure where leaves represent individual transcripts and internal nodes represent an aggregation of a transcript set. TreeTerminus constructs trees such that, on average, the inferential uncertainty decreases as we ascend the tree topology. The tree provides the flexibility to analyze data at nodes that are at different levels of resolution in the tree and can be tuned depending on the analysis of interest. To obtain fixed groups for the downstream analysis, we provide a dynamic programming (DP) approach that can be used to find a cut through the tree that optimizes one of several different objectives. We evaluated TreeTerminus on two simulated and two experimental datasets, and observed an improved performance compared to transcripts (leaves) and other methods under several different metrics. |
CSIC #3117 |
3/30/23 |
Ataberk Donmez |
Kolmogorov & Pop Labs |
stRainy: assembly-based metagenomic strain phasing using long reads
Abstract: Bacterial species in microbial communities are often represented by mixtures of strains. Variation in strain genomes may have important phenotypic effects, however strain-level deconvolution of microbial communities remains challenging. Short-read approaches can be used to detect small-scale variation between strains, but fail to phase these variants into contiguous haplotypes. Recent advances in long-read metagenomics resulted in complete de novo assemblies of various bacterial species. However, current assembly approaches often suppress strain-level variation, and instead produce species-level consensus representation. Strain variants are often unevenly distributed, and regions of high and low heterozygosity may interleave in the assembly graph, resulting in tangles. To address this, we developed an algorithm for metagenomic phasing and assembly called stRainy. Our approach takes a sequence graph as input, identifies graph regions that represent collapsed strains, phases them and represents the results in an expanded and simplified assembly graph. We benchmark stRainy using simulated data and mock metagenomic communities and show that it achieves strain-level deconvolution with high completeness and low error rates, compared to the other strain assembly and phasing approaches. |
CSIC #3117 |
4/6/23 |
Jason Fan |
Patro Lab |
Spectrum preserving tilings enable sparse and modular reference indexing
Abstract: The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences ℛ so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics.
In this work, we introduce the spectrum preserving tiling (SPT), a general representation of R that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in R. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the k-mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique k-mers in R.
To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3GB to 34.6GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset. |
Iribe #4105 |
4/6/23 |
Yunheng Han |
Molloy Lab |
TREE-QMC: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees
Abstract: Summary methods are one of the dominant approaches for estimating species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant due to gene tree estimation error as well as biological processes, like incomplete lineage sorting. Here, we introduce a new summary method TREE-QMC that offers improved accuracy and scalability under these challenging scenarios. TREE-QMC builds upon the algorithmic framework of QMC (Snir and Rao 2010) and its weighted version wQMC (Avni et al. 2014). Their approach takes weighted quartets (four-leaf trees) as input and builds a species tree in a divide-and-conquer fashion, at each step constructing a graph and seeking its max cut. We improve upon this methodology in two ways. First, we address scalability by providing an algorithm to construct the graph directly from the input gene trees. By skipping the quartet weighting step, TREE-QMC has a time complexity of O(n3k) with some assumptions on subproblem sizes, where n is the number of species and k is the number of gene trees. Second, we address accuracy by normalizing the quartet weights to account for “artificial taxa,” which are introduced during the divide phase so that solutions on subproblems can be combined during the conquer phase. Together, these contributions enable TREE-QMC to outperform the leading methods (ASTRAL-III, FASTRAL, wQFM) in an extensive simulation study. We also present the application of these methods to several phylogenomics data sets. Note: preliminary results for this project were presented in RIPS 2022. This talk includes many new results, including methodological changes to improve robustness to missing data.
|
Iribe #4105 |
|