Motivation: As large haplotype panels become increasingly available, efficient string matching algorithms such as PBWT are promising for identifying shared haplotypes. However, recent mutations and genotyping errors create occasional mismatches, which pose a challenge for exact haplotype matching. Previous solutions are based on probabilistic models or seed-and-extension algorithms that passively tolerate mismatches.

Results: Here we propose a PBWT-based smoothing algorithm, P-smoother, to actively “correct” these mismatches and thus “smooth” the panel. P-smoother runs a bi-directional PBWT-based panel scan that flips mismatching alleles based on the overall haplotype matching context, which we call the IBD (identical-by-descent) prior. In a simulated panel with 4,000 haplotypes and a 0.2% error rate, we show that it can reliably correct 85% of errors. As a result, PBWT algorithms running over the smoothed panel identify more pairwise IBD segments than the same algorithms running over the unsmoothed panel. Most strikingly, a PBWT-cluster algorithm running over the smoothed panel, which we call PS-cluster, achieves state-of-the-art performance for identifying multi-way IBD segments, a problem that has challenged the computational community for years. We also show that PS-cluster is adequately efficient for UK Biobank data. P-smoother therefore opens up new possibilities for efficient error-tolerating algorithms for biobank-scale haplotype panels.

Availability: Source code is available at github.com/ZhiGroup/P-smoother.

Supplementary information: Supplementary data are available at Bioinformatics online.
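For readers unfamiliar with PBWT, the scan that P-smoother builds on can be illustrated with a minimal sketch of the standard prefix and divergence array update. This is not the P-smoother code: the function name `pbwt_arrays` and the list-of-lists panel layout are assumptions made for illustration, and the bi-directional scan and allele-flipping logic described above are not reproduced here.

```python
def pbwt_arrays(panel):
    """Scan a binary haplotype panel and return (prefix, divergence).

    panel: list of equal-length lists of 0/1 alleles, one per haplotype
           (a hypothetical in-memory layout chosen for this sketch).
    prefix: haplotype indices sorted by their reversed allele sequences.
    divergence[i]: first site of the longest match, ending at the last
           site, between prefix[i] and prefix[i - 1]; it equals the number
           of sites when the two differ at the last site.
    """
    m = len(panel)
    n_sites = len(panel[0]) if m else 0
    prefix = list(range(m))
    divergence = [0] * m
    for k in range(n_sites):
        zeros, ones = [], []          # haplotypes carrying 0 / 1 at site k
        div_zeros, div_ones = [], []  # their updated match start positions
        p = q = k + 1                 # k + 1 means "no match through site k"
        for hap, d in zip(prefix, divergence):
            p = max(p, d)
            q = max(q, d)
            if panel[hap][k] == 0:
                zeros.append(hap)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(hap)
                div_ones.append(q)
                q = 0
        # Stable partition: all 0-carriers first, then all 1-carriers.
        prefix = zeros + ones
        divergence = div_zeros + div_ones
    return prefix, divergence


panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
prefix, divergence = pbwt_arrays(panel)
print(prefix, divergence)
```

Haplotypes that share a long match ending at the current site end up adjacent in `prefix`, which is what makes a single mismatching allele inside an otherwise long match easy to spot. In this toy panel, haplotypes 1 and 3 are identical and sit next to each other with a divergence value of 0.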
While rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies use only one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. While the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage-disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10k resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising hundreds of thousands of haplotypes or more, whereas other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
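The idea of treating IBD segment boundaries as potential recombination events can be sketched as a simple binned count. This is a toy illustration under stated assumptions, not the FastRecomb estimator: the function name `boundary_rates`, the `(start, end)` base-pair tuples, and the normalization to fractions are all hypothetical, and the PBWT-block deduplication and panel smoothing described above are omitted.

```python
def boundary_rates(segments, chrom_length, bin_size):
    """Fraction of IBD segment boundaries that fall in each genomic bin.

    segments: iterable of (start, end) positions in base pairs
              (a hypothetical input format chosen for this sketch).
    Boundaries at position 0 or at chrom_length are skipped, because a
    segment that runs to the end of the chromosome was not cut there by
    a recombination event.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start, end in segments:
        for pos in (start, end):
            if 0 < pos < chrom_length:
                counts[pos // bin_size] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


segments = [(0, 25_000), (12_000, 40_000), (25_500, 50_000)]
rates = boundary_rates(segments, chrom_length=50_000, bin_size=10_000)
print(rates)
```

Bins with a larger share of boundaries are candidates for higher recombination rates. Counting every pairwise segment independently, as this sketch does, would overcount boundaries shared by many haplotype pairs, which is the redundancy the PBWT blocks mentioned above are meant to avoid.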