2022
DOI: 10.1038/s41467-022-35596-5

A method to build extended sequence context models of point mutations and indels

Abstract: The mutation rate of a specific position in the human genome depends on the sequence context surrounding it. Modeling the mutation rate by estimating a rate for each possible k-mer, however, only works for small values of k since the data becomes too sparse for larger values of k. Here we propose a new method that solves this problem by grouping similar k-mers. We refer to the method as k-mer pattern partition and have implemented it in a software package called kmerPaPa. We use a large set of human de novo mu…
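
The sparsity argument in the abstract can be made concrete with a small sketch. The number of distinct k-mers grows as 4^k, so a fixed number of observed mutations is spread ever more thinly as k increases, while a pattern that groups several k-mers pools their counts. The code below is only an illustration under stated assumptions: it is not the kmerPaPa algorithm, the use of IUPAC ambiguity codes to express a group of k-mers is an assumption made here for illustration, and the function names and the event count in the usage note are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def n_kmers(k):
    """Number of distinct DNA k-mers of length k."""
    return 4 ** k


def mean_events_per_kmer(n_events, k):
    """Average number of observed events per k-mer if events were spread evenly."""
    return n_events / n_kmers(k)


def expand_pattern(pattern):
    """List every concrete k-mer matched by an IUPAC pattern (hypothetical helper)."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern))]
```

For example, with a hypothetical 1,000,000 observed events, `mean_events_per_kmer(1_000_000, 3)` is 15,625 events per 3-mer, whereas `mean_events_per_kmer(1_000_000, 9)` is fewer than 4 events per 9-mer. Grouping helps because a single pattern such as `"NCG"` covers the four 3-mers returned by `expand_pattern("NCG")`, so one rate is estimated from their pooled counts.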

Citations: cited by 6 publications (7 citation statements)
References: 37 publications (52 reference statements)
“…Here, we present Baymer, a Bayesian method to model mutation rate variation that computationally scales to large windows of nucleotide sequence context (S2 Text and S8 Table), robustly manages sparse data through an efficient regularization strategy, and emits posterior probabilities that capture uncertainty in estimated probabilities. Consistent with previous studies [24][25][26], we show that expanded sequence context models in most current human datasets are overfit with classic empirical methods but considerably improve model performance when properly regularized. As a result, this method allows for renewed evaluation of experiments that originally were statistically limited to polymorphism probability models with small sequence context windows.…”
Section: Discussion (supporting)
confidence: 89%
“…Next, we attempted to discover specific motifs that are enriched in the highest or lowest 1% of 9-mer polymorphism probabilities within each mutation type (S1 Text and S4 Table). We recapitulate almost all previously reported motifs [19,23,25]. Consistent with previous reports, we identify a preponderance of repeat-rich motifs, which is perhaps due to the impact of slippage in introducing mutations [18].…”
Section: Sequence Context Motifs Are Correlated With Changes In Polym... (supporting)
confidence: 91%
“…Different mutational processes lead to indel mutations, so Roulette values cannot necessarily be adapted to model this mutation type. 21 We approximated per-gene joint distributions of indel mutation rates and deleteriousness scores as follows. First, we considered all possible exonic indels of length ≤10nt for which precomputed CADD scores were available for download and all possible intronic insertions of length 1nt and deletions of length ≤4nt for which precomputed SpliceAI scores were available for download.…”
Section: Methods (mentioning)
confidence: 99%