2021
DOI: 10.1101/2021.12.06.471476
Preprint

A method to build extended sequence context models of point mutations and indels

Abstract: The mutation rate of a specific position in the human genome depends on the sequence context surrounding it. Modeling the mutation rate by estimating a rate for each possible k-mer, however, only works for small values of k since the data becomes too sparse for larger values of k. Here we propose a new method that solves this problem by grouping similar k-mers using IUPAC patterns. We refer to the method as k-mer pattern partition and have implemented it in a software package called kmerPaPa. We use a large se…
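
To make the idea of grouping k-mers with IUPAC patterns concrete, here is a minimal Python sketch. It only shows how one pattern stands for a set of k-mers and how sparse per-k-mer counts could be pooled within that set. It is not the kmerPaPa implementation, and it does not show how that package chooses its partition. The function names (`matches`, `expand`, `pooled_rate`) and the count dictionaries are hypothetical, and pooling by simple summed counts is an assumption made for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(pattern, kmer):
    """Return True if every base of kmer is allowed by the pattern at that position."""
    return len(pattern) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(pattern, kmer)
    )

def expand(pattern):
    """List every concrete k-mer covered by an IUPAC pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in pattern))]

def pooled_rate(pattern, mutated, total):
    """Pool counts over all k-mers covered by the pattern (illustrative assumption).

    mutated and total are hypothetical dicts mapping k-mer -> count.
    Returns None when the pattern covers no observed positions.
    """
    kmers = expand(pattern)
    n_mutated = sum(mutated.get(k, 0) for k in kmers)
    n_total = sum(total.get(k, 0) for k in kmers)
    return n_mutated / n_total if n_total else None
```

For example, under these assumptions the pattern `RCG` covers the two 3-mers `ACG` and `GCG`, so their counts would be estimated as one group rather than separately.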

citations
Cited by 2 publications
(4 citation statements)
references
References 38 publications
“…Here, we present Baymer, a Bayesian method to model mutation rate variation that computationally scales to large windows of nucleotide sequence context, robustly manages sparse data through an efficient regularization strategy, and emits posterior probabilities that capture uncertainty in estimated probabilities. Consistent with previous studies [24][25][26], we show that expanded sequence context models in most current human datasets are overfit with classic empirical methods but considerably improve model performance when properly regularized. As a result, this method allows for renewed evaluation of experiments that originally were statistically limited to polymorphism probability models with small sequence context windows.…”
Section: Discussion
supporting
confidence: 89%
“…2E). This implies a considerable impact on polymorphism probabilities in extended sequence contexts, consistent with previous work 19,[23][24][25]. This general trend is similarly consistent across mutation types (Fig.…”
Section: Larger Contexts Best Explain Patterns Of Variation Genome-wide
supporting
confidence: 90%
“…The incorporation of transcription and replication directions makes the model strand-dependent with unequal rates for mutations of the same type on the two DNA strands. To our knowledge, strand-dependency has not been incorporated into existing context-dependent and regional mutation models 1,10,22. In addition to the known epigenomic features listed above, unexplained regional variation in the mutation rate has been observed 17,18.…”
mentioning
confidence: 99%