Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer

Hoang, Minh; Zheng, Hongyu; Kingsford, Carl

doi:10.1089/cmb.2022.0275

Cited by 6 publications (19 citation statements)

References 22 publications

“…The density factor normalizes density for the window size w of the scheme. We follow the definition of Zheng et al. [8]: for a sequence S the density factor is df(S) = […] factor removes the dependence on L, e.g., making the expected density factor of all random minimizers the same, regardless of k and L. Note that other works define the density factor simply as the density times a factor of (w + 1) (cf.…”

Section: Minimizer Density (mentioning)

confidence: 99%
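
The simpler definition that the statement attributes to other works (density times w + 1) can be written out directly. The sketch below is only an illustration of that arithmetic: the function name and its arguments are hypothetical, and it does not reproduce the Zheng et al. definition, whose formula is missing from the quoted statement.

```python
def density_factor(num_selected, num_kmers, w):
    """Density times (w + 1), where density = selected k-mers / all k-mers.

    Assumption: this is the simpler definition mentioned in the quoted
    statement, not the Zheng et al. definition (its formula is not shown).
    """
    if num_kmers <= 0:
        raise ValueError("num_kmers must be positive")
    density = num_selected / num_kmers
    return (w + 1) * density


# Example: 3 of 7 k-mers selected with windows of w = 3 k-mers.
example = density_factor(num_selected=3, num_kmers=7, w=3)
```

Scaling by (w + 1) makes densities measured with different window sizes easier to compare.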

Efficient minimizer orders for large values of k using minimum decycling sets

Pellow, Ekim, et al. 2022. Preprint.

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers overall than necessary and therefore provide limited improvement to runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders resulting in fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that need minimizers of larger k. Here, we close this gap by introducing decycling set-based minimizer orders. We define new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a query method that avoids the need to keep the k-mers of a decycling set in memory, which enables the use of these minimizer orders for any value of k. We expect the new decycling set-based minimizer orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
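
As a rough illustration of the selection rule in the abstract (a minimum k-mer in every L-long sub-sequence), the sketch below scans all windows and records the position of each window's minimum. It assumes a plain lexicographic k-mer order and leftmost tie-breaking, neither of which comes from the paper; the decycling-set orders themselves are not implemented here, and the function names are hypothetical.

```python
def minimizer_positions(seq, k, L, key=None):
    """Start positions of the k-mers chosen as minima of the L-long windows."""
    if key is None:
        key = lambda kmer: kmer  # assumption: lexicographic k-mer order
    w = L - k + 1  # number of k-mers inside one L-long window
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # assumption: ties go to the leftmost k-mer in the window
        picked.add(min(window, key=lambda i: (key(kmers[i]), i)))
    return sorted(picked)


def density(seq, k, L, key=None):
    """Fraction of the sequence's k-mers that are selected."""
    return len(minimizer_positions(seq, k, L, key)) / (len(seq) - k + 1)


positions = minimizer_positions("ACGTACGT", k=2, L=4)  # [0, 1, 4]
```

Passing a different `key` swaps in another k-mer order, which is the knob the abstract's comparison between orders is about: a better order yields fewer selected positions for the same k and L.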

“…The sampling function of a minimizer scheme is characterized by a tuple of parameters (w, k, π), where w and k are defined above. Additionally, π is a total ordering over the set of all k-mers, which can be represented [7] as a scoring function, such that for every pair of k-mers κ, κ′ ∈ Σ^k: …”

Section: Background and Notation (mentioning)

confidence: 99%
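
The quoted statement is cut off before it gives the condition that ties the scoring function to the ordering π. The sketch below therefore assumes the common convention that a lower score means an earlier position in the order, and breaks equal scores lexicographically so that the result is a total order. The function name `rank_kmers` and the DNA alphabet default are hypothetical.

```python
from itertools import product


def rank_kmers(k, score, alphabet="ACGT"):
    """Map every k-mer over the alphabet to its rank under a score function.

    Assumptions: lower score means earlier in the order, and equal scores
    are broken lexicographically so that the order is total.
    """
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    ordered = sorted(kmers, key=lambda kmer: (score(kmer), kmer))
    return {kmer: rank for rank, kmer in enumerate(ordered)}


# Example: score 1-mers by how many G's they contain.
ranks = rank_kmers(1, lambda kmer: kmer.count("G"))
```

Enumerating all |Σ|^k k-mers is only practical for small k; it is done here just to make the induced order visible.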

“…Varying the mask configuration induces a spectrum of comparable schemes, including minimizers (i.e., all-ones mask) and various syncmer schemes (i.e., one-hot masks). This unification reveals a methodical approach to derive comparable sketching schemes via combining existing minimizer construction/optimization techniques [7, 5, 21, 22] with a mask optimization routine.…”

Section: Introduction (mentioning)

confidence: 99%

“…We propose a sequence-specific learning objective to optimize the masked minimizer scheme with respect to the GSS metric. Our objective function extends the DeepMinimizer loss function [7] with a secondary objective that minimizes the expected change in local ordering of k-mers under random mutations. Our approach consistently improves the GSS metric for every mask variant and outperforms other known minimizer optimization approaches such as Miniception [21], PASHA [5] and DeepMinimizer [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…Our objective function extends the DeepMinimizer loss function [7] with a secondary objective that minimizes the expected change in local ordering of k-mers under random mutations. Our approach consistently improves the GSS metric for every mask variant and outperforms other known minimizer optimization approaches such as Miniception [21], PASHA [5] and DeepMinimizer [7]. We further conduct ablation studies to reveal practical insights about masked minimizers, such as the existence of trivial solutions, mask performance profiles and the ability to safeguard against a common minimizer pitfall.…”

Section: Introduction (mentioning)

confidence: 99%

Masked Minimizers: Unifying sequence sketching methods

Hoang, Marçais, Kingsford. 2022. Preprint. Self cite.

Minimizers and syncmers are sequence sketching methods that extract representative substrings from a long sequence. We show that both these sampling rules are different instantiations of a new unifying concept we call masked minimizers, which applies a sub-sampling binary mask on a minimizer sketch. This unification leads to the first formal procedure to meaningfully compare minimizers, syncmers and other comparable masked minimizers. We further demonstrate that existing sequence sketching metrics, such as density (which measures the sketch sparseness) and conservation (which measures the likelihood of the sketch being preserved under random mutations), should not be independently measured when evaluating masked minimizers. We propose a new metric that reflects the trade-off between these quantities called the generalized sketch score, or GSS. Finally, we introduce a sequence-specific and gradient-based learning objective that efficiently optimizes masked minimizer schemes with respect to the proposed GSS metric. We show that our method finds sketches with better overall density and conservation compared to existing expected and sequence-specific approaches, enabling more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
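
One plausible reading of "a sub-sampling binary mask on a minimizer sketch" is sketched below: each window of w consecutive k-mers locates its minimum, and that k-mer is kept only if its offset within the window is switched on in the mask. This is an assumption rather than the paper's formal definition; the function name, the lexicographic default order, and the leftmost tie-breaking are all hypothetical choices.

```python
def masked_minimizer_positions(seq, k, w, mask, key=None):
    """Positions of window minima whose in-window offset is allowed by mask.

    Assumption: mask is a length-w sequence of 0/1 flags. An all-ones mask
    reduces to an ordinary minimizer sketch; a one-hot mask keeps a window's
    minimum only when it sits at one fixed offset.
    """
    if len(mask) != w:
        raise ValueError("mask must have exactly w entries")
    if key is None:
        key = lambda kmer: kmer  # assumption: lexicographic k-mer order
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # offset of the window's minimum, leftmost on ties (assumption)
        offset = min(range(w), key=lambda j: (key(kmers[start + j]), j))
        if mask[offset]:
            picked.add(start + offset)
    return sorted(picked)


all_ones = masked_minimizer_positions("ACGTACGT", k=2, w=3, mask=[1, 1, 1])
one_hot = masked_minimizer_positions("ACGTACGT", k=2, w=3, mask=[0, 0, 1])
```

On this toy input the all-ones mask keeps positions 0, 1 and 4, while the one-hot mask on the last offset keeps only position 4. Sparser masks lower the density, which is the density side of the trade-off the abstract says must be weighed against conservation.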
