2019
DOI: 10.1101/2019.12.26.888685
Preprint

The String Decomposition Problem and its Applications to Centromere Assembly

Abstract: Recent attempts to assemble long tandem repeats (such as multi-megabase long centromeres) faced the challenge of accurate translation of long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Centromeres represent a particularly complex type of nested tandem repeats, where each unit is itself a repeat formed by chromosome-specific monomers (a repeat within repeat). Given a set of monomers forming a specific centromere, translation of a read into monomers is modeled as the Stri…

Cited by 3 publications (7 citation statements)
References 20 publications (32 reference statements)
“…In case monomer sequences are known, TandemQUAST attempts to detect discrepancies between reads and the assembly at the monomer level. The assembled centromere and all reads are aligned to the provided monomer sequences and are subsequently translated into the monomer alphabet using the StringDecomposer tool (Dvorkina et al, 2020), resulting in a monocentromere and monoreads.…”
Section: Methods | Type: mentioning
confidence: 99%
“…In case monomer sequences are known, tandemQUAST attempts to detect discrepancies between reads and the assembly at the monomer level. The assembled centromere and all reads are aligned to the provided monomer sequences and are subsequently translated into the monomer alphabet using the StringDecomposer tool (Dvorkina et al, 2019), resulting in a monocentromere and monoreads. Using nucleotide read alignments, for each monomer ReadMonomer in each monoread tandemQUAST calculates StartPos(ReadMonomer), the starting nucleotide position of ReadMonomer in the monocentromere.…”
Section: Breakpoint Metric | Type: mentioning
confidence: 99%
“…If an assembly is represented as an array of monomers, tandemQUAST splits this array into repeated units (a sequence of monomers, e.g., a series of twelve monomers forming a HOR on cenX can be represented as m1 m2 ... m12). To automatically derive a unit, tandemQUAST uses the StringDecomposer tool (Dvorkina et al, 2019) to translate the assembly from the nucleotide to the monomer alphabet (the alphabet size is the number of distinct monomers). Afterwards, it collects all t-mers in the monomer alphabet (the default value t = 5), calculates the average distance d between two consecutive occurrences of the same t-mer, and selects the most frequent d-mer in the monomer alphabet as a standard unit.…”
Section: Appendix: Unit-based Statistic | Type: mentioning
confidence: 99%
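
The unit-derivation steps quoted above can be sketched roughly as follows. This is only an illustration of the description in the citation statement, not tandemQUAST's code: the function name `derive_unit` is hypothetical, the input is assumed to be a list of monomer labels, and rounding the average distance to the nearest integer is an assumption the quoted text does not specify.

```python
from collections import Counter, defaultdict


def derive_unit(monomers, t=5):
    """Return the most frequent d-mer (as a tuple of monomer labels), or None.

    `monomers` is a sequence of monomer labels, i.e. an assembly already
    translated into the monomer alphabet.
    """
    # Record the start positions of every t-mer in the monomer alphabet.
    positions = defaultdict(list)
    for i in range(len(monomers) - t + 1):
        positions[tuple(monomers[i:i + t])].append(i)

    # Distances between consecutive occurrences of the same t-mer.
    gaps = [b - a for occ in positions.values() for a, b in zip(occ, occ[1:])]
    if not gaps:
        return None  # no t-mer repeats, so no unit can be derived

    # Assumption: round the average distance to get an integer unit length d.
    d = round(sum(gaps) / len(gaps))

    # The most frequent d-mer is taken as the standard unit.
    dmers = Counter(
        tuple(monomers[i:i + d]) for i in range(len(monomers) - d + 1)
    )
    return dmers.most_common(1)[0][0]


# Toy example: a twelve-monomer unit repeated three times.
hor = [f"m{i}" for i in range(1, 13)]
unit = derive_unit(hor * 3)
```

On this toy input every repeated 5-mer recurs at distance 12, so `d` is 12 and the returned unit is the twelve-monomer series `m1 ... m12`.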
“…A monomer is frequent if the number of monomer-blocks in its cluster exceeds a frequency threshold, and infrequent, otherwise. Recently, Uralsky et al, 2019, and Dvorkina et al, 2020 revealed still underexplored hybrid monomers (each hybrid monomer is a concatenate of two or even more frequent monomers) and hypothesized that such hybridization may drive the "birth" of new frequent monomers. Different human centromeres typically have different monomers and units while the number of the frequent monomers in a unit varies from 2 for chromosome 19 to 19 for chromosome 4.…”
Section: Introduction | Type: mentioning
confidence: 99%
“…Recent evolutionary studies of centromeres (Uralsky et al, 2019, Suzuki et al, 2020) revealed the importance of partitioning them into monomers, the problem that was addressed by the StringDecomposer algorithm (Dvorkina et al, 2020). Given a nucleotide string Centromere and a monomer-set Monomers, StringDecomposer decomposes the Centromere into monomer-blocks (each block is similar to one of the monomers) and transforms Centromere into a monocentromere string Centromere* over the alphabet of monomers.…”
Section: Introduction | Type: mentioning
confidence: 99%
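
To make the decomposition described above concrete, here is a brute-force sketch that splits a nucleotide string into blocks and labels each block with its closest monomer, minimizing the total edit distance. It is not the StringDecomposer algorithm or its interface: the names `decompose` and `slack`, the use of plain unit-cost edit distance, and the limit on how far a block length may deviate from its monomer length are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def decompose(centromere, monomers, slack=2):
    """Return the list of monomer names labelling consecutive blocks.

    `monomers` maps a monomer name to its nucleotide sequence. Each block
    may be up to `slack` characters shorter or longer than its monomer
    (a hypothetical parameter, chosen only to bound the search).
    """
    n = len(centromere)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of decomposing the prefix of length i
    back = [None] * (n + 1)  # back[i]: (block start, monomer name) achieving best[i]
    best[0] = 0
    for end in range(1, n + 1):
        for name, seq in monomers.items():
            for length in range(max(1, len(seq) - slack), len(seq) + slack + 1):
                start = end - length
                if start < 0 or best[start] == inf:
                    continue
                cost = best[start] + edit_distance(centromere[start:end], seq)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, name)
    if best[n] == inf:
        raise ValueError("string cannot be split into blocks of allowed lengths")

    # Trace the block boundaries back from the end of the string.
    blocks = []
    pos = n
    while pos > 0:
        pos, name = back[pos]
        blocks.append(name)
    return blocks[::-1]


# Toy example: two monomers, third block carries one substitution.
toy_monomers = {"A": "ACGT", "B": "TTGA"}
mono = decompose("ACGT" + "TTGA" + "ACCT", toy_monomers)
```

In this toy case the result is the monomer string `A B A`, with the third block matched to monomer `A` despite its single mismatch. The sketch recomputes an edit distance for every candidate block, so it is only practical for short strings.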