The String Decomposition Problem and its Applications to Centromere Assembly

Dvorkina, Tatiana; Bzikadze, Andrey V.; Pevzner, Pavel A.

doi:10.1101/2019.12.26.888685

Cited by 3 publications

(7 citation statements)

References 20 publications

(32 reference statements)

“…In case monomer sequences are known, TandemQUAST attempts to detect discrepancies between reads and the assembly at the monomer level. The assembled centromere and all reads are aligned to the provided monomer sequences and are subsequently translated into the monomer alphabet using the StringDecomposer tool (Dvorkina et al., 2020), resulting in a monocentromere and monoreads.…”

Section: Methods (mentioning)

confidence: 99%

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

et al. 2020

Motivation: Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results: To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation: https://github.com/ablab/TandemTools. Supplementary information: Supplementary data are available at Bioinformatics online.

“…In case monomer sequences are known, tandemQUAST attempts to detect discrepancies between reads and the assembly at the monomer level. The assembled centromere and all reads are aligned to the provided monomer sequences and are subsequently translated into the monomer alphabet using the StringDecomposer tool (Dvorkina et al, 2019), resulting in a monocentromere and monoreads. Using nucleotide read alignments, for each monomer ReadMonomer in each monoread tandemQUAST calculates StartPos(ReadMonomer), the starting nucleotide position of ReadMonomer in the monocentromere.…”

Section: Breakpoint Metric (mentioning)

confidence: 99%

“…If an assembly is represented as an array of monomers, tandemQUAST splits this array into repeated units (a sequence of monomers, e.g., a series of twelve monomers forming a HOR on cenX can be represented as m1 m2 ... m12). To automatically derive a unit, tandemQUAST uses the StringDecomposer tool (Dvorkina et al, 2019) to translate the assembly from the nucleotide to the monomer alphabet (the alphabet size is the number of distinct monomers). Afterwards, it collects all t-mers in the monomer alphabet (the default value t = 5), calculates the average distance d between two consecutive occurrences of the same t-mer, and selects the most frequent d-mer in the monomer alphabet as a standard unit.…”

Section: Appendix: Unit-based Statistic (mentioning)

confidence: 99%
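
The unit-derivation steps quoted above (collect t-mers, average the distance d between consecutive occurrences, pick the most frequent d-mer) can be sketched roughly as follows. This is an illustrative sketch only, not tandemQUAST's code: the function name `derive_unit`, the list-of-labels input, and the rounding of the average distance to an integer are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def derive_unit(monomers, t=5):
    """Hypothetical helper: guess the repeat unit of a monocentromere.

    monomers: list of monomer labels (the assembly in the monomer alphabet).
    Returns (d, unit) or None if no t-mer occurs twice.
    """
    # Record the start positions of every t-mer in the monomer alphabet.
    positions = defaultdict(list)
    for i in range(len(monomers) - t + 1):
        positions[tuple(monomers[i:i + t])].append(i)

    # Distances between consecutive occurrences of the same t-mer.
    gaps = [b - a
            for occ in positions.values()
            for a, b in zip(occ, occ[1:])]
    if not gaps:
        return None

    # Assumption: the average distance is rounded to the nearest integer.
    d = round(sum(gaps) / len(gaps))

    # The most frequent d-mer is taken as the standard unit.
    dmers = Counter(tuple(monomers[i:i + d])
                    for i in range(len(monomers) - d + 1))
    unit, _count = dmers.most_common(1)[0]
    return d, unit


if __name__ == "__main__":
    # Toy monocentromere: a 3-monomer unit repeated ten times.
    print(derive_unit(["m1", "m2", "m3"] * 10, t=2))
```

On a perfectly periodic toy input, every gap equals the unit length, so d recovers the period; on real data the average may be skewed by rearrangements, which is why the rounding step here should be read as a simplification.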

TandemMapper and TandemQUAST: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Mikheenko

Gurevich

et al. 2019

Preprint

Self Cite

Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there is no standard tool for their quality assessment. Moreover, since the mapping of long error-prone reads to ETR remains an open problem, it is not clear how to polish draft ETR assemblies. To address these problems, we developed the tandemMapper tool for mapping reads to ETRs and the tandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that tandemQUAST not only reveals errors in and evaluates ETR assemblies, but also improves them. To illustrate how tandemMapper and tandemQUAST work, we apply them to recently generated assemblies of human centromeres.

“…A monomer is frequent if the number of monomer-blocks in its cluster exceeds a frequency threshold, and infrequent, otherwise. Recently, Uralsky et al, 2019, and Dvorkina et al, 2020 revealed still underexplored hybrid monomers (each hybrid monomer is a concatenate of two or even more frequent monomers) and hypothesized that such hybridization may drive the "birth" of new frequent monomers. Different human centromeres typically have different monomers and units while the number of the frequent monomers in a unit varies from 2 for chromosome 19 to 19 for chromosome 4.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent evolutionary studies of centromeres (Uralsky et al, 2019, Suzuki et al, 2020) revealed the importance of partitioning them into monomers, the problem that was addressed by the StringDecomposer algorithm (Dvorkina et al, 2020). Given a nucleotide string Centromere and a monomer-set Monomers, StringDecomposer decomposes the Centromere into monomer-blocks (each block is similar to one of the monomers) and transforms Centromere into a monocentromere string Centromere* over the alphabet of monomers.…”

Section: Introduction (mentioning)

confidence: 99%
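
To make the input and output of the decomposition described above concrete, here is a toy sketch. It is not the StringDecomposer algorithm: it swaps in a simple dynamic program that minimizes total edit distance, and the names `decompose`, `edit_distance`, and the `slack` parameter (how far a block's length may differ from its monomer's length) are hypothetical choices for this example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def decompose(centromere, monomers, slack=1):
    """Split centromere into blocks, each labeled with its closest monomer.

    monomers: dict mapping monomer name -> nucleotide sequence.
    Returns a list of (name, start, end) or None if no split exists.
    """
    n = len(centromere)
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[i]: best total distance for the prefix [:i]
    back = [None] * (n + 1)  # back[i]: (monomer name, block start) ending at i
    cost[0] = 0
    for end in range(1, n + 1):
        for name, seq in monomers.items():
            lo = max(1, len(seq) - slack)
            for length in range(lo, len(seq) + slack + 1):
                start = end - length
                if start < 0 or cost[start] == inf:
                    continue
                c = cost[start] + edit_distance(centromere[start:end], seq)
                if c < cost[end]:
                    cost[end], back[end] = c, (name, start)
    if cost[n] == inf:
        return None
    # Walk the back-pointers to recover the blocks left to right.
    blocks, end = [], n
    while end > 0:
        name, start = back[end]
        blocks.append((name, start, end))
        end = start
    return blocks[::-1]


if __name__ == "__main__":
    toy_monomers = {"A": "ACGT", "B": "TTGA"}
    blocks = decompose("ACGTTTGAACCT", toy_monomers)
    print(blocks)
    # The monocentromere is the sequence of block labels.
    print("".join(name for name, _, _ in blocks))
```

The list of block labels plays the role of the monocentromere Centromere*; each (start, end) pair marks one monomer-block in the nucleotide string.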

HORmon: automated annotation of human centromeres

Kunyavskaya

Dvorkina

Bzikadze

et al. 2021

Preprint

Self Cite

Recent advances in long-read sequencing opened a possibility to address the long-standing questions about the architecture and evolution of human centromeres. They also emphasized the need for centromere annotation (partitioning human centromeres into monomers and higher-order repeats (HORs)). Even though there was a half-century-long series of semi-manual studies of centromere architecture, a rigorous centromere annotation algorithm is still lacking. Moreover, an automated centromere annotation is a prerequisite for studies of genetic diseases associated with centromeres, and evolutionary studies of centromeres across multiple species. Although the monomer decomposition (transforming a centromere into a monocentromere written in the monomer alphabet) and the HOR decomposition (representing a monocentromere in the alphabet of HORs) are currently viewed as two separate problems, we demonstrate that they should be integrated into a single framework in such a way that HOR (monomer) inference affects monomer (HOR) inference. We thus developed the HORmon algorithm that integrates the monomer/HOR inference and automatically generates the human monomers/HORs that are largely consistent with the previous semi-manual inference.
