2020
DOI: 10.1093/bioinformatics/btaa454

The string decomposition problem and its applications to centromere analysis and assembly

Abstract: Motivation: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate tran…
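To make the idea of translating a read into the monomer alphabet concrete, here is a minimal toy sketch in Python. It is not the StringDecomposer algorithm from the paper. It assumes a simplified formulation in which the read is cut into consecutive blocks, each block is labeled with one monomer, and the total edit distance is minimized. The names `edit_distance`, `decompose`, and the `slack` parameter are hypothetical and exist only for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def decompose(read: str, monomers: dict, slack: int = 1):
    """Toy decomposition: split `read` into blocks, label each with a monomer.

    Assumption (not from the paper): each block may differ in length from
    its monomer by at most `slack` characters, and the objective is the sum
    of per-block edit distances.
    """
    n = len(read)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i] = minimal cost of decomposing read[:i]
    back = [None] * (n + 1)  # back[i] = (start of last block, monomer name)
    best[0] = 0
    for end in range(1, n + 1):
        for name, mono in monomers.items():
            shortest = max(1, len(mono) - slack)
            for length in range(shortest, len(mono) + slack + 1):
                start = end - length
                if start < 0 or best[start] == inf:
                    continue
                cost = best[start] + edit_distance(read[start:end], mono)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, name)
    if best[n] == inf:
        raise ValueError("read cannot be covered with the given monomers")
    labels = []
    pos = n
    while pos > 0:  # walk the back-pointers from the end of the read
        pos, name = back[pos]
        labels.append(name)
    return labels[::-1], best[n]


# Hypothetical 4-bp "monomers"; real alpha-satellite monomers are far longer.
toy_monomers = {"A": "ACGT", "B": "TTGA"}
labels, cost = decompose("ACGTTTCAACGT", toy_monomers)
print("".join(labels), cost)
```

The toy read contains one substitution in its middle block, so the sketch still labels it as monomer B while reporting a nonzero cost. This brute-force version is only practical for very short inputs.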

Cited by 33 publications (51 citation statements)
References 32 publications (65 reference statements)
“…First, the sequence scaffold and underlying ONT reads were subjected to RepeatMasker (v.3.3.0) to ensure that read overlaps were concordant in repeat structure. Second, the centromeric scaffold and underlying ONT reads were subjected to StringDecomposer 42 to validate the HOR organization in overlapping reads. Finally, the sequence scaffold for each target region was incorporated into the CHM13 chromosome 8 assembly previously generated 5, thereby filling the gaps in the chromosome 8 assembly.…”
Section: Methods (mentioning)
confidence: 99%
“…The in silico digestion analysis indicates that the chromosome 8 centromeric HOR array consists of 1,462 HOR units: 283 4-monomer HORs, 4 5-monomer HORs, 13 6-monomer HORs, 356 7-monomer HORs, 295 8-monomer HORs, and 511 11-monomer HORs. As an alternative approach, we subjected the centromere assembly to StringDecomposer 42 (https://github.com/ablab/stringdecomposer; version from 28 February 2020) using a set of 11 α-satellite monomers derived from a chromosome 8 11-mer HOR unit. The sequence of the α-satellite monomers used are as follows: A: AGCATTCTCAGAAACACCTTCGTGATGTTTGCAATCAAGTCACAGAGTTGAACCTTCCGTTTCATAGAGCAGGTTGGAAACACTCTTATTGTAGTATCTGGAAGTGGACATTTGGAGCGCTTTCAGGCCTATGGTGAAAAAGGAAATATCTTCCCATAAAAACGACATAGA; B: AGCTATCTCAGGAACTTGTTTATGATGCATCTAATCAACTAACAGTGTTGAACCTTTGTACTGACAGAGCACTTTGAAACACTCTTTTTTGGAATCTGCAAGTGGATATTTGGATCGCTTTGAGGATTTCGTTGGAAACGGGATGCAATATAAAACGTACACAGC; C: AGCATACTCAGAAAATACTTTGCCATATTTCCATTCAAGTCACAGAGTGGAACATTCCCATTCATAGAGCAGGTTGGAAACACTCTTTTTGGAGTATCTGGAAGTGGACATTTGGAGCGCTTTCTGAACTATGGTGAAAAAGGAAATATCTTCCAATGAAAACAAGACAGA; D: AGCATTCTGAGAAACTTATTTGTGATGTGTGTCCTCAACAAACGGACTTGAACCTTTCGTTTCATGCAGTACTTCTGGAACACTCTTTTTGAAGATTCTGCATGCGGATATTTGGATAGCTTTGAGGATTTCGTTGGAAACGGGCTTACATGTAAAAATTAGACAGC; E: AGCATTCTCAGAAACTTCTTTGTGGTGTCTGCATTCAAGTCACAGAATTGAACTTCTCCTCACATAGAGCAGTTGTGCAGCACTCTATTTGTAGTATCTGGAAGTGGACATTTGGAGGGCTTTGTAGCCTATCTGGAAAAAGGAAATATCTTCCCATGAATGCGAGATAGA; F: AGTAATCTCAGAAACATGTTTATGCTGTATCTACTCAACTAACTGTGCTGAACATTTCTATTGATAGAGCAGTTTTGAGACCCTCTTCTTTTGGAATCTGCAAGTGGATATTTGGATAGATTTGAGGATTTCGTTGGAAACGGGATTATATATAAAAAGTAGACAGC; G: AGCATTCTCAGAAACTTCTTTGTGATGTTTGCATCCAGCTCTCAGAGTTGAACATTCCCTTTCATAGAGTAGGTTTGAAACCCTCTTTTTATAGTGTCTGGAAGCGGGCATTTGGAGCGCTTTCAGGCCTATGCTGAAAAAGGAAATATCTACATATAGAAACTAGACAGA; H: AGCATTCTGAGAATCAAGTTTGTGATGTGGGTACTCAACTAACAGTGTTGATCCATTCTTTTGATACAGCAGTTTTGAACCACACTTTTTGTAGAATCTGCAAGTGGATATTTGGATAGCTGTGAGGATTTCGTTGGAAACGGGAATGTCTTCATAGAAAATTTAGACAGA; I: AGCATTCTCAGAACCTTGATTGTGATGTGTGTTCTCCACTAACAGAGTTGAACCTTTCTTTTGACAGAACTGTTCTGAAACATTCTTTTTATAGAATCTGGAAGTGGATATTTGGAAAGCTTTGAGGATTTCGTTGGAAACGGGAATATCTTCAAATAAAATCTAGCCAGA; J: AGCATTCTAAGAAACATCTTAGGGATGTTTACATTCAAGTCACAGAGTTGAACATTCCCTTTCACAGAGCAGGTTTGAAACAATCTTCTCGTACTATCTGGCAGTGGACATTTTGAGCTCTTTGGGGCCTATGCTGAAAAAGGAAATATCTTCCGACAAAAACTAGTCAGA; K: AGCATTCGCAGAATCCCGTTTGTGATGTGTGCACTCAACTGTCAGAATTGAACCTTGGTTTGGAGAGAGCACTTTTGAAACACACTTTTTGTAGAATCTGCAGGTGGATATTTGGCTAGCTTTGAGGATTTCGTTGGAAACGGTAATGTCTTCAAAGAAAATCTAGACAGA.…”
Section: Methods (mentioning)
confidence: 99%
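The HOR tally quoted in the statement above can be checked with a few lines of Python. The counts are copied from the quoted text. The monomer total is a derived figure that assumes every k-monomer HOR contributes exactly k monomers, and the variable names are illustrative.

```python
# HOR unit counts by HOR length (in monomers), as quoted in the statement.
hor_counts = {4: 283, 5: 4, 6: 13, 7: 356, 8: 295, 11: 511}

# Total number of HOR units; the statement reports 1,462.
total_hor_units = sum(hor_counts.values())

# Derived quantity (assumption: a k-monomer HOR contains exactly k monomers).
total_monomers = sum(length * count for length, count in hor_counts.items())

print(total_hor_units)
print(total_monomers)
```

The per-length counts sum to the 1,462 HOR units reported in the statement, so the quoted figures are internally consistent.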
“…Recent developments in long, accurate long reads (Hifi) as well as ultra-long nanopore reads could pave the way for new advancements in finishing centromeric and other highly repetitive regions in humans and polyploids. Computationally, Hifi reads can be decomposed into monomers [139] that are represented in the graph, where monomers are nodes and edges represent the adjacencies of node sequences from reads. In this process, the haplotype variation is also considered in the monomers that can result in a haplotype-aware graph.…”
Section: Remaining Challenges and Perspectives (mentioning)
confidence: 99%
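The graph described in the statement above, with monomers as nodes and read-supported adjacencies as edges, can be sketched as follows. This is a minimal illustration, assuming each read has already been decomposed into a sequence of monomer labels. The function name `build_monomer_graph` and the toy reads are hypothetical, and haplotype-aware handling is not modeled.

```python
from collections import Counter


def build_monomer_graph(decomposed_reads):
    """Build a directed monomer adjacency graph from decomposed reads.

    Each read is a sequence of monomer labels (a string or a list).
    Returns the set of nodes and a Counter mapping each directed edge
    (left monomer, right monomer) to the number of times that adjacency
    was observed across all reads.
    """
    nodes = set()
    edges = Counter()
    for read in decomposed_reads:
        nodes.update(read)
        # Consecutive labels in a read define one observed adjacency.
        for left, right in zip(read, read[1:]):
            edges[(left, right)] += 1
    return nodes, edges


# Two toy reads over a hypothetical 3-monomer HOR "ABC".
nodes, edges = build_monomer_graph(["ABCABC", "BCAB"])
print(sorted(nodes))
print(sorted(edges.items()))
```

Keeping edge multiplicities rather than a plain edge set lets rare adjacencies, which may reflect read errors or genuine variants, be separated from well-supported ones downstream.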