2018
DOI: 10.1101/475194
Preprint

Noise-Cancelling Repeat Finder: Uncovering tandem repeats in error-prone long-read sequencing data

Abstract: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated…

Cited by 3 publications (3 citation statements)
References 21 publications
“…In total, five error rates with different substitution:insertion:deletion ratios (13%, 41:23:36, 15%-a, 37:42:21, 15%-b, 11:60:29, 16%, 28:24:48 and 20%, 48:15:37), five repeat pattern sizes (100, 500, 1000, 2000 and 3000) and five copy numbers (2, 3, 5, 10 and 20) were used to generate 15 simulated datasets (Tables 1–3). The five error rates and error distributions come from five public real datasets (PacBio: 15%-a, 15%-b and ONT: 13%, 16%, 20%) (Weirather et al., 2017; Harris et al., 2018). For each simulated dataset, 1000 reads were generated.…”
Section: Results; citation type: mentioning
confidence: 99%
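
The simulation setup quoted above can be illustrated with a minimal Python sketch. This is not the citing authors' simulator: it assumes errors are introduced independently at each base, that motifs are random ACGT strings, and that a read consists only of the repeat array with no flanking sequence. The names `ERROR_PROFILES`, `add_noise`, and `simulate_read` are hypothetical; only the error rates and substitution:insertion:deletion ratios are taken from the quoted statement.

```python
import random

# Total error rate and substitution:insertion:deletion ratio for each
# profile named in the quoted statement.
ERROR_PROFILES = {
    "13%": (0.13, (41, 23, 36)),
    "15%-a": (0.15, (37, 42, 21)),
    "15%-b": (0.15, (11, 60, 29)),
    "16%": (0.16, (28, 24, 48)),
    "20%": (0.20, (48, 15, 37)),
}

BASES = "ACGT"


def add_noise(seq, error_rate, ratio, rng):
    """Apply per-base errors (assumed independent) to seq."""
    sub, ins, dele = ratio
    total = sub + ins + dele
    p_sub = error_rate * sub / total
    p_ins = error_rate * ins / total
    out = []
    for base in seq:
        r = rng.random()
        if r < p_sub:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < p_sub + p_ins:
            # Insertion: keep the base, then add a random extra base.
            out.append(base)
            out.append(rng.choice(BASES))
        elif r < error_rate:
            # Deletion: drop the base.
            continue
        else:
            out.append(base)
    return "".join(out)


def simulate_read(motif_len, copies, profile, seed=0):
    """Return (clean, noisy) versions of one tandem-repeat read."""
    rng = random.Random(seed)
    rate, ratio = ERROR_PROFILES[profile]
    motif = "".join(rng.choice(BASES) for _ in range(motif_len))
    clean = motif * copies
    return clean, add_noise(clean, rate, ratio, rng)
```

For example, `simulate_read(500, 10, "16%")` would produce one read with a 500-base pattern repeated 10 times under the 16% profile. Calling it with 1000 different seeds would give a dataset of the size mentioned in the statement, under the assumptions listed above.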
“…2) The pipelines used to annotate the repeats were not able to identify the satellite DNA arrays. According to [46], computational tools that take into account the high error rates of long-read technologies are lacking [46]. The employment of personalized pipelines, such as in [6] [47] [48], or tools specifically designed for satellite analysis in long-reads, like NCRF [46], tandem-genotypes [49], PacmonSTR [50], TandemTools [51] and Winnowmap2 [52], could improve the satellite DNA annotation results.…”
Section: Satellite DNA; citation type: mentioning
confidence: 99%
“…The combination of cytogenetics and genomics studies has proven to be useful in elucidating numerous aspects of genome evolution and organization [30,31], with particular emphasis on repetitive DNAs [6,32–34]. Furthermore, due to their tandemly repeated genomic organization, satDNA studies in non-model organisms were boosted in the last few years, especially with the development of several assembly-free pipelines designed for using raw reads [35–38]. In this context, several satDNA catalogs were characterized in a variety of invertebrate and vertebrate species [6,34,39–44].…”
Section: Introduction; citation type: mentioning
confidence: 99%