2018
DOI: 10.1186/s12859-018-2438-1

Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis

Abstract: Background: The current versions of reference genome assemblies still contain gaps represented by stretches of Ns. Since high throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay a targeted portion of the genomic sequence, meaning that regions from the unassayed portion of the genomic sequence cannot be detected in those experiments. We here refer to all such regions as inaccessible regions, and hypothesize …
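
As an illustration of the "stretches of Ns" the abstract describes, the following is a minimal sketch that locates gap runs in a sequence string and counts the remaining accessible bases. The function name `find_gaps`, the `min_len` parameter, and the toy sequence are assumptions made for this example; they are not taken from the paper or its software.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return 0-based, half-open (start, end) intervals for runs of N/n.

    Hypothetical helper for illustration only; min_len filters out runs
    shorter than the given length.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy sequence (invented): two gap runs inside 16 bases.
seq = "ACGTNNNNACGTNNAC"
gaps = find_gaps(seq)

# Bases outside the gap runs, i.e. where reads could in principle map.
accessible = len(seq) - sum(end - start for start, end in gaps)

print(gaps)        # [(4, 8), (12, 14)]
print(accessible)  # 10
```

A real analysis would read the intervals from the assembly's FASTA or gap annotation rather than from an in-memory string.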

Cited by: 11 publications (11 citation statements)
References: 17 publications
“…Scaffold N50 indicates the minimum scaffold size among the largest scaffolds making up half of the assembly, while BUSCO values measure the number of complete/incomplete/missing core genes in the assembly. However, genome completeness goes beyond scaffold N50 and gene presence (Domanska et al., 2018; Sedlazeck et al., 2018; Thomma et al., 2016). Genes usually occupy a small fraction of genomes and new sequencing technologies commonly yield high N50 values.…”
Section: Discussion (mentioning)
confidence: 99%
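
The scaffold N50 definition quoted above can be sketched as follows: sort scaffold lengths from largest to smallest and return the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are assumptions chosen for illustration, not values from any cited assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length of the shortest scaffold in the smallest set of
    largest scaffolds that together cover at least half of the assembly.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")


# Invented lengths: the total is 300, and 80 + 70 = 150 reaches half.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

Note that this statistic depends only on scaffold lengths, so gap bases inside scaffolds count towards it, which is consistent with the quoted point that completeness goes beyond N50.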
“…and gene presence (Domanska et al, 2018; Sedlazeck et al, 2018; Thomma et al, 2016). Genes usually occupy a small fraction of genomes and new sequencing technologies commonly yield high N50…”
Section: How Complete Are Genome Assemblies? (mentioning)
confidence: 99%
“…As noted above, well known deficiencies in genome assemblies include difficulty in assembling repetitive, duplicated, and GC rich regions that can often be addressed with long‐read sequencing (Sedlazeck et al, 2018), but at the expense of sequencing error which may influence estimates of gene content (Jaworski et al, 2020; Watson & Warr, 2019). The trade‐offs of various sequencing technologies can be offset by applying multiple platforms (Peona et al, 2021; Rhie, McCarthy, et al, 2020), however many chromosome‐level genome assemblies have extensive gaps within and among scaffolds that may hinder utility of these hybrid assemblies for subsequent studies (Domanska et al, 2018; Peona et al, 2021). Thus, researchers should strive to accurately report the quality of genome assemblies with key statistics that account for fragmentation, accuracy, and gene content.…”
Section: Evaluating Assemblies (mentioning)
confidence: 99%
“…The 2.26 Gb assembled genome from a female marmoset, although was sorted out into chromosomes, contained many shorter contigs and also 187,214 gap regions. These hard to assemble gap regions cannot be ignored, as they can lead to false positive results [8], and the gap regions could harbor many functionally relevant genes [9]. Recent studies have uncovered that many genes were wrongly labelled as missing in bird genomes, because of the locality of those genes being GC-rich and hence had posed challenges in identifying them [9].…”
Section: Introduction (mentioning)
confidence: 99%