2019
DOI: 10.1186/s12864-019-5957-x

Evaluating the quality of the 1000 genomes project data

Abstract: Background: Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotyping, phasing, and imputation accuracy data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomic…

Cited by 51 publications (45 citation statements)
References 31 publications
“…However, for genomes with very high mutation rates, such as HIV-1 [68], recurrence is sufficiently high to make estimates of allele age meaningless. In addition, while we have shown GEVA to be robust to realistic levels of sequencing and haplotype phasing error, the actual structures of error found in reference data sources, such as TGP [69], have additional complexity whose effect is unknown.…”
Section: Discussion (mentioning)
confidence: 98%
“…However, for genomes with very high mutation rates, such as HIV-1 [72], recurrence is sufficiently high to make estimates of allele age meaningless. In addition, while we have shown GEVA to be robust to realistic levels of sequencing and haplotype phasing error, the actual structures of error found in reference data sources, such as the TGP, have additional complexity whose effect is unknown [73].…”
Section: Discussion (mentioning)
confidence: 98%
“…It was also not surprising to find that the variant allele frequency for c.833T>C was underestimated when it occurred in cis given that the 68 bp insertion, which also includes the c.833T wild-type base, is almost identical in sequence to the reference genome. Phasing and imputation errors of rare variants in the 1000 Genomes data have been attributed to the limited sample size [35]. Our findings suggest, though, that the c.[833T>C;844_845ins68] complex variant may have remained undetected in the 1000 Genomes samples as a result of the alignment and variant calling methods used in the original NGS analysis, and that there may be other complex or rare variants in these data that also have gone underreported.…”
Section: Discussion (mentioning)
confidence: 99%