2020
DOI: 10.1101/2020.03.03.975219
Preprint

Reducing reference bias using multiple population reference genomes

Abstract: Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the "reference flow" alignment method that uses information from multiple population reference genomes to improve alignment accuracy and reduce reference…

Cited by 13 publications
(13 citation statements)
References 48 publications
“…Therefore, these differences in tree topologies could be partially explained by the use of genetically heterogeneous data sets. Moreover, its impact on tree reconstruction may be alleviated by using multiple references simultaneously or a reference pangenome instead [22,60-64]. If data sets of isolates were homogenous (i.e., the isolates are equally close to the same reference) as the one employed by Lee and Behr [25], we would expect that read alignment performance and tree resolution would decrease as we select progressively distant reference genomes [23,24,28].…”
Section: PLOS Computational Biology
mentioning
confidence: 99%
“…Instead, large-scale reference panels from a wide range of populations can provide similar information [4,5]. Recent studies use such information to improve alignment accuracy and reduce biases in alignment [10][11][12], but there has been little work to incorporate population data with variant calling.…”
mentioning
confidence: 99%
“…Nevertheless, there are grounds for suspecting that this approach might introduce biases depending on the reference used for mapping. Most of these errors originate in the genetic differences between the reference and the read sequence data [18][19][20][21], and they can affect subsequent analyses [22][23][24][25][26][27][28]. These include the identification of variants throughout the genome (mainly single nucleotide polymorphisms [SNPs]) and phylogenetic tree construction, which are essential steps for epidemiological and evolutionary inferences.…”
mentioning
confidence: 99%