Reducing reference bias using multiple population reference genomes

Chen, Nae-Chyun; Solomon, Brad; Mun, Taher; Iyer, Sheila; Langmead, Ben

doi:10.1101/2020.03.03.975219

Cited by 13 publications

(13 citation statements)

References 48 publications


“…Therefore, these differences in tree topologies could be partially explained by the use of genetically heterogeneous data sets. Moreover, its impact on tree reconstruction may be alleviated by using multiple references simultaneously or a reference pangenome instead [22,60–64]. If data sets of isolates were homogenous (i.e., the isolates are equally close to the same reference) as the one employed by Lee and Behr [25], we would expect that read alignment performance and tree resolution would decrease as we select progressively distant reference genomes [23,24,28].…”

Section: PLOS Computational Biology

mentioning

confidence: 99%

One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

et al. 2021


Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.


“…Instead, large-scale reference panels from a wide range of populations can provide similar information [4,5]. Recent studies use such information to improve alignment accuracy and reduce biases in alignment [10–12], but there has been little work to incorporate population data with variant calling.…”

mentioning

confidence: 99%

Improving variant calling using population data and deep learning

Chen, Kolesnikov, Goel, et al. 2021

Preprint

Self Cite


Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.


“…Nevertheless, there are grounds for suspecting that this approach might introduce biases depending on the reference used for mapping. Most of these errors originate in the genetic differences between the reference and the read sequence data [18–21], and they can affect subsequent analyses [22–28]. These include the identification of variants throughout the genome (mainly single nucleotide polymorphisms [SNPs]) and phylogenetic tree construction, which are essential steps for epidemiological and evolutionary inferences.…”

mentioning

confidence: 99%

One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads

Valiente-Mullor, Beamud, Ansari, et al. 2020

Preprint


Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.

Author summary

Mapping consists in the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the 'reference genome' of a species - a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. Biases/errors due to reference choice for mapping in bacteria have been identified. These are mainly originated in alignment errors due to genetic differences between the reference genome and the read sequences. Eventually, they could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). However, a systematic work on the effects of reference choice in different bacterial species is still missing, particularly regarding its impact on phylogenies. This work intended to fill that gap. The impact of reference choice has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be prescriptive to assess the potential biases of mapping.

Introduction

The development and increasing availability of high-throughput sequen...
