2021
DOI: 10.1371/journal.pcbi.1008678

One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

Abstract: Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legion…

Citations: cited by 55 publications (43 citation statements)
References: 97 publications (155 reference statements)
“…The selection of a reference genome with a well-defined core genome can be a source of bias if it is not closely related to the analysed sequences and can affect the subsequent analyses such as the detection of genetic events [28]. This could explain why the cgMLST analysis performed with 1928 showed a lower estimated genomic variation rate compared to SeqSphere+ since they are based on two different references, i.e.…”
Section: Discussion (mentioning)
confidence: 99%
“…Furthermore, for some applications, the underpinning genomic diversity of B. cereus s.l. cannot be ignored; for example, bioinformatic analyses used in WGS-based outbreak investigations (e.g., reference genome selection, single-nucleotide polymorphism [SNP] identification, phylogeny construction) can be affected by unexpected genomic diversity (Olson et al 2015; Pagotto 2014, 2015; Usongo et al 2018; Valiente-Mullor et al 2021). For B. cereus s.l.…”
Section: Bam Protocol For B Cereus (mentioning)
confidence: 99%
“…Our recent software publication raspir in combination with gapseq, a tool introduced by Zimmermann et al (2021) facilitated the taxonomic and functional identification of core and rare species from shotgun metagenomic sequencing data and reference genomes, respectively, with reduced false discovery and omission rates [27], [28]. Since previous reports have demonstrated that metagenome investigations are affected by the reference database of choice [29] and the normalisation strategy of count data for addressing the compositional behaviour of microbiome sequencing data [30], [31], [32], [33], we tested our model simulations, random forest bootstrapping aggregations, ecological network analysis and kernel-based machine learning applications on infant metagenome datasets, generated from read alignments towards either a pan-genome or a one-strain-per-species reference database. Moreover, we generated datasets based on three different read count normalisation strategies, namely variance-stabilising transformations (VST), relative log expression (RLE) and bacterial to human cell ratios (BCPHC) and worked with three distinct rarity thresholds (15th, 25th and 35th species abundance percentile) to define the core and rare species biosphere.…”
Section: Introduction (mentioning)
confidence: 99%