2022
DOI: 10.7717/peerj.13821

Benchmark datasets for SARS-CoV-2 surveillance bioinformatics

Abstract: Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), has spread globally and is being surveilled with an international genome sequencing effort. Surveillance consists of sample acquisition, library preparation, and whole genome sequencing. This has necessitated a classification scheme detailing Variants of Concern (VOC) and Variants of Interest (VOI), and the rapid expansion of bioinformatics tools for sequence analysis. These…

Cited by 7 publications (22 citation statements)
References: 35 publications

“…From a bioinformatic perspective, AusTrakka increased its minimum genome coverage criteria from >50% to >=90% (ACGT bases), which was the basis for the internal quality criterium of several participants. Regardless of AusTrakka participation, we agree with the conclusions of Lau et al (2022) and recommend this criterion as a minimum given the shared characteristics of the discordant metrics observed here (LB01-BS08, LB10-BS03, and LB10-BS08) being genome coverage <80% in addition to previous benchmarking studies (8,21). These developments further demonstrate the need for living guidelines to continually review and update the quality standards for SARS-CoV-2 WGS.…”
Section: Discussion; citation type: supporting
confidence: 90%
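
The coverage criterion quoted above (>=90% of the genome called as ACGT bases) can be checked directly on a consensus sequence. The following is a minimal sketch, not the AusTrakka implementation: the function names are hypothetical, and the default reference length of 29,903 bp (the Wuhan-Hu-1 reference, NC_045512.2) is an assumption, since the statement does not say which reference coverage is measured against.

```python
# Assumed reference length: Wuhan-Hu-1 (NC_045512.2). The quoted statement
# does not name the reference, so treat this default as an assumption.
REFERENCE_LENGTH = 29_903


def acgt_coverage(consensus: str, reference_length: int = REFERENCE_LENGTH) -> float:
    """Return the fraction of reference positions called as A, C, G or T.

    Ambiguity codes (N, R, Y, ...), gaps and whitespace are not counted.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / reference_length


def meets_coverage_minimum(
    consensus: str,
    reference_length: int = REFERENCE_LENGTH,
    minimum: float = 0.90,
) -> bool:
    """Apply the >=90% ACGT coverage minimum recommended in the statement."""
    return acgt_coverage(consensus, reference_length) >= minimum
```

Under this sketch, a consensus with genome coverage below 80%, as described for the discordant samples, would fail the check.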
“…Lower quality datasets are highly useful for optimizing, validating, verifying and benchmarking the performance of algorithms, pipelines and instruments, as well as training new personnel (Figure 1). An example of the utility of high and low quality datasets can be seen in Xiaoli et al (2022) in which SARS-CoV-2 Nanopore/Illumina read datasets generated from public health genomic surveillance were shared as a collection to support benchmarking tools, understanding the genomic epidemiology of different lineages, and identifying variants of concern. The collection also contained a number of SARS-CoV-2 genomes of lower quality due to recognized errors and common sequencing failures (Xiaoli et al, 2022).…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…An example of the utility of high and low quality datasets can be seen in Xiaoli et al (2022) in which SARS-CoV-2 Nanopore/Illumina read datasets generated from public health genomic surveillance were shared as a collection to support benchmarking tools, understanding the genomic epidemiology of different lineages, and identifying variants of concern. The collection also contained a number of SARS-CoV-2 genomes of lower quality due to recognized errors and common sequencing failures (Xiaoli et al, 2022). Sharing sub-optimal data can be useful for the broader public health and research community, particularly when the data is carefully annotated with known issues so that it is not mistaken for better quality information, and can be more easily identified in repositories.…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…The appropriate selection of bioinformatic workflows and parameter thresholds is key to the accuracy of genomic data and varies depending on the sequencing approach and technology. While selection varies, some commonly accepted thresholds include the following: read depth (>=10 for Illumina, >=20 for Nanopore), phred score (>25) and genome coverage >=90% [8]. Discrepancies at SARS-CoV-2 mutation sites can affect the interpretation of genomic metrics of clinical and public health importance, such as PANGO lineage classification, phylogenetic placement or genomic epidemiological clustering [9–11].…”
Section: Introduction; citation type: mentioning
confidence: 99%
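
The thresholds listed in the statement above (read depth >=10 for Illumina and >=20 for Nanopore, phred score >25, genome coverage >=90%) can be expressed as a simple per-sample check. This is an illustrative sketch only: the `SampleMetrics` class, its field names, and the `qc_failures` function are hypothetical, and how each metric is computed (for example mean versus minimum depth) is left open because the statement does not specify it.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as quoted in the citation statement.
MIN_READ_DEPTH = {"illumina": 10, "nanopore": 20}  # inclusive (>=)
MIN_PHRED_EXCLUSIVE = 25                           # strict (>25)
MIN_GENOME_COVERAGE = 0.90                         # inclusive (>=90%)


@dataclass
class SampleMetrics:
    """Hypothetical container for one sample's summary metrics."""
    platform: str           # "illumina" or "nanopore"
    read_depth: float       # depth statistic chosen by the lab (assumption)
    phred_score: float      # quality statistic chosen by the lab (assumption)
    genome_coverage: float  # fraction of the genome covered, 0.0 to 1.0


def qc_failures(metrics: SampleMetrics) -> List[str]:
    """Return the names of the thresholds that the sample does not meet."""
    platform = metrics.platform.lower()
    if platform not in MIN_READ_DEPTH:
        raise ValueError(f"no depth threshold defined for platform {metrics.platform!r}")

    failures = []
    if metrics.read_depth < MIN_READ_DEPTH[platform]:
        failures.append("read_depth")
    if metrics.phred_score <= MIN_PHRED_EXCLUSIVE:
        failures.append("phred_score")
    if metrics.genome_coverage < MIN_GENOME_COVERAGE:
        failures.append("genome_coverage")
    return failures
```

An empty list means the sample meets all three thresholds; returning the failed metric names rather than a single boolean makes it easier to see why a sample was rejected.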