Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data

Kumaran, Manojkumar; Umadevi, S.; Devarajan, Bharanidharan

doi:10.1186/s12859-019-2928-9

Cited by 53 publications

(40 citation statements)

References 32 publications

“…In comparison, the F1-score before filtration is 0.715 for Monovar and 0.725 for BCFtools. These F1-scores are in-line with previously reported analysis on real and simulated data which is consistent with the total number of reads across all samples indicating that our simulation tool provides FASTQ reads consistent with real data [14,21].…”

Section: Visualization Of Simulated Reads (supporting)

confidence: 90%

SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data

et al. 2020

Background: Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units where each of these samples may be from a single cell or bulk tissue. Yet, there does not yet exist a tool for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples so that developers of analysis methods can assess accuracy and precision. Results: We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing of variant callers or other genomic tools. Conclusions: The DNA sequencing data generated by our simulator is representative of real data and integrates seamlessly with standard downstream analysis tools.

“…The average coverage numbers, however, do not explain all of the differences that are observed. In Figure 2, the low coverage samples cause the peaks at the lower end of each distribution while the larger distributions show that the choice of analysis pipeline can have a large impact on the consistency of variant results, as has been described by us and others (Craig et al, 2016; Chen et al, 2019; Kumaran et al, 2019). A consistent analysis pipeline is expected to improve across-center consistency by up to 5%, assuming that the variance among replicates represents the maximal reproducibility across datasets.…”

Section: Discussion (mentioning)

confidence: 66%

A Distributed Whole Genome Sequencing Benchmark Study

et al. 2020

Population sequencing often requires collaboration across a distributed network of sequencing centers for the timely processing of thousands of samples. In such massive efforts, it is important that participating scientists can be confident that the accuracy of the sequence data produced is not affected by which center generates the data. A study was conducted across three established sequencing centers, located in Montreal, Toronto, and Vancouver, constituting Canada’s Genomics Enterprise (www.cgen.ca). Whole genome sequencing was performed at each center, on three genomic DNA replicates from three well-characterized cell lines. Secondary analysis pipelines employed by each site were applied to sequence data from each of the sites, resulting in three datasets for each of four variables (cell line, replicate, sequencing center, and analysis pipeline), for a total of 81 datasets. These datasets were each assessed according to multiple quality metrics including concordance with benchmark variant truth sets to assess consistent quality across all three conditions for each variable. Three-way concordance analysis of variants across conditions for each variable was performed. Our results showed that the variant concordance between datasets differing only by sequencing center was similar to the concordance for datasets differing only by replicate, using the same analysis pipeline. We also showed that the statistically significant differences between datasets result from the analysis pipeline used, which can be unified and updated as new approaches become available. We conclude that genome sequencing projects can rely on the quality and reproducibility of aggregate data generated across a network of distributed sites.

“…At the end, the authors conclude that the combination of tools could increase performance but with the sacrifice of a vast amount of detected calls [60]. Similar conclusions of complementary algorithms were drawn in another study evaluating four variant callers using whole exome sequencing and simulated data [61]. These researchers also noted differences based on different aligner tools.…”

Section: Short Nucleotide Variants (mentioning)

confidence: 59%

Comprehensive Outline of Whole Exome Sequencing Data Analysis Tools Available in Clinical Oncology

Bartha, Győrffy 2019, Cancers

Whole exome sequencing (WES) enables the analysis of all protein coding sequences in the human genome. This technology enables the investigation of cancer-related genetic aberrations that are predominantly located in the exonic regions. WES delivers high-throughput results at a reasonable price. Here, we review analysis tools enabling utilization of WES data in clinical and research settings. Technically, WES initially allows the detection of single nucleotide variants (SNVs) and copy number variations (CNVs), and data obtained through these methods can be combined and further utilized. Variant calling algorithms for SNVs range from standalone tools to machine learning-based combined pipelines. Tools for CNV detection compare the number of reads aligned to a dedicated segment. Both SNVs and CNVs help to identify mutations resulting in pharmacologically druggable alterations. The identification of homologous recombination deficiency enables the use of PARP inhibitors. Determining microsatellite instability and tumor mutation burden helps to select patients eligible for immunotherapy. To pave the way for clinical applications, we have to recognize some limitations of WES, including its restricted ability to detect CNVs, low coverage compared to targeted sequencing, and the missing consensus regarding references and minimal application requirements. Recently, Galaxy became the leading platform in non-command line-based WES data processing. The maturation of next-generation sequencing is reinforced by Food and Drug Administration (FDA)-approved methods for cancer screening, detection, and follow-up. WES is on the verge of becoming an affordable and sufficiently evolved technology for everyday clinical use.
