The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly four hundred medically relevant genes due to their repetitiveness or polymorphic complexity. Here we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single nucleotide variations, 3,600 INDELs, and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes including CBS, CRYAA, and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
The precisionFDA Truth Challenge V2 aimed to assess the state-of-the-art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods outperformed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.
The repetitive nature and complexity of multiple medically important genes make them intractable to accurate analysis, despite the maturity of short-read sequencing, resulting in a gap in clinical applications of genome sequencing. The Genome in a Bottle Consortium has provided benchmark variant sets, but these excluded some medically relevant genes due to their repetitiveness or polymorphic complexity. In this study, we characterize 273 of these 395 challenging autosomal genes that have multiple implications for medical sequencing. This extended, curated benchmark reports over 17,000 SNVs, 3,600 INDELs, and 200 SVs each for GRCh37 and GRCh38. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically important genes including CBS, CRYAA, and KCNE1. Our proposed solution improves variant recall in these genes from 8% to 100%. This benchmark will significantly improve the comprehensive characterization of these medically relevant genes and guide new method development.
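The recall figure quoted above is the fraction of benchmark variants that a callset recovers. As a minimal sketch only, assuming variants are compared by exact match on (chromosome, position, reference allele, alternate allele), recall can be computed as below. The function name, the variant tuples, and the counts are all hypothetical illustrations and are not taken from the benchmark; real comparisons normally use dedicated benchmarking tools that also handle differing variant representations.

```python
from typing import Set, Tuple

# A variant keyed by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def variant_recall(truth: Set[Variant], called: Set[Variant]) -> float:
    """Return the fraction of truth variants present in the callset.

    recall = true positives / (true positives + false negatives)
    """
    if not truth:
        raise ValueError("truth set must not be empty")
    true_positives = len(truth & called)
    return true_positives / len(truth)


# Hypothetical toy data, for illustration only.
truth_set: Set[Variant] = {
    ("chr21", 1000, "A", "G"),
    ("chr21", 1500, "C", "T"),
    ("chr21", 2000, "G", "GA"),
    ("chr21", 2500, "T", "C"),
}

# Calls made against a reference where reads are mis-assigned to a
# falsely duplicated copy: most true variants are missed.
calls_before_masking: Set[Variant] = {
    ("chr21", 1000, "A", "G"),
    ("chr21", 9999, "A", "T"),  # a false positive; does not affect recall
}

# Calls made after the false duplication is masked (toy example).
calls_after_masking: Set[Variant] = set(truth_set)

recall_before = variant_recall(truth_set, calls_before_masking)
recall_after = variant_recall(truth_set, calls_after_masking)
print(f"recall before masking: {recall_before:.2f}")
print(f"recall after masking:  {recall_after:.2f}")
```

Exact tuple matching is the simplest possible comparison; it was chosen here only to keep the sketch self-contained.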
The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. We also assessed the best performing tool in third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (either based on a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct the mtDNA analyses from HTS data.
MiRNAs are important regulators of gene expression and are frequently deregulated under pathologic conditions. They are highly stable in bodily fluids which makes them feasible candidates to become minimally invasive biomarkers. In fact, several studies already proposed circulating miRNA-based biomarkers for different types of neoplastic, cardiovascular and degenerative diseases. However, many of these studies rely on small RNA sequencing experiments that are based on different RNA extraction and processing protocols, rendering results incomparable. We generated liqDB, a database for liquid biopsy small RNA sequencing profiles that provides users with meaningful information to guide their small RNA liquid biopsy research and to overcome technical and conceptual problems. By means of a user-friendly web interface, miRNA expression profiles from 1607 manually annotated samples can be queried and explored at different levels. Result pages include downloadable expression matrices, differential expression analysis, most stably expressed miRNAs, cluster analysis and relevant visualizations by means of boxplots and heatmaps. We anticipate that liqDB will be a useful tool in liquid biopsy research as it provides a consistently annotated large compilation of experiments together with tools for reproducible analysis, comparison and hypothesis generation. LiqDB is available at http://bioinfo5.ugr.es/liqdb.
Most causal variants of Mendelian diseases are exonic. Whole-exome sequencing (WES) has become the diagnostic gold standard, but causative variant prioritization constitutes a bottleneck. Here we assessed an in-house sample-to-sequence pipeline and benchmarked free prioritization tools for germline causal variants from WES data. WES of 61 unselected patients with a known genetic disease cause was obtained. Variant prioritizations were performed by diverse tools, and diagnostic yield was recorded as the proportion of cases in which the causal variant ranked within the top 1, top 5, and top 10 candidates. A fraction of causal variants was not captured by WES (8.2%) or did not pass the quality control criteria (13.1%). Most of the applications inspected were unavailable or had technical limitations, leaving nine tools for complete evaluation. Exomiser performed best at the top 1 cutoff, while LIRICAL led at the top 5 cutoff. Based on the more conservative top 10 cutoff, Xrare had the highest diagnostic yield, followed by a three-way tie among Exomiser, LIRICAL, and PhenIX, then followed by AMELIE, TAPES, Phen-Gen, AIVar, and VarNote-PAT. Xrare, Exomiser, LIRICAL, and PhenIX are the most efficient options for variant prioritization in real patient WES data.
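The diagnostic yield at rank cutoffs 1, 5, and 10 described above can be illustrated with a short sketch. It assumes that, for each patient, a tool reports the rank of the known causal variant (1 is best) or nothing if the variant was not ranked at all. The function name and the example ranks are hypothetical and do not reproduce the study's data or its exact scoring procedure.

```python
from typing import Dict, Optional, Sequence


def top_k_yield(
    causal_ranks: Sequence[Optional[int]],
    cutoffs: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return, for each cutoff k, the fraction of cases whose causal
    variant was ranked at position k or better.

    A rank of None means the tool did not rank the causal variant;
    such cases count against the yield at every cutoff.
    """
    if not causal_ranks:
        raise ValueError("at least one case is required")
    total = len(causal_ranks)
    return {
        k: sum(1 for rank in causal_ranks if rank is not None and rank <= k) / total
        for k in cutoffs
    }


# Hypothetical ranks of the causal variant for six cases from one tool.
example_ranks = [1, 3, None, 7, 12, 1]

yields = top_k_yield(example_ranks)
for cutoff, value in yields.items():
    print(f"top {cutoff}: {value:.2f}")
```

Because the cutoffs are cumulative, the yield can only stay level or rise as the cutoff grows, which is why a tool that leads at the top 1 cutoff need not lead at the top 10 cutoff.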