The primary literature on human genetic diseases with high penetrance includes descriptions of large numbers of pathogenic variants that can be essential for clinical diagnosis. Variant databases such as ClinVar and HGMD collect pathogenic variants by manual curation of either voluntary submissions or the published literature. AVADA (Automatically curated VAriant DAtabase) represents the first automated tool designed to construct a comprehensive database of highly penetrant genetic variants directly from full-text articles about human genetic disease. AVADA was able to automatically curate almost 60% of the pathogenic variants deposited in HGMD, over 4 times more than approaches parsing only PubMed abstracts. AVADA also contains more than 60,000 pathogenic variants that are in HGMD, but not in ClinVar. Despite being fully automated, 9 of AVADA's top 10 yielding journals are shared with HGMD's top 10, and its mutation type distribution strongly resembles that of both HGMD and ClinVar. We demonstrate the utility of AVADA in clinical practice on a cohort of 245 patients with already diagnosed genetic diseases. Out of 260 causative variants originally reported for these patients, AVADA contained 38 variants described in the literature prior to publication of the patient cohort, compared to 43 using HGMD, 20 using ClinVar and only 13 (wholly subsumed by AVADA's) using an automated abstracts-only based approach. The database of automatically curated variants will be made available upon publication at http://bejerano.stanford.edu/AVADA.
Abstract
Purpose: The primary literature on human genetic diseases includes descriptions of pathogenic variants that are essential for clinical diagnosis. Variant databases such as ClinVar and HGMD collect pathogenic variants by manual curation. We aimed to automatically construct a freely accessible database of pathogenic variants directly from full-text articles about genetic disease.
Methods: AVADA (Automatically curated VAriant DAtabase) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic variants and genes in the full text of the primary literature and converts them to genomic coordinates for rapid downstream use.
Results: AVADA automatically curated almost 60% of the pathogenic variants deposited in HGMD, a 4.4-fold improvement over the current state of the art in automated variant extraction. AVADA also contains more than 60,000 pathogenic variants that are in HGMD, but not in ClinVar. In a cohort of 245 diagnosed patients, AVADA correctly annotated 38 previously described diagnostic variants, compared to 43 using HGMD, 20 using ClinVar and only 13 (wholly subsumed by AVADA and ClinVar's) using the best automated abstracts-only based approach.
Conclusion: AVADA is the first machine learning tool that automatically curates a variant database directly from full-text literature. AVADA is available upon publication at http://bejerano.stanford.edu/AVADA.
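To put the cohort counts on a common scale, the short sketch below expresses each resource's count as a fraction of the 260 causative variants originally reported for the 245 patients. The counts are the ones stated in the abstract; the variable and function names are illustrative only and are not part of AVADA.

```python
# Counts quoted in the abstract: previously described causative variants
# found in each resource, out of 260 reported for the 245-patient cohort.
TOTAL_CAUSATIVE_VARIANTS = 260

recovered = {
    "HGMD": 43,
    "AVADA": 38,
    "ClinVar": 20,
    "abstracts-only": 13,
}


def recovery_fraction(found, total=TOTAL_CAUSATIVE_VARIANTS):
    """Fraction of the cohort's causative variants present in a resource."""
    return found / total


fractions = {name: recovery_fraction(count) for name, count in recovered.items()}

# Print the resources from highest to lowest recovery.
for name, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<15} {recovered[name]:>3}/{TOTAL_CAUSATIVE_VARIANTS} = {fraction:.1%}")
```

On these numbers, AVADA recovers roughly 15% of the cohort's causative variants, close to HGMD's roughly 17% and about three times the abstracts-only approach.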