Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .
Motivation: Identifying somatic changes from tumor and matched normal sequences has become a standard approach in cancer research. More specifically, this requires accurate detection of somatic point mutations with low allele frequencies in impure and heterogeneous cancer samples. Although haplotype phasing information derived by using heterozygous germ line variants near candidate mutations would improve accuracy, no somatic mutation caller that uses such information is currently available.
Results: We propose a Bayesian hierarchical method, termed HapMuC, in which power is increased by using available information on heterozygous germ line variants located near candidate mutations. We first constructed two generative models (the mutation model and the error model). In the generative models, we prepared candidate haplotypes, considering a heterozygous germ line variant if available, and the observed reads were realigned to the haplotypes. We then inferred the haplotype frequencies and computed the marginal likelihoods using a variational Bayesian algorithm. Finally, we derived a Bayes factor for evaluating the possibility of the existence of somatic mutations. We also demonstrated that our algorithm has superior specificity and sensitivity compared with existing methods, as determined based on a simulation, the TCGA Mutation Calling Benchmark 4 datasets and data from the COLO-829 cell line.
Availability and implementation: The HapMuC source code is available from http://github.com/usuyama/hapmuc.
Contact: imoto@ims.u-tokyo.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
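The final step described above, comparing the mutation model with the error model through a Bayes factor, can be illustrated with a minimal sketch. This is not the HapMuC implementation: the function names, the example log marginal likelihoods, and the decision threshold are all hypothetical, and the marginal likelihoods are assumed to have been computed already (in the paper, by a variational Bayesian algorithm).

```python
import math


def log_bayes_factor(log_ml_mutation: float, log_ml_error: float) -> float:
    """Log Bayes factor of the mutation model against the error model.

    Both arguments are log marginal likelihoods of the same observed reads.
    Working in log space avoids underflow for very small likelihoods.
    """
    return log_ml_mutation - log_ml_error


def call_somatic(
    log_ml_mutation: float,
    log_ml_error: float,
    log_threshold: float = math.log(10.0),  # hypothetical cutoff, not from the paper
) -> bool:
    """Report a candidate as somatic when the evidence exceeds the cutoff."""
    return log_bayes_factor(log_ml_mutation, log_ml_error) >= log_threshold


# Hypothetical values for one candidate site.
print(log_bayes_factor(-100.0, -105.0))
print(call_somatic(-100.0, -105.0))
```

A larger log Bayes factor means the reads are better explained by a true somatic mutation than by sequencing error; where to place the cutoff is a trade-off between sensitivity and specificity.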
Much of the AI work in healthcare is focused on disease prediction in clinical settings, which is an important application that has yet to deliver in earnest. However, there are other fundamental aspects, such as helping patients and care teams interact and communicate in efficient and meaningful ways, which could deliver quadruple-aim improvements. After heart disease and cancer, preventable medical errors are the third leading cause of death in the United States. The largest subset of medical errors is medication error. Providing the right treatment plan for patients requires knowledge of their current medications and drug allergies, which is often challenging to obtain. The widespread growth of prescribing and consuming medications has increased the need for applications that support medication reconciliation. We show a deep-learning application that can help reduce avoidable errors and their attendant risk by correctly identifying prescription medication, which is currently a tedious and error-prone task. We demonstrate prescription-pill identification from mobile images in the NIH NLM Pill Image Recognition Challenge dataset. Our application recognizes the correct pill within the top-5 results at 94% accuracy, which compares favorably to the original competition winner at 83.3% for top-5 under comparable, though not identical, configurations. The Institute of Medicine claims that better use of information technology can be an important step in reducing medication errors. Therefore, we believe that a more immediate impact of AI in healthcare will occur with a seamless integration of AI into clinical workflows, readily addressing the quadruple aim of healthcare.
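The top-5 figure quoted above counts a prediction as correct when the true pill appears anywhere among the five highest-ranked candidates. The following minimal sketch shows how such a metric is computed; the function name and the toy labels are hypothetical and are not taken from the challenge dataset or the authors' code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of samples whose true label is among the first k predictions.

    ranked_predictions[i] lists candidate labels for sample i, best first.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy example with made-up pill labels.
ranked = [["pill_a", "pill_b", "pill_c"], ["pill_d", "pill_e", "pill_f"]]
truths = ["pill_c", "pill_x"]
print(top_k_accuracy(ranked, truths, k=3))
```

Because the metric only asks whether the correct label is present in the short list, it suits workflows in which a person makes the final selection from a handful of candidates.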