Language model pretraining has led to significant performance gains, but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have a significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE.
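The first of the two ideas in the SpanBERT abstract, masking contiguous spans rather than isolated tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_contiguous_spans`, the 15% masking budget, the maximum span length of 5, the uniform choice of span length, and the `[MASK]` placeholder are all assumptions made for illustration. The span boundary prediction objective is not shown.

```python
import random


def mask_contiguous_spans(tokens, mask_ratio=0.15, max_span_len=5,
                          mask_token="[MASK]", seed=None):
    """Replace randomly chosen contiguous spans of tokens with mask_token.

    Hypothetical helper: mask_ratio, max_span_len and the uniform
    span-length choice are illustrative assumptions, not values taken
    from the abstract.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Number of positions we are allowed to mask, capped at the sequence length.
    budget = min(n, max(1, round(n * mask_ratio))) if n else 0
    masked = set()
    attempts = 0
    # Bounded retry loop so the function always terminates.
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        remaining = budget - len(masked)
        span_len = rng.randint(1, min(max_span_len, remaining))
        start = rng.randint(0, n - span_len)
        span = range(start, start + span_len)
        # Skip spans that overlap positions already masked.
        if any(i in masked for i in span):
            continue
        masked.update(span)
    output = [mask_token if i in masked else tok
              for i, tok in enumerate(tokens)]
    return output, sorted(masked)


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog near the river bank".split()
    masked_sentence, positions = mask_contiguous_spans(sentence, seed=0)
    print(masked_sentence)
    print(positions)
```

The function returns both the masked sequence and the masked positions, so a training loop could recover the original tokens at those positions as prediction targets.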
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross-sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.
We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., President and CEO). However, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. Our code and models are publicly available.
We have developed a murine model of the Hematopoietic Syndrome of the Acute Radiation Syndrome (H-ARS) for efficacy testing of medical countermeasures (MCM) against radiation according to the FDA Animal Rule. Ten- to 12-week-old male and female C57BL/6 mice were exposed to the LD50/30-LD70/30 dose of total body irradiation (TBI, ¹³⁷Cs, 0.62-0.67 Gy min⁻¹) in the morning hours, when mice were determined to be most radiosensitive, and assessed for 30-day survival and mean survival time (MST). Antibiotics were delivered in the drinking water on days 4-30 post-TBI at a concentration based on the amount of water that lethally irradiated mice were found to consume. The fluoroquinolones ciprofloxacin and levofloxacin, the tetracycline doxycycline, and the aminoglycoside neomycin all significantly increased MST of decedent mice, while ciprofloxacin (p=0.061) and doxycycline + neomycin (p=0.005) showed at least some efficacy in increasing 30-day survival. Blood sampling (30 µL/mouse every 5th day) was found to negatively impact 30-day survival. Histopathology of tissues harvested from non-moribund mice showed expected effects of lethal irradiation, while moribund mice were largely septicemic with a preponderance of enteric organisms. Kinetics of loss and recovery of peripheral blood cells in untreated mice and those treated with two MCM, granulocyte-colony stimulating factor and Amifostine, further characterized and validated our model for use in screening studies and pivotal efficacy studies of candidate MCM for licensure to treat irradiated individuals suffering from H-ARS.