Dirk Groeneveld scite author profile

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.


Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Dodge, Sap, Marasović et al. 2021


A Simple Yet Strong Pipeline for HotpotQA

Groeneveld, Khot, Sabharwal 2020


State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However, does their strong performance on popular multi-hop datasets really justify this added design complexity? Our results suggest that the answer may be no, because even our simple pipeline based on BERT, named QUARK, performs surprisingly well. Specifically, on HotpotQA, QUARK outperforms these models on both question answering and support identification (and achieves performance very close to a RoBERTa model). Our pipeline has three steps: 1) use BERT to identify potentially relevant sentences independently of each other; 2) feed the set of selected sentences as context into a standard BERT span prediction model to choose an answer; and 3) use the sentence selection model, now with the chosen answer, to produce supporting sentences. The strong performance of QUARK resurfaces the importance of carefully exploring simple model designs before using popular benchmarks to justify the value of complex techniques.


Construction of the Literature Graph in Semantic Scholar

Ammar, Groeneveld, Bhagavatula et al. 2018

Preprint


From F to A on the New York Regents Science Exams — An Overview of the Aristo Project

et al. 2020


AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even as recently as 2016, the best AI system could achieve merely 59.3 percent on an 8th grade science exam. This article reports success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90 percent on the exam’s nondiagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83 percent on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern natural language processing methods can result in mastery on this task. While not a full solution to general question-answering (the questions are limited to 8th grade multiple-choice science), it represents a significant milestone for the field.


From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

Clark, Etzioni, Khashabi et al. 2019

Preprint


AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge (Schoenick et al., 2016). This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83% on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern NLP methods can result in mastery on this task. While not a full solution to general question-answering (the questions are multiple choice, and the domain is restricted to 8th Grade science), it represents a significant milestone for the field. We gratefully acknowledge the late Paul Allen's inspiration, passion, and support for research on this grand challenge.


IKE - An Interactive Tool for Knowledge Extraction

Dalvi, Bhakthavatsalam, Clark et al. 2016


Recent work on information extraction has suggested that fast, interactive tools can be highly effective; however, creating a usable system is challenging, and few publicly available tools exist. In this paper we present IKE, a new extraction tool that performs fast, interactive bootstrapping to develop high-quality extraction patterns for targeted relations. Central to IKE is the notion that an extraction pattern can be treated as a search query over a corpus. To operationalize this, IKE uses a novel query language that is expressive, easy to understand, and fast to execute, all essential requirements for a practical system. It is also the first interactive extraction tool to seamlessly integrate symbolic (boolean) and distributional (similarity-based) methods for search. An initial evaluation suggests that relation tables can be populated substantially faster than by manual pattern authoring while retaining accuracy, and more reliably than fully automated tools, an important step towards practical KB construction. We are making IKE publicly available (http://allenai.org/software/interactive-knowledge-extraction).


Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Dodge, Sap, Marasović et al. 2021

Preprint


