Recently we released an early preview of a question answering (QA) system built on top of our data (see https://scite.ai/data-and-services for a more comprehensive picture of what our data is). This short article explains how it currently works as of November 16th, 2022, and a little bit about our plans for expanding it.
This is a fairly technical post so feel free to reach out to us at email@example.com if you have questions.
While we already have a pretty powerful search system that lets you search 1.2bn citation statements extracted from 32 million full texts (both closed and open access!), search isn't always the easiest way to get at the kinds of information you are looking for. This is especially true because we use a "lexical" (read: word-matching) search system built on Elasticsearch, so if you don't formulate your query with exactly the words you want matched, you might not find what you are looking for. In addition, much of the information seeking we do when looking at science comes in the form of questions we need answers to. We therefore thought it might help our users discover science-backed answers if we built a QA system over all our data!
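To illustrate the lexical-matching limitation, here is a hypothetical sketch of the kind of simple match query Elasticsearch evaluates (the field name "text" is illustrative, not our actual schema):

```python
# Hypothetical sketch of a lexical Elasticsearch match query.
# Field names are illustrative only, not scite's actual index schema.

def build_lexical_query(user_query: str) -> dict:
    """Build a simple match query: documents score only on shared
    (analyzed) terms, so synonyms and paraphrases won't match."""
    return {
        "query": {
            "match": {
                "text": {
                    "query": user_query,
                    "operator": "or",  # any shared term contributes to the score
                }
            }
        }
    }

# A query for "myocardial infarction" will not match a citation statement
# that only says "heart attack" -- there is no shared term to score on.
query = build_lexical_query("myocardial infarction")
```

This is why phrasing matters so much with lexical search, and part of what motivates a QA interface instead.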
We are building a large-scale scientific question answering system to help users get science-based answers to questions.
Question answering is a popular task in the natural language processing community that seeks to provide an answer to a question posed by the user. There are plenty of formulations of question answering, from multiple-choice, closed-book, and extractive QA to open-domain and multi-hop QA. For a great survey of the recent explosion of datasets alone, see QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension.
Our approach is open-domain extractive question answering. This means we look up a question against an open domain of scientific full texts and then extract answers from those texts as snippets of text. A nice layperson's overview of what the extractive part looks like is provided in the Hugging Face Transformers Course - Question answering.
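The core of the extractive step can be sketched in a few lines: the model emits a start score and an end score per token, and the answer is the highest-scoring valid span. Real systems (like the Hugging Face QA pipeline) add null-answer handling and other tricks; this toy version just shows the idea, with made-up scores rather than real model output:

```python
# Minimal sketch of extractive QA span selection from per-token scores.

def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) token indices of the highest-scoring span
    where start <= end and the span is at most max_len tokens."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    return best[1], best[2]

tokens = ["Aspirin", "reduces", "fever", "in", "adults"]
start = [0.1, 0.2, 2.5, 0.0, 0.3]   # toy scores, not real model output
end   = [0.0, 0.1, 3.0, 0.2, 0.4]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])   # -> "fever"
```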
Other than some technical details presented in the next section, the main difference is that we use citation statements (the 3-5 sentences where a reference is used in-text) extracted from full texts as our primary source for answering questions. We use abstracts as well, but there is significant evidence that citation statements carry a lot of useful information: they summarize an author's claims, provide factoid answers, and surface criticism or supporting evidence for a particular answer. For more details, see Do peers see more in a paper than its authors?. Currently we answer questions from over 1.2bn citation statements extracted from 32 million full texts (from both closed and open access sources) as well as about 48 million abstracts.
We started with a fairly basic approach to open-domain extractive QA.
The system works like this: given a question, we retrieve candidate passages (citation statements and abstracts), rerank them, and then extract answer spans from the top passages.
Our models are available here: https://huggingface.co/scite
Specifically, we use the ONNX-optimized versions (scite/roberta-base-squad2-nq-bioasq-optimized-gpu and scite/ms-marco-MiniLM-L-12-v2-onnx-optimized-gpu).
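The retrieve, rerank, and extract shape can be sketched as follows. This is a toy illustration, not our implementation: word-overlap scoring stands in for Elasticsearch retrieval, and the stub reranker and extractor stand in for the ONNX cross-encoder and extractive QA models named above:

```python
# Toy sketch of a retrieve -> rerank -> extract pipeline.
# All scoring functions are placeholders for the real components.

def retrieve(question, corpus, k=3):
    """Stand-in for lexical retrieval: rank passages by word overlap."""
    q = set(question.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def rerank(question, passages):
    """Stand-in for a cross-encoder reranker (here: word overlap again)."""
    q = set(question.lower().split())
    return sorted(passages, key=lambda p: len(q & set(p.lower().split())),
                  reverse=True)

def extract(question, passage):
    """Stand-in for the extractive QA model: return the whole passage."""
    return passage

corpus = [
    "Vitamin D supplementation reduced fracture risk in older adults.",
    "The study enrolled 120 participants over two years.",
    "Vitamin D levels were not associated with cancer incidence.",
]
question = "Does vitamin D reduce fracture risk?"
candidates = retrieve(question, corpus)
answer = extract(question, rerank(question, candidates)[0])
```

In the real system, each stage trades speed for quality: fast lexical retrieval narrows billions of statements to a shortlist, the cross-encoder reorders that shortlist more accurately, and the extractive model only runs on the best candidates.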
We are happy to provide more details on how we built this - reach out at firstname.lastname@example.org. We haven't written anything formal yet since we are still deciding on our approach; once we have settled on one, we will write up a paper fully describing it and release the final assets.
This is actually quite an open question! We know how ms-marco-MiniLM-L-12-v2 performs on MS MARCO (MRR@10 on the MS MARCO dev set: 39.02), and we know how our model performs on SQuAD2 + BioASQ: 88.5% exact match (93.3% F1).
But citation statements may be quite a different domain than these benchmarks, and besides, individual model performance doesn't tell us how the approach performs end to end.
We are currently in the process of developing a benchmark to assess performance end to end, so stay tuned!
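For context, the MRR@10 figure mentioned above is computed like this: for each query, take the reciprocal rank of the first relevant result within the top 10 (or 0 if none appears), then average over queries:

```python
# How MRR@10 is computed over a set of ranked result lists.

def mrr_at_10(rankings):
    """rankings: one list per query of 0/1 relevance labels, in ranked order."""
    total = 0.0
    for labels in rankings:
        for rank, relevant in enumerate(labels[:10], start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(rankings)

# Three queries: relevant hit at rank 1, at rank 2, and not in the top 10.
score = mrr_at_10([[1, 0, 0], [0, 1, 0], [0, 0, 0]])  # (1 + 0.5 + 0) / 3 = 0.5
```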
Our future plans will largely be driven by user feedback - so far that has mostly meant improving the UX and UI and adding controls like filtering by year. Beyond this, some concrete near-term plans are:
Some longer-term plans are to extend the extractive tasks beyond question answering and make the answers more informative for non-experts.
Feel free to reach out to us at email@example.com if you have any questions or feedback.