
How does Ask a Question work?

Ask a Question is an extractive question answering system built on over 1.2bn citation statements.
Wed Nov 16 2022

Recently we released an early preview of a question answering (QA) system built on top of our data (see https://scite.ai/data-and-services for a more comprehensive picture of what our data is). This short article explains how it works as of November 16th, 2022, and a little bit about our current plans for expanding it.

This is a fairly technical post, so feel free to reach out to us at hi@scite.ai if you have questions.

Why build a scientific question answering system?

While we have a pretty powerful search system that already lets you search 1.2bn citation statements extracted from 32 million full texts (both closed and open access!), search isn't always the easiest way to get at the kind of information you are looking for. This is especially true because we use a "lexical" (i.e., word-matching) search system built on Elasticsearch, so if you don't formulate your query with exactly the words you want matched, you might not get what you are looking for. In addition, much of the information seeking we do when looking at science comes in the form of questions we need answers to. We therefore thought it might help our users discover science-backed answers if we built a QA system over all our data!

We are building a large-scale scientific question answering system to help users get science-based answers to questions.

What is question answering?

Question answering is a popular task in the natural language processing community that seeks to provide an answer for a given question posed by the user. There are plenty of formulations of question answering, from multiple choice, closed book, and extractive QA to open-domain and multi-hop QA. For a great survey of the recent explosion of datasets alone, see QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension.

Our approach is open-domain extractive question answering. This means that we look up a question against an open domain of scientific full texts and then extract answers from those texts as snippets. A nice layperson's overview of what the extractive part of this looks like is provided in Huggingface Transformers Course - Question answering.
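
To make the "extractive" part concrete, here is a minimal sketch using the Hugging Face transformers pipeline. The model name and passage here are illustrative stand-ins (our own models, described below, are used in production); the point is simply that the answer is a span copied out of the passage rather than freshly generated text.

```python
# Minimal extractive QA sketch: the model selects an answer span from the
# passage rather than generating new text. Model and passage are illustrative.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="How many citation statements does scite index?",
    context=(
        "scite has extracted over 1.2 billion citation statements "
        "from 32 million full-text scientific articles."
    ),
)
print(result["answer"])  # an answer span such as "over 1.2 billion"
```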

How is this different from other systems?

Other than some technical details presented in the next section, the main difference is that we use citation statements (the 3-5 sentences where a reference is used in-text) extracted from full texts as our primary source for answering questions. We use abstracts as well, but there is significant evidence that citation statements provide a lot of useful information, from summarizing an author's claims and providing factoid answers to surfacing criticism or supporting evidence for a particular answer. For more details, see Do peers see more in a paper than its authors?. Currently we answer questions using over 1.2bn citation statements extracted from 32 million full texts (from both closed and open access sources), as well as about 48 million abstracts.

How does it actually work?

We started with a very basic approach to open-domain extractive QA.

The system works like this (a rough Python sketch follows the list):

  • We process your question by removing stop words and punctuation to form a query.
  • We retrieve the top 200 results from elasticsearch using that query over our 1.2bn citation statements and 48 million abstracts.
  • We rerank the results against the original question using a cross-encoder trained on MS MARCO (we use 'cross-encoder/ms-marco-MiniLM-L-12-v2', available via Sentence Transformers).
  • We use a page length of 20 results and run our extractive question answering model, trained on SQuAD 2.0, Natural Questions, and BioASQ (see our model for details).
  • We return the answers to you!
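
As a rough illustration of those steps, here is a minimal Python sketch. The Elasticsearch host, index and field names, the stop word list, and the non-ONNX QA checkpoint name are placeholder assumptions, not our production configuration.

```python
# Rough sketch of the pipeline above. Host, index, field names, the stop word
# list, and the QA checkpoint name are placeholders, not our production setup.
from elasticsearch import Elasticsearch
from sentence_transformers import CrossEncoder
from transformers import pipeline

es = Elasticsearch("http://localhost:9200")                       # placeholder host
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # reranking model
qa = pipeline(
    "question-answering",
    model="scite/roberta-base-squad2-nq-bioasq",                  # assumed checkpoint name
)

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "do", "does", "how", "what"}

def to_query(question: str) -> str:
    # Step 1: drop punctuation and stop words to form a lexical query.
    cleaned = "".join(c for c in question.lower() if c.isalnum() or c.isspace())
    return " ".join(t for t in cleaned.split() if t not in STOP_WORDS)

def ask(question: str, page_size: int = 20):
    # Step 2: retrieve the top 200 candidates from Elasticsearch.
    hits = es.search(
        index="citation-statements",                              # placeholder index
        query={"match": {"text": to_query(question)}},
        size=200,
    )["hits"]["hits"]
    snippets = [h["_source"]["text"] for h in hits]

    # Step 3: rerank the candidates against the original question.
    scores = reranker.predict([(question, s) for s in snippets])
    reranked = [s for _, s in sorted(zip(scores, snippets), key=lambda p: -p[0])]

    # Step 4: run extractive QA over the first page of reranked snippets.
    return [qa(question=question, context=s) for s in reranked[:page_size]]
```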

Our models are available here: https://huggingface.co/scite

Specifically, we use the ONNX-optimized versions (scite/roberta-base-squad2-nq-bioasq-optimized-gpu and scite/ms-marco-MiniLM-L-12-v2-onnx-optimized-gpu).
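
If you want to experiment with those checkpoints yourself, something like the following sketch with the Hugging Face optimum library is a reasonable starting point; the exact ONNX file names and execution-provider settings in the repos may require extra arguments, so treat this as an assumption rather than our exact serving code.

```python
# Sketch: loading the ONNX-optimized checkpoints with Hugging Face optimum.
# Exact ONNX file names and execution providers may need extra arguments.
from optimum.onnxruntime import (
    ORTModelForQuestionAnswering,
    ORTModelForSequenceClassification,
)
from transformers import AutoTokenizer, pipeline

qa_repo = "scite/roberta-base-squad2-nq-bioasq-optimized-gpu"
rerank_repo = "scite/ms-marco-MiniLM-L-12-v2-onnx-optimized-gpu"

# Extractive QA model, wrapped in a standard transformers pipeline.
qa_model = ORTModelForQuestionAnswering.from_pretrained(qa_repo)
qa_tokenizer = AutoTokenizer.from_pretrained(qa_repo)
qa = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

# Cross-encoder reranker; score (question, passage) pairs with its tokenizer.
rerank_model = ORTModelForSequenceClassification.from_pretrained(rerank_repo)
rerank_tokenizer = AutoTokenizer.from_pretrained(rerank_repo)
```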

We are happy to provide more details on how we built this; reach out at hi@scite.ai if you are interested. We haven't written anything formal yet since we are still deciding on our approach; once we have settled on one, we will write up a paper fully describing it and release the final assets.

How does it perform?

This is actually quite an open question! We know how ms-marco-MiniLM-L-12-v2 performs on MS MARCO (MRR@10 on the MS MARCO dev set: 39.02), and we know how our model performs on SQuAD 2.0 + BioASQ: 88.5% exact match (93.3% F1).
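
For readers unfamiliar with the metric, MRR@10 averages, over all queries, the reciprocal rank of the first relevant result within the top 10 (counting 0 when none appears). A small illustrative sketch (not our evaluation code):

```python
# MRR@10: for each query, take the reciprocal of the rank of the first relevant
# result within the top 10 (0 if there is none), then average over all queries.
def mrr_at_10(ranked_relevance: list[list[bool]]) -> float:
    """ranked_relevance[i][r] is True if the result at (0-based) rank r
    for query i is relevant."""
    total = 0.0
    for relevance in ranked_relevance:
        for rank, is_relevant in enumerate(relevance[:10]):
            if is_relevant:
                total += 1.0 / (rank + 1)
                break
    return total / len(ranked_relevance)

# First relevant hit at rank 1 for query A and rank 3 for query B:
# (1/1 + 1/3) / 2 ≈ 0.67
print(mrr_at_10([[True, False], [False, False, True]]))
```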

But citation statements may be quite a different domain from these, and individual model performance doesn't tell us how the approach performs end to end.

We are currently developing a benchmark for assessing performance end to end, so stay tuned!

Future Plans?

Our future plans will largely be driven by user feedback; so far that mostly means improving the UX and UI and adding controls like filtering by year. Beyond this, some concrete near-term plans are:

  • Develop an approach to benchmark end-to-end performance on our corpus
  • Expand the number of datasets we use for training our QA model (we currently use only SQuAD 2.0, Natural Questions, and BioASQ, but there are many more, including some in the scientific domain)
  • Train our reranker on in-domain ranking data (MS MARCO is not science-specific)
  • Explore more approaches to query expansion (since we still ultimately use lexical search with Elasticsearch) or retrieval augmentation

Some longer-term plans are to extend the extractive tasks beyond question answering and make the answers more informative for non-experts.

Feel free to reach out to us at hi@scite.ai if you have any questions or feedback.