Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1258
emrQA: A Large Corpus for Question Answering on Electronic Medical Records

Abstract: We propose a novel methodology to generate domain-specific, large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form and 400,000+ question-answer evidence pairs. W…
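To make the re-purposing idea concrete, here is a minimal, illustrative Python sketch of template-based QA generation in the spirit the abstract describes: an existing expert annotation is combined with question templates to yield question-answer-evidence triples. All names here (Annotation, QUESTION_TEMPLATES, generate_qa_pairs) and the sample data are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of re-purposing an i2b2-style annotation into QA pairs.
# All identifiers and data are invented for illustration.
from dataclasses import dataclass

@dataclass
class Annotation:
    """A hypothetical i2b2-style expert annotation on a clinical note."""
    entity: str       # e.g. "warfarin"
    entity_type: str  # e.g. "medication"
    attribute: str    # e.g. "dosage"
    value: str        # the annotated answer, e.g. "5 mg daily"
    evidence: str     # the sentence in the note that carries the answer

# Question templates with an {entity} slot, keyed by annotation type.
QUESTION_TEMPLATES = {
    ("medication", "dosage"): [
        "What is the dosage of {entity}?",
        "How much {entity} does the patient take?",
    ],
}

def generate_qa_pairs(annotation: Annotation):
    """Instantiate every matching template with the annotated entity,
    yielding (question, answer, evidence) triples."""
    key = (annotation.entity_type, annotation.attribute)
    for template in QUESTION_TEMPLATES.get(key, []):
        question = template.format(entity=annotation.entity)
        yield question, annotation.value, annotation.evidence

if __name__ == "__main__":
    ann = Annotation("warfarin", "medication", "dosage", "5 mg daily",
                     "Patient continues warfarin 5 mg daily.")
    for question, answer, evidence in generate_qa_pairs(ann):
        print(question, "->", answer)
```

Because every template can be paired with every matching annotation, a modest set of templates multiplied over large annotated corpora yields the dataset sizes the abstract reports.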

Cited by 128 publications (143 citation statements) · References 41 publications (57 reference statements)
“…We followed the SQuAD 2.0 task setting, because it can be critical to have the system refrain from making false suggestions, especially in some clinical applications. The emrQA [6] is a large training set annotated for RCQA in the clinical domain. It was generated by template-based semantic extraction from the i2b2 NLP challenge datasets [7].…”
Section: SQuAD
confidence: 99%
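For context on the SQuAD 2.0 setting this citing work refers to, the sketch below shows an illustrative record in the public SQuAD 2.0 JSON style, where unanswerable questions carry is_impossible: true and an empty answers list, so a trained model learns to abstain rather than guess. The clinical context and questions are invented for illustration; only the field names follow the SQuAD 2.0 schema.

```python
# Illustrative SQuAD 2.0-style record with one answerable and one
# unanswerable question. The data is invented; the field names match
# the public SQuAD 2.0 JSON schema.
import json

record = {
    "context": "Patient was started on warfarin 5 mg daily for atrial fibrillation.",
    "qas": [
        {
            "id": "q1",
            "question": "What is the dosage of warfarin?",
            "answers": [{"text": "5 mg daily", "answer_start": 32}],
            "is_impossible": False,
        },
        {
            # Unanswerable: the context never mentions metformin, so the
            # model should abstain -- the "refrain from making false
            # suggestions" behavior the citing paper highlights.
            "id": "q2",
            "question": "What is the dosage of metformin?",
            "answers": [],
            "is_impossible": True,
        },
    ],
}
print(json.dumps(record, indent=2))
```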
“…However, if we were to scale our approach to real-world application, we would require external data. Therefore, for future work, given more time, we would like to use external datasets such as emrQA (Pampari et al., 2018) and explore multi-task learning, given the similarity of the three tasks, and aim to incorporate other medical tasks for better generalisation of biomedical question answering. We would also want to train the BERT models on biomedical-focused vocabulary and additional data in the future as a baseline to compare against multi-task learning.…”
Section: Question Answering Baseline System Problems
confidence: 99%
“…The patient's notes were then loaded into an annotation tool for them to mark answer text spans. Pampari, Raghavan, Liang, & Peng (2018) developed emrQA, a large clinical QA corpus generated through template-based semantic extraction from the i2b2 NLP challenge datasets. emrQA contains 7.5% why-QAs, but they mainly ask why the patient received a test or treatment, due to the partial coverage of the original challenge annotations.…”
Section: Annotating and Characterizing Clinical Sentences With Explic…
confidence: 99%