2020
DOI: 10.1609/aaai.v34i05.6380
Fine-Grained Entity Typing for Domain Independent Entity Linking

Abstract: Neural entity linking models are very powerful, but run the risk of overfitting to the domain they are trained in. For this problem, a “domain” is characterized not just by genre of text but even by factors as specific as the particular distribution of entities, as neural models tend to overfit by memorizing properties of frequent entities in a dataset. We tackle the problem of building robust entity linking models that generalize effectively and do not rely on labeled entity linking data with a specific entity…

Cited by 64 publications (100 citation statements)
References 15 publications
“…The expense and complexity of obtaining expert annotations of medical information is frequently cited as a major barrier to advancing machine learning-based technologies in medicine (67,68). While our approach did require expert-annotated data, we were able to achieve strong coding performance using a relatively small dataset of only 400 clinical documents, compared to the thousands of documents used in a recent study on extracting evidence of geriatric syndrome (28) or the tens of thousands used in foundational NLP research (69). Datasets of similar scale have been developed for automatic coding of other types of medical information (70), indicating that for a new type of health information, an initial dataset of a few hundred documents is likely to provide significant signal for machine learning.…”
Section: A Template For Expanding Automated Coding To New Concept Domains
Mentioning confidence: 94%
“…In contrast, we use the category relations directly without requiring such additional steps. Onoe and Durrett (2020) use the direct parent categories of hyperlinks for training entity linking systems.…”
Section: Related Work
Mentioning confidence: 99%
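As a rough illustration of the distant-supervision signal described in the quote above (a hyperlinked mention inherits the direct parent categories of its target Wikipedia page as type labels), here is a minimal sketch. The category table and helper function are fabricated for illustration; a real pipeline would read categories from a Wikipedia dump.

```python
# Toy illustration: a hyperlinked mention inherits the direct parent
# categories of the Wikipedia page it links to as its type labels.
# The category table below is fabricated example data.
parent_categories = {
    "Barack_Obama": ["Presidents of the United States",
                     "21st-century American politicians"],
}

def type_labels(mention_text: str, linked_title: str):
    """Pair a mention with the parent categories of the page it links to."""
    return mention_text, parent_categories.get(linked_title, [])

# e.g. the sentence "... [Obama](Barack_Obama) signed the bill ..."
print(type_labels("Obama", "Barack_Obama"))
# ('Obama', ['Presidents of the United States', '21st-century American politicians'])
```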
“…In this work, we explore a set of interpretable entity representations that are simultaneously human and machine readable. The key idea of this approach is to use fine-grained entity typing models with large type inventories (Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018; Onoe and Durrett, 2020). Given an entity mention and context words, our typing model outputs a high-dimensional vector whose values are associated with predefined fine-grained entity types.…”
Section: Introduction
Mentioning confidence: 99%
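The quoted passage describes an interface: mention plus context in, one confidence value per predefined type out. The following sketch shows only that input/output shape. The encoder is a randomly initialized stand-in, not the authors' pre-trained Transformer, and all names and sizes here are illustrative assumptions.

```python
# Minimal sketch of the typing-model interface described in the quote:
# mention + context tokens in, one independent [0, 1] confidence per type out.
import torch
import torch.nn as nn

class ToyTypingModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int, num_types: int):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # crude pooled encoder
        self.type_scorer = nn.Linear(hidden, num_types)   # one logit per type

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)                  # pooled mention+context rep
        return torch.sigmoid(self.type_scorer(h))  # independent type confidences

NUM_TYPES = 10_000  # the quote mentions inventories of tens of thousands of types
model = ToyTypingModel(vocab_size=30_522, hidden=256, num_types=NUM_TYPES)
token_ids = torch.randint(0, 30_522, (1, 32))  # placeholder tokenized mention
type_vector = model(token_ids)
print(type_vector.shape)  # torch.Size([1, 10000]); each entry is a type confidence
```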
“…Each value ranges between 0 and 1, corresponding to the confidence of the model's decision that the entity has the property given by the corresponding type. We use pre-trained Transformer-based entity typing models, trained either on a supervised entity typing dataset (Choi et al., 2018) or on a distantly supervised dataset derived from Wikipedia categories (Onoe and Durrett, 2020). The type vectors from these models, which contain tens of thousands of types, are then used as contextualized entity embeddings in downstream tasks.…”
Section: Introduction
Mentioning confidence: 99%
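To make the "contextualized entity embedding" use concrete, the sketch below scores candidate entities by the similarity of their type vectors to a mention's type vector, in the spirit of the quoted passage. The five-type inventory and all vector values are fabricated toy data; a real system would obtain these vectors from the pre-trained typing model above, and the specific similarity function is an assumption.

```python
# Sketch: type-confidence vectors doubling as interpretable entity embeddings.
# Each dimension is a human-readable type; similar profiles mean similar entities.
import numpy as np

types = ["person", "politician", "athlete", "organization", "location"]

def score(mention_vec: np.ndarray, entity_vec: np.ndarray) -> float:
    """Cosine similarity between two type-confidence vectors."""
    return float(mention_vec @ entity_vec /
                 (np.linalg.norm(mention_vec) * np.linalg.norm(entity_vec) + 1e-9))

mention = np.array([0.9, 0.8, 0.1, 0.05, 0.0])  # context suggests a politician
candidates = {
    "Entity_A (a politician)": np.array([0.95, 0.9, 0.0, 0.1, 0.0]),
    "Entity_B (an athlete)":   np.array([0.9, 0.05, 0.9, 0.0, 0.0]),
}
best = max(candidates, key=lambda name: score(mention, candidates[name]))
print(best)  # Entity_A: its type profile matches the mention's
```

Because every dimension corresponds to a named type, the same vector can be inspected by a person and consumed by a downstream model, which is what makes the representation "simultaneously human and machine readable."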