Abstract: Entity Linking is the task of assigning entities from a Knowledge Base to textual mentions of such entities in a document. State-of-the-art approaches rely on lexical and statistical features which are abundant for popular entities but sparse for unpopular ones, resulting in a clear bias towards popular entities and poor accuracy for less popular ones. In this work, we present a novel approach that is guided by a natural notion of semantic similarity which is less amenable to such bias. We adopt a unified sema…
“…Their approach was exclusively evaluated and optimized on the ACE2004, MSNBC and AQUAINT data sets on which the authors achieve state-of-the-art results. A direct comparison of our results and the results of [10] shows that both works perform equally well on the MSNBC data set. Furthermore, our approach performs better on the ACE2004 data set (0.906 vs. 0.877 F1) but loses on the AQUAINT data set (0.842 vs. 0.907 F1).…”
Section: Discussion (mentioning)
confidence: 51%
“…Anyhow, we use the work of Guo et al [10] as an entry point in the following. Their approach was exclusively evaluated and optimized on the ACE2004, MSNBC and AQUAINT data sets on which the authors achieve state-of-the-art results.…”
Section: Discussion (mentioning)
confidence: 99%
“…If the underlying KB has a lower number of entities, the average likelihood of a wrong disambiguation is also reduced. In order to compare our algorithm with the approach in [10], we introduce the concept of the Surface Form Ambiguity Degree (SFAD). The SFAD is based on two assumptions: First, both approaches are able to disambiguate all entities in the ground truth data set, i.e.…”
Section: Discussion (mentioning)
confidence: 99%
“…It incorporates, along with statistical methods, richer relational analysis of the text. In 2014, the authors Guo et al [10] proposed the use of a probability distribution resulting from a random walk with restart over a suitable entity graph to represent the semantics of entities and documents in a unified way. Their algorithm updates the semantic signature of the document as surface forms are disambiguated.…”
Section: Related Work (mentioning)
confidence: 99%
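The random walk with restart described in the excerpt above is the classic personalized-PageRank iteration. A minimal sketch, assuming a toy symmetric entity graph and a standard restart probability of 0.15 (the paper's actual graph construction and parameters are not given here):

```python
import numpy as np

def random_walk_with_restart(adj, restart, alpha=0.15, n_iter=200):
    """Power-iterate p <- (1 - alpha) * W^T p + alpha * restart, where W is
    the row-normalized adjacency matrix. The fixed point p is the kind of
    'semantic signature' distribution the excerpt describes."""
    row_sums = adj.sum(axis=1, keepdims=True)
    W = adj / np.where(row_sums == 0, 1, row_sums)  # transition matrix
    p = restart.copy()
    for _ in range(n_iter):
        p = (1 - alpha) * W.T @ p + alpha * restart
    return p

# Toy entity graph: three mutually linked entities; restart mass on entity 0.
adj = np.array([[0., 1., 1.],
                [1., 0., 1.],
                [1., 1., 0.]])
restart = np.array([1.0, 0.0, 0.0])
signature = random_walk_with_restart(adj, restart)
```

The resulting `signature` is a probability distribution that concentrates on the restart entity and decays with graph distance, which is what lets two such distributions (for an entity and for a document) be compared directly.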
“…[12, 15], most approaches have been optimised to work on a particular type of disambiguation task, for example on short Twitter messages [2], web pages [12, 28], news documents [10, 15], encyclopedias [16, 11, 5], RSS feeds [9], etc. While most authors report outperforming other entity disambiguation algorithms on their own domain/data set, they do not achieve comparable accuracy on other domains.…”
Entity disambiguation is the task of mapping ambiguous terms in natural-language text to entities in a knowledge base. It finds application in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally in facilitating artificial intelligence applications such as semantic search, reasoning, and question answering. We propose a new collective, graph-based disambiguation algorithm that utilizes semantic entity and document embeddings for robust entity disambiguation. Robust here refers to the property of achieving better-than-state-of-the-art results over a wide range of very different data sets. Our approach is also able to abstain if no appropriate entity can be found for a specific surface form. Our evaluation shows that our approach achieves significantly (>5%) better results than all other publicly available disambiguation algorithms on 7 of 9 data sets without data-set-specific tuning. Moreover, we discuss the influence of the quality of the knowledge base on the disambiguation accuracy and indicate that our algorithm achieves better results than non-publicly available state-of-the-art algorithms.
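The abstract's abstention behaviour can be illustrated with a minimal sketch: rank candidate entities by cosine similarity between embeddings and return no entity when the best score falls below a threshold. The embeddings, candidate names, and threshold below are hypothetical stand-ins, not the paper's learned parameters or its collective graph algorithm:

```python
import numpy as np

def link_mention(mention_vec, candidates, threshold=0.4):
    """Pick the candidate entity whose embedding is most cosine-similar to
    the mention's context embedding; return None (abstain) when even the
    best candidate scores below the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_name, best_sim = None, -1.0
    for name, vec in candidates.items():
        sim = cos(mention_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None

# Hypothetical 3-d embeddings for two candidate senses of "Paris".
candidates = {
    "Paris_(France)": np.array([0.9, 0.3, 0.1]),
    "Paris_(Texas)":  np.array([0.1, 0.1, 0.9]),
}
print(link_mention(np.array([1.0, 0.2, 0.0]), candidates))  # Paris_(France)
print(link_mention(np.array([0.0, 1.0, 0.0]), candidates))  # None (abstains)
```

The second call abstains because the context vector is dissimilar to both candidates, mirroring the "no appropriate entity for a specific surface form" case in the abstract.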
Digital libraries are online collections of digital objects that can include text, images, audio, or video. It has long been observed that named entities (NEs) are key to accessing digital library portals, as they are contained in most user queries. Combined with, or subsequent to, the recognition of NEs, named entity linking (NEL) connects NEs to external knowledge bases. This makes it possible to differentiate ambiguous geographical locations or names (John Smith), and means that descriptions from the knowledge bases can be used for semantic enrichment. However, the NEL task is especially challenging for large quantities of documents, as the diversity of NEs increases with the size of the collection. Additionally, digitized documents are indexed through their OCRed versions, which may contain numerous OCR errors. This paper aims to evaluate the performance of named entity linking over digitized documents with different levels of OCR quality. It is, to our knowledge, the first investigation to analyze and correlate the impact of document degradation on the performance of NEL. We tested state-of-the-art NEL techniques over several evaluation benchmarks and experimented with various types of OCR noise. We present the resulting study and subsequent recommendations on the document and OCR quality levels required to perform reliable named entity linking. We further provide the first evaluation benchmark for NEL over degraded documents.
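The kind of controlled OCR degradation this abstract describes can be sketched with a toy character-confusion model. The confusion table and rates below are illustrative assumptions; real OCR error models are engine- and font-specific:

```python
import random

def add_ocr_noise(text, rate=0.1, seed=0):
    """Corrupt text with a toy OCR confusion table (l/1, o/0, m/rn, ...),
    substituting each susceptible character with probability `rate`. A
    seeded RNG makes a given noise level reproducible across runs."""
    confusions = {"l": "1", "1": "l", "o": "0", "0": "o",
                  "e": "c", "m": "rn"}
    rng = random.Random(seed)
    return "".join(
        confusions[ch] if ch in confusions and rng.random() < rate else ch
        for ch in text
    )

clean = "Entity linking over digitized documents"
noisy = add_ocr_noise(clean, rate=0.5)
```

Sweeping `rate` over a benchmark corpus and re-running a NEL system at each level is one simple way to correlate degradation with linking accuracy, in the spirit of the experiment described above.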