Abstract. One challenge for Linked Data is scalably establishing highquality owl:sameAs links between instances (e.g., people, geographical locations, publications, etc.) in different data sources. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm for pruning the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values that are chosen using domain-independent unsupervised learning. We index the instances on the chosen predicates' literal values to efficiently look up similar instances. We evaluate our approach on two RDF and three structured datasets. We show that the traditional metrics don't always accurately reflect the relative benefits of candidate selection, and propose additional metrics. We show that our algorithm frequently outperforms alternatives and is able to process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well by applying our technique. Surprisingly, this high recall, low precision filtering mechanism frequently leads to higher F-scores in the overall system.
Cervical cancer is the second most common type of cancer for women. Existing screening programs for cervical cancer, such as Pap Smear, suffer from low sensitivity. Thus, many patients who are ill are not detected in the screening process. Using images of the cervix as an aid in cervical cancer screening has the potential to greatly improve sensitivity, and can be especially useful in resource-poor regions of the world. In this paper, we develop a data-driven computer algorithm for interpreting cervical images based on color and texture. We are able to obtain 74% sensitivity and 90% specificity when differentiating high-grade cervical lesions from low-grade lesions and normal tissue. On the same dataset, using Pap tests alone yields a sensitivity of 37% and specificity of 96%, and using HPV test alone gives a 57% sensitivity and 93% specificity. Furthermore, we develop a comprehensive algorithmic framework based on Multimodal Entity Coreference for combining various tests to perform disease classification and diagnosis. When integrating multiple tests, we adopt information gain and gradient-based approaches for learning the relative weights of different tests. In our evaluation, we present a novel algorithm that integrates cervical images, Pap, HPV, and patient age, which yields 83.21% sensitivity and 94.79% specificity, a statistically significant improvement over using any single source of information alone.
The objective of entity coreference is to determine if different mentions (e.g., person names, place names, database records, ontology instances, etc.) refer to the same real word object. Entity coreference algorithms can be used to detect duplicate database records and to determine if two Semantic Web instances represent the same underlying real word entity. The key issues in developing an entity coreference algorithm include how to locate context information and how to utilize the context appropriately. In this article, we present a novel entity coreference algorithm for ontology instances. For scalability reasons, we select a neighborhood of each instance from an RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. In an attempt to reduce the impact of distant nodes on the final similarity measure, we explore a distance-based discounting approach. To provide the best possible domain-independent matches, we propose an approach to compute the discriminability of triples in order to assign weights to the context information. We evaluated our algorithm using different instance categories from five datasets. Our experiments show that the best results are achieved by including both our discounting and triple discrimination approaches.
This paper presents the two systems we entered into the 2017 E2E NLG Challenge: TemplGen, a templated-based system and SeqGen, a neural network-based system. Through the automatic evaluation, SeqGen achieved competitive results compared to the template-based approach and to other participating systems as well. In addition to the automatic evaluation, in this paper we present and discuss the human evaluation results of our two systems.
In this paper, we present a novel entity coreference algorithm for Semantic Web instances. The key issues include how to locate context information and how to utilize the context appropriately. To collect context information, we select a neighborhood (consisting of triples) of each instance from the RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. To provide the best possible domain-independent matches, we examine an appropriate way to compute the discriminability of triples. To reduce the impact of distant nodes, we explore a distance-based discounting approach. We evaluated our algorithm using different instance categories in two datasets. Our experiments show that the best results are achieved by including both our triple discrimination and discounting approaches.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.