Abstract:Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a dif cult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar eld… Show more
“…Surveys [8,9]. review the various approaches, including named attributes computations [5], schema mapping [2,17] and duplicate detection in hierarchical data [10], all which inform the construction of profile linkage techniques.…”
Section: Record Linkage and Entity Resolutionmentioning
Abstract. Piecing together social signals from people in different online social networks is key for downstream analytics. However, users may have different usernames in different social networks, making the linkage task difficult. To enable this, we explore a probabilistic approach that uses a domain-specific prior knowledge to address this problem of online social network user profile linkage. At scale, linkage approaches that are based on a naïve pairwise comparisons that have quadratic complexity become prohibitively expensive. Our proposed threshold-based canopying framework -named OPL -reduces this pairwise comparisons, and guarantees a upper bound theoretic linear complexity with respect to the dataset size. We evaluate our approaches on real-world, large-scale datasets obtained from Twitter and Linkedin. Our probabilistic classifier integrating prior knowledge into Naïve Bayes performs at over 85% F1-measure for pairwise linkage, comparable to state-of-the-art approaches.
“…Surveys [8,9]. review the various approaches, including named attributes computations [5], schema mapping [2,17] and duplicate detection in hierarchical data [10], all which inform the construction of profile linkage techniques.…”
Section: Record Linkage and Entity Resolutionmentioning
Abstract. Piecing together social signals from people in different online social networks is key for downstream analytics. However, users may have different usernames in different social networks, making the linkage task difficult. To enable this, we explore a probabilistic approach that uses a domain-specific prior knowledge to address this problem of online social network user profile linkage. At scale, linkage approaches that are based on a naïve pairwise comparisons that have quadratic complexity become prohibitively expensive. Our proposed threshold-based canopying framework -named OPL -reduces this pairwise comparisons, and guarantees a upper bound theoretic linear complexity with respect to the dataset size. We evaluate our approaches on real-world, large-scale datasets obtained from Twitter and Linkedin. Our probabilistic classifier integrating prior knowledge into Naïve Bayes performs at over 85% F1-measure for pairwise linkage, comparable to state-of-the-art approaches.
“…Entity Linkage is the process that decides whether two descriptions refer to the same real world entity (see [12] for an overview). Actually, state-of-the-art methods from this area have also been reused and adapted in implementing entity search.…”
Abstract. We present the Entity Name System (ENS), an enabling infrastructure, which can host descriptions of named entities and provide unique identifiers, on large-scale. In this way, it opens new perspectives to realize entity-oriented, rather than keyword-oriented, Web information systems. We describe the architecture and the functionality of the ENS, along with tools, which all contribute to realize the Web of entities.
“…PowerMap uses the Watson 5 semantic search engine as a gateway to the SW. In addition, PowerMap can also query its own repositories and offers the capability to index and add new online ontologies 6 . In the third step, the Triple Similarity Service (TSS) matches the QTs to ontological expressions.…”
Section: Motivating Scenario: Question Answering On the Semantic Webmentioning
confidence: 99%
“…Basic similarity metrics based on string comparison were developed in the database community (e.g., [16,3]). These metrics are used as a basis for the majority of algorithms, which compare values of attributes of different data instances and aggregate them to make a decision about two instances referring to the same entity (see [6] for a survey). The main distinction of our work is that, in the PowerAqua scenario, the fusion of answers is done in real time.…”
Abstract. In this paper we propose algorithms for combining and ranking answers from distributed heterogeneous data sources in the context of a multi-ontology Question Answering task. Our proposal includes a merging algorithm that aggregates, combines and filters ontology-based search results and three different ranking algorithms that sort the final answers according to different criteria such as popularity, confidence and semantic interpretation of results. An experimental evaluation on a large scale corpus indicates improvements in the quality of the search results with respect to a scenario where the merging and ranking algorithms were not applied. These collective methods for merging and ranking allow to answer questions that are distributed across ontologies, while at the same time, they can filter irrelevant answers, fuse similar answers together, and elicit the most accurate answer(s) to a question.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.