Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities.Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as 'collective entity resolution'.First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references taking into account the clusters of co-occurring references.Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines in multiple real world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query-time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.
In this paper, we address the problem of entity resolution, where given many references to underlying objects, the task is to predict which references correspond to the same object. We propose a probabilistic model for collective entity resolution. Our approach differs from other recently proposed entity resolution approaches in that it is a) unsupervised, b) generative and c) introduces a hidden 'group' variable to capture collections of entities which are commonly observed together. The entity resolution decisions are not considered on an independent pairwise basis, but instead decisions are made collectively. We focus on how the use of relational links among the references can be exploited. We show how we can use Gibbs Sampling to infer the collaboration groups and the entities jointly from the observed co-author relationships among entity references and how this improves entity resolution performance. We demonstrate the utility of our approach on two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which collaborative information is useful.
Record linkage, the problem of determining when two records refer to the same entity, has applications for both data cleaning (deduplication) and for integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly domains where there is not a large amount of noise in the data, in some domains looking only at tuple values is not enough. By also examining the context of the tuple, i.e. the other tuples to which it is linked, we can come up with a more accurate linkage decision. But this additional accuracy comes at a price. In order to correctly find all duplicates, we may need to make multiple passes over the data; as linkages are discovered, they may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of making use of join information when comparing records.
Recent work has addressed the problem of detecting relevant claims for a given controversial topic. We introduce the complementary task of claim stance classification, along with the first benchmark dataset for this task. We decompose this problem into: (a) open-domain target identification for topic and claim (b) sentiment classification for each target, and (c) open-domain contrast detection between the topic and the claim targets. Manual annotation of the dataset confirms the applicability and validity of our model. We describe an implementation of our model, focusing on a novel algorithm for contrast detection. Our approach achieves promising results, and is shown to outperform several baselines, which represent the common practice of applying a single, monolithic classifier for stance classification.
PurposeThe purpose of this paper is to make a strong case for investing in information and communication technologies (ICT) for building up of quality human resource capital for economic upliftment of India. An attempt has been made to explore the possibilities of online learning (OL)/e‐learning towards building up of quality human resources in higher education for a developing nation like India. A comprehensive environmental scanning of various e‐learning experiments, tools, projects to facilitate e‐learning or various institutional level efforts has been carried out. The paper also seeks to highlight the options available with traditional institutes for deploying ICT and for implementing e‐learning.Design/methodology/approachThe paper is a descriptive account of the contemporary situation in India with regard to education especially e‐learning and draws on a variety of secondary sources both published and unpublished.FindingsArgues that the development of e‐learning has been limited and reasons out why. The challenges of traditional face‐to‐face education vis‐à‐vis e‐learning in India are enlisted and suggestions for management of the e‐learning process by institutes which intend to venture into e‐learning are enumerated. The paper advocates the urgency for the traditional institutions to put an impetus on investment in ICT for providing e‐instruction for delivery of knowledge by riding the information super highway.Research limitations/implicationsPresents a review of literature developed from secondary sources.Practical implicationsModels of e‐learning that exclude any face‐to‐face contact may have limited prospects, but blended learning offers significant potential both on and off campus and should be pursued if the benefits of e‐learning are to be fully realized.Originality/valueThis paper provides a useful overview of a scenario of OL/e‐learning in India's higher education; and, from this summary of the present situation, goes on to suggest possible ways to transform the “digital divide” into “digital opportunities”.
Facet-based sentiment analysis involves discovering the latent facets, sentiments and their associations. Traditional facet-based sentiment analysis algorithms typically perform the various tasks in sequence, and fail to take advantage of the mutual reinforcement of the tasks. Additionally, inferring sentiment levels typically requires domain knowledge or human intervention. In this paper, we propose a series of probabilistic models that jointly discover latent facets and sentiment topics, and also order the sentiment topics with respect to a multi-point scale, in a language and domain independent manner. This is achieved by simultaneously capturing both short-range syntactic structure and long range semantic dependencies between the sentiment and facet words. The models further incorporate coherence in reviews, where reviewers dwell on one facet or sentiment level before moving on, for more accurate facet and sentiment discovery. For reviews which are supplemented with ratings, our models automatically order the latent sentiment topics, without requiring seed-words or domain-knowledge. To the best of our knowledge, our work is the first attempt to combine the notions of syntactic and semantic dependencies in the domain of review mining. Further, the concept of facet and sentiment coherence has not been explored earlier either. Extensive experimental results on real world review data show that the proposed models outperform various state of the art baselines for facet-based sentiment analysis.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.