Searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. Such a query normally returns web pages related to several namesakes who happen to share the queried name, leaving the burden of disambiguating and collecting the pages relevant to a particular person on the user. In this paper, we develop a Web People Search approach that clusters web pages based on their association with different people. Our method exploits a variety of semantic information extracted from web pages, such as named entities and hyperlinks, to disambiguate among the namesakes referred to on the web pages. We demonstrate the effectiveness of our approach by testing the efficacy of the disambiguation algorithms and their impact on person search.
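The clustering idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the named entities have already been extracted per page, uses Jaccard overlap as a stand-in similarity, and merges pages with a simple union-find. The names `jaccard`, `cluster_pages`, and the `threshold` value are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two entity sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(entities: Dict[str, Set[str]], threshold: float = 0.3) -> List[List[str]]:
    """Group pages whose entity sets overlap by at least `threshold`.

    `entities` maps a page id to the named entities found on that page.
    This is an illustrative assumption, not the method from the paper.
    """
    parent = {page: page for page in entities}

    def find(page: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in combinations(entities, 2):
        if jaccard(entities[a], entities[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for page in entities:
        groups.setdefault(find(page), []).append(page)
    return sorted(sorted(group) for group in groups.values())
```

Each resulting group is meant to correspond to one namesake. A real system would also need the hyperlink evidence the abstract mentions, which this sketch leaves out.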
Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce results consisting of objects whose attributes may contain uncertainty. This uncertainty is frequently captured in the form of a set of multiple mutually exclusive value choices for each uncertain attribute along with a measure of probability for alternative values. However, the lay end-user, as well as some end-applications, might not be able to interpret the results if outputted in such a form. Thus, the question is how to present such results to the user in practice, for example, to support attribute-value selection and object selection queries the user might be interested in. Specifically, in this article we study the problem of maximizing the quality of these selection queries on top of such a probabilistic representation. The quality is measured using the standard and commonly used set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. The comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
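To make the setting concrete, the following sketch shows a threshold-based answer to an attribute-value selection query over probabilistic attributes, together with set-based quality estimates. It is an illustration under assumptions, not the approach developed in the article: the data layout, the function names `select` and `expected_quality`, and the fixed threshold are all hypothetical, and the recall figure is a ratio of expected counts rather than a true expected recall.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: object id -> {candidate value: probability},
# where the candidate values of one object are mutually exclusive.
Objects = Dict[str, Dict[str, float]]


def select(objects: Objects, value: str, threshold: float) -> List[str]:
    """Return ids of objects whose probability of having `value` meets the threshold."""
    return sorted(
        oid for oid, dist in objects.items() if dist.get(value, 0.0) >= threshold
    )


def expected_quality(objects: Objects, value: str, answer: List[str]) -> Tuple[float, float]:
    """Estimate precision and recall of `answer` for the query attribute = value.

    Precision is the expected number of correct objects in the answer divided
    by the answer size. Recall is approximated as that same expected count
    divided by the expected number of objects that truly have `value`.
    """
    expected_hits = sum(objects[oid].get(value, 0.0) for oid in answer)
    expected_relevant = sum(dist.get(value, 0.0) for dist in objects.values())
    precision = expected_hits / len(answer) if answer else 0.0
    recall = expected_hits / expected_relevant if expected_relevant else 0.0
    return precision, recall
```

Lowering the threshold trades precision for recall, which is the tension that a quality-maximizing strategy has to resolve.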
Searching for people on the Web is one of the most common query types submitted to Web search engines today. However, when a person name is queried, the returned Webpages often contain documents related to several distinct namesakes who have the queried name. The task of disambiguating and finding the Webpages related to the specific person of interest is left to the user. Many Web People Search (WePS) approaches have been developed recently that attempt to automate this disambiguation process. Nevertheless, the disambiguation quality of these techniques leaves major room for improvement. In this article, we present a new WePS approach. It is based on issuing additional auxiliary queries to the Web to gain additional knowledge about the Webpages that need to be disambiguated. Thus, the approach uses the Web as an external data source by issuing queries to collect co-occurrence statistics. These statistics are used to assess the overlap of the contextual entities extracted from the Webpages. The article also proposes a methodology to make this Web querying technique efficient. Further, the article proposes an approach that is capable of combining various types of disambiguating information, including other common types of similarities, by applying a correlation clustering approach with after-clustering of singleton clusters. These properties allow the framework to get an advantage in terms of result quality over other state-of-the-art WePS techniques.
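As a rough illustration of how co-occurrence statistics gathered from Web queries could be turned into an overlap score, the sketch below computes a Dice coefficient from hit counts. The abstract does not state which measure the article uses, so Dice is an assumption here, and the `hits` dictionary is a hypothetical stand-in for counts that would be obtained by querying a search engine.

```python
from typing import Dict, FrozenSet

# Hypothetical cache of hit counts, keyed by the set of terms in a conjunctive query.
HitCounts = Dict[FrozenSet[str], int]


def dice_overlap(hits: HitCounts, name: str, ctx_a: str, ctx_b: str) -> float:
    """Dice coefficient of two context entities, both restricted to pages mentioning `name`.

    Missing queries are treated as zero hits. Returns 0.0 when neither
    context entity co-occurs with the name at all.
    """
    count_a = hits.get(frozenset({name, ctx_a}), 0)
    count_b = hits.get(frozenset({name, ctx_b}), 0)
    count_both = hits.get(frozenset({name, ctx_a, ctx_b}), 0)
    denominator = count_a + count_b
    return 2 * count_both / denominator if denominator else 0.0
```

A high score suggests that two pages mentioning these context entities refer to the same namesake. Such a score would be one input among several to the clustering step.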
Many data mining and analysis applications use graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank). They are generally based on intuition: the analyst decides on a reasonable model that appears to fit the underlying data. In this paper, we address the problem of learning such models directly from training data. Specifically, we study a way to calibrate a connection strength measure from training data in the context of the reference disambiguation problem. Experimental evaluation demonstrates that the proposed model surpasses the best model previously used for reference disambiguation, leading to better disambiguation quality.