a b s t r a c tOne of the key challenges to realize automated processing of the information on the Web, which is the central goal of the Semantic Web, is related to the entity matching problem. There are a number of tools that reliably recognize named entities, such as persons, companies, geographic locations, in Web documents. The names of these extracted entities are, however, non-unique; the same name on different Web pages might or might not refer to the same entity. The entity matching problem concerns of identifying the entities, which are referring to the same real-world entity. This problem is very similar to the entity resolution problem studied in relational databases, however, there are also several differences. Most importantly Web pages often only contain partial or incomplete information about the entities.Similarity functions try to capture the degree of belief about the equivalence of two entities, thus they play a crucial role in entity matching. The accuracy of the similarity functions highly depends on the applied assessment techniques, but also on some specific features of the entities. We propose systematic design strategies for combined similarity functions in this context. Our method relies on the combination of multiple evidences, with the help of estimated quality of the individual similarity values and with particular attention to missing information that is common in Web context. We study the effectiveness of our method in two specific instances of the general entity matching problem, namely the person name disambiguation and the Twitter message classification problem. In both cases, using our techniques in a very simple algorithmic framework we obtained better results than the state-of-the-art methods.
No abstract
Abstract-As person names are non-unique, the same name on different Web pages might or might not refer to the same real-world person. This entity identification problem is one of the most challenging issues in realizing the Semantic Web or entity-oriented search. We address this disambiguation problem, which is very similar to the entity resolution problem studied in relational databases, however there are also several differences. Most importantly Web pages often only contain partial or incomplete information about the persons, moreover the available information is very heterogeneous, thus we are only able to obtain some uncertain evidence about whether two names refer to the same person using similarity functions. These similarity functions capture some aspects of the similarities between Web-pages, where the names occur, thus they perform very differently for the different names. We analyze some data engineering techniques to cope with the limited accuracy of the similarity functions and to combine multiple functions. Even with our simple techniques we could demonstrate systematic performance improvements and produce comparable results to state-of-the-art methods.
Twitter is a micro-blogging service on the Web, where people can enter short messages, which then become visible to other users of the service. While the topics of these messages varies, there are a lot of messages where the users express their opinions about companies or products. Since the twitter service is very popular, the messages form a rich source of information for companies. They can learn with the help of data mining and sentiment analysis techniques, how their customers like their products or what is the general perception of the company. There is however a great obstacle for analyzing the data directly: as the company names are often ambiguous, one needs first to identify, which messages are related to the company. In this paper we address this question. We present various techniques to classify tweet messages, whether they are related to a given company or not, for example, whether a message containing the keyword "apple" is about the company Apple Inc.. We present simple techniques, which make use of company profiles, which we created semi-automatically from external Web sources. Our advanced techniques take ambiguity estimations into account and also automatically extend the company profiles from the twitter stream itself. We demonstrate the effectiveness of our methods through an extensive set of experiments.
Abstract-Pervasive web and social networks are becoming part of everyone's life. Users through their activities on these networks are leaving traces of their expertise, interests and personalities. With the advances in Web mining and user modeling techniques it is possible to leverage the user social network activity history to extract the semantics of user-generated content. In this work we explore various techniques for constructing user profiles based on the content they publish on social networks. We further show that one of the advantages of maintaining social network user profiles is to provide the context for better understanding of microposts. We propose and experimentally evaluate different approaches for entity disambiguation in social networks based on syntactic and semantic features on top of two different social networks: a general-interest network (i.e., Twitter) and a domain-specific network (i.e., StackOverflow). We demonstrate how disambiguation accuracy increases when considering enriched user profiles integrating content from both social networks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.