Finding similar identities among objects from multiple web sources

Carvalho, Joyce C. P.; Silva, Altigran S. da

doi:10.1145/956699.956719

Cited by 26 publications

(11 citation statements)

References 5 publications

“…Our approach differs from others in the literature since it can be used to identify and match objects more complexly structured (e.g., XML documents) and not only objects with a flat structure such as relations. The effectiveness of our approach has been demonstrated by means of experiments with real Web data sources from different domains, whose results have reached precision levels above 75% [17].…”

Section: Web Data Integration (mentioning)

confidence: 96%

“…We have also worked on the problem of integrating data from multiple Web sources [17]. We consider Web sources with objects that can have different formats and structures, which makes it difficult to identify those that can be matched together.…”

Section: Web Data Integration (mentioning)

confidence: 99%

“…The main aim of this research effort is to develop methods and tools for dealing with data available on the Web and in other non-structured sources (e.g., XML documents), thus providing facilities similar to those available in traditional database systems for managing such data. Specific problems addressed by the two research groups include or are related to data and topic-oriented focused crawling [7]-[9], Web data extraction [10]-[12], unstructured queries over semistructured and structured data [13]-[15], and Web data integration [16], [17].…”

Section: Introduction (mentioning)

confidence: 99%

Cooperative Research on Web Data Management at UFMG and UFAM - A Brief Report

Laender

Silva

2008

2008 Latin American Web Conference

The World Wide Web has become a huge repository of data of interest for a variety of application domains. However, the same features that have made the Web so useful and popular also impose important restrictions on the way the data it contains can be manipulated. In particular, in the traditional Web scenario, there is an inherent difficulty in gaining access to data that is implicitly present in Web pages but is not readily available. The term Web Data Management (WDM) has been used to refer to the study of problems related to fetching, extracting, querying, modeling, storing, transforming, and integrating data available on the Web. These issues have been growing in importance in the scientific community in recent years, as can be seen from the considerable space devoted to them in important publication venues. This interest is justified not only by the scientific and technical challenges involved in WDM problems, but also, and especially, by the growing demand from industry for solving such problems. In this paper, we present a brief report on the WDM cooperative research carried out by the Database Laboratory at the Federal University of Minas Gerais (LBD/UFMG) and the Information Technology Group at the Federal University of Amazonas (GTI/UFAM). The main aim of this research effort is to develop methods and tools for dealing with data available on the Web and in other non-structured sources (e.g., XML documents), thus providing facilities similar to those available in traditional database systems for managing such data.

“…Chaudhuri et al. [12] propose a probabilistic algorithm for retrieving the K records closest to an input record, according to a fuzzy match similarity function that considers the weight of words using the Inverse Document Frequency (IDF) [10]. Carvalho and da Silva [13] also use the vector space model to calculate the similarity between objects from multiple sources. Their approach can be used to deduplicate objects with complex structures such as XML documents.…”

Section: Related Work (mentioning)

confidence: 99%
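The statement above refers to vector space similarity with IDF term weighting. As a rough illustration only, the following minimal sketch compares flattened object descriptions by cosine similarity over TF-IDF vectors. It is not the method of Carvalho and da Silva or of Chaudhuri et al.; the function names, the whitespace tokenizer, the smoothed IDF formula, and the sample strings are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    return text.lower().split()


def idf_weights(documents):
    # Document frequency: number of documents in which each term appears.
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    # Smoothed IDF (an assumed variant) so that every weight stays positive.
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse vector: term -> term frequency times IDF weight.
    tf = Counter(tokenize(text))
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    # Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical object descriptions, e.g. text flattened from XML records.
objects = [
    "finding similar identities among objects from multiple web sources",
    "finding similar identities among web objects",
    "learning to deduplicate records in digital libraries",
]
idf = idf_weights(objects)
vectors = [tfidf_vector(text, idf) for text in objects]

near = cosine(vectors[0], vectors[1])  # the two descriptions share most terms
far = cosine(vectors[0], vectors[2])   # the two descriptions share no terms
```

In this sketch a pair would be treated as a candidate match when its score exceeds some cutoff; how that cutoff is chosen is outside the scope of the example.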

An Automatic Approach for Duplicate Bibliographic Metadata Identification Using Classification

Borges

Becker

Heuser

et al. 2011

2011 30th International Conference of the Chilean Computer Science Society

References are the main descriptive metadata used by digital libraries of scientific articles. These references can be represented in several formats and styles, and considerable content variations can also occur in some metadata fields such as title, author names and publication venue. Duplicate records affect the quality of digital library services, so they need to be appropriately identified and treated. This paper presents an approach to identifying duplicated bibliographic metadata. We extend our previous work so that instead of setting thresholds based on the scores returned by similarity functions, we use the scores to train classification algorithms which automatically identify duplicated references. The experiments show that the classifiers increase the quality of results by up to 11% when compared to our unsupervised heuristic-based approach.
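The difference between thresholding similarity scores and training a classifier on them can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which classifiers or similarity functions were used, so a nearest-centroid classifier stands in here, and the field names, scores, labels, and cutoff are invented for the example.

```python
def mean_threshold(scores, cutoff):
    # Baseline: call a pair a duplicate if its average score reaches a
    # manually chosen cutoff.
    return sum(scores) / len(scores) >= cutoff


def centroid(rows):
    # Column-wise mean of a list of equal-length score vectors.
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def train(score_vectors, labels):
    # Nearest-centroid "training": one centroid per class.
    duplicates = [v for v, y in zip(score_vectors, labels) if y]
    distinct = [v for v, y in zip(score_vectors, labels) if not y]
    return centroid(duplicates), centroid(distinct)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model, scores):
    # A pair is a duplicate if it lies closer to the duplicate centroid.
    duplicate_centroid, distinct_centroid = model
    return (squared_distance(scores, duplicate_centroid)
            < squared_distance(scores, distinct_centroid))


# Hypothetical per-pair scores: (title, author names, publication venue).
training_scores = [
    [0.95, 0.90, 0.40],
    [0.90, 0.85, 0.20],
    [0.30, 0.20, 0.90],
    [0.20, 0.10, 0.80],
]
training_labels = [True, True, False, False]
model = train(training_scores, training_labels)

# Same title and authors, but the venue string is written very differently.
pair = [0.92, 0.88, 0.10]
by_threshold = mean_threshold(pair, cutoff=0.7)
by_classifier = predict(model, pair)
```

With these invented numbers the averaged score falls below the cutoff, while the trained model, having seen labeled duplicates with low venue similarity, places the pair in the duplicate class. The point is only that labeled examples let the per-field evidence be weighed without hand-set thresholds.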

“…Among the main challenges are the problems of choosing what evidence to use, and how to find the best weighting schema to apply to the chosen evidence. This has led the research community to develop a number of alternative methods [3,4,6,7,8,9,12,18].…”

Section: Related Work (mentioning)

confidence: 99%

Learning to deduplicate

Laender

Silva

Gonçalves

et al. 2006

Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries

Self Cite

Identifying record replicas in Digital Libraries and other types of digital repositories is fundamental to improve the quality of their content and services as well as to yield eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine evidence used to identify records as being replicas. In this paper, we present the results of experiments we have carried out with a novel Machine Learning approach we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), is able to automatically generate similarity functions to identify record replicas in a given repository. The generated similarity functions properly combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
