The merge/purge problem for large databases

Hernández, Mauricio A.; Stolfo, Salvatore J.

doi:10.1145/568271.223807

Cited by 306 publications

(247 citation statements)

References 9 publications

Citation types: Supporting, Mentioning (244), Contrasting, Unclassified


“…Extended-Manual manually specifies the matching rules (e.g., "if similarity(name1,name2) ≥ 0.8 but position=student then the two tuples do not match"). Thus, in a sense this method extends the manual method described in [46], which would exploit only shared attributes such as "name1" and "name2". Extended-AR is similar to Extended-Manual, but uses the association rule classification method of [63] to guide the process of generating rules.…”

Section: Algorithms and Methodologies (mentioning)

confidence: 99%
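The rule quoted in the statement above can be sketched in a few lines. This is only an illustration, not the cited authors' implementation: the `similarity` measure (Python's `difflib` ratio), the field names `name` and `position`, and the reading that a "student" value on either tuple vetoes the match are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a string similarity in [0, 1]. Assumed measure: difflib ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuples_match(t1: dict, t2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical hand-written matching rule in the spirit of the quote."""
    # Baseline manual rule: compare only the shared attribute (the name).
    if similarity(t1["name"], t2["name"]) < threshold:
        return False
    # Extended rule: a non-shared attribute can veto a name-based match.
    # Assumption: a "student" position on either tuple blocks the match.
    if "student" in (t1.get("position"), t2.get("position")):
        return False
    return True


a = {"name": "Jon Smith", "position": "professor"}
b = {"name": "John Smith"}
c = {"name": "John Smith", "position": "student"}
print(tuples_match(a, b))  # names are similar and nothing vetoes the match
print(tuples_match(a, c))  # names are similar but the position vetoes it
```

The point of the extension is that the veto uses an attribute that need not exist in both sources, which a rule restricted to shared attributes cannot express.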

“…Object matching is often used to consolidate information about entities and to remove duplicates when merging multiple information sources. As such, it plays an important role in many information processing contexts, including information integration, data warehousing, information extraction, and text join in databases (e.g., [95,21,72,98,7,57,1,90,42,46]). …”

Section: Object Matching Across Disparate Data Sources (mentioning)

confidence: 99%

“…The performance of matching algorithms have typically been evaluated with matching accuracy and runtime efficiency [46,1]. As the first step, in this section we shall focus on improving matching accuracy.…”

Section: Problem Definition (mentioning)

confidence: 99%


Mining for Information Discovery on the Web: Overview and Illustrative Research

Yu; Doan

2004

Intelligent Technologies for Information Analysis


Summary. The Web has become a fertile ground for numerous research activities in mining. In this chapter we discuss research on finding targeted information on the Web. First, we briefly survey the research area. We focus in particular on two key issues: (a) mining to impose structures over Web data, for example by building taxonomies and portals, to aid in Web navigation, and (b) mining to build information processing systems, such as search engines, question answering systems, and data integration ones. Next, we describe two recent Web mining projects that illustrate the use of mining techniques to address the above two key issues. We conclude by briefly discussing novel research opportunities in the area of mining for information discovery on the Web.


“…Identification of tuples representing the same individual is accomplished by the unique object identifier ID. The problem of assigning these object identifiers is not considered within this paper, i.e., we assume a preceding duplicate detection step (see for example Hernandez and Stolfo (1995)). Note that we are only interested in finding update operations that introduce conflicts between the overlapping parts of databases.…”

Section: Reproducing Conflict Generation (mentioning)

confidence: 99%

Classification of Contradiction Patterns

Müller; Leser; Freytag

2007

Studies in Classification, Data Analysis, and Knowledge Organization


Abstract. Solving conflicts between overlapping databases requires an understanding of the reasons that lead to the inconsistencies. Provided that conflicts do not occur randomly but follow certain regularities, patterns in the form of "If condition Then conflict" provide a valuable means to facilitate their understanding. In previous work, we adopt existing association rule mining algorithms to identify such patterns. Within this paper we discuss extensions to our initial approach aimed at identifying possible update operations that caused the conflicts between the databases. This is done by restricting the items used for pattern mining. We further propose a classification of patterns based on mappings between the contradicting values to represent special cases of conflict generating updates.


“…Record linkage (RL), also known as the merge-purge [12] or object identity [24] problem, is one of the key tasks in data cleaning [10] and integration [9]. Its goal is to identify related records that are associated with the same entity from multiple databases.…”

Section: Introduction (mentioning)

confidence: 99%

Privacy Preserving Group Linkage

Chen; Luo; et al.

2011

Lecture Notes in Computer Science


Abstract. The problem of privacy-preserving record linkage is to find the intersection of records from two parties, while not revealing any private records to each other. Recently, group linkage has been introduced to measure the similarity of groups of records [19]. When we extend the traditional privacy-preserving record linkage methods to group linkage measurement, group membership privacy becomes vulnerable: record identity could be discovered from unlinked groups. In this paper, we introduce threshold privacy-preserving group linkage (TPPGL) schemes, in which both parties only learn whether or not the groups are linked. Therefore, our approach is secure under group membership inference attacks. In experiments, we show that using the proposed TPPGL schemes, group membership privacy is well protected against inference attacks with a reasonable overhead.

