2004
DOI: 10.1007/978-3-540-24775-3_75
Febrl – A Parallel Open Source Data Linkage System

Cited by 87 publications (79 citation statements)
References 10 publications
“…Training-based matchers: BN [38], MOMA [55], SERF [5], Active Atlas [53,54], MARLIN [11,12], Multiple Classifier System [62], Operator Trees [13], TAILOR [24], FEBRL [18,17], STEM [36], Context Based Framework [16]. …between two entities. The previously proposed approaches mostly assume that corresponding attributes from the input datasets have been determined beforehand, either manually or with the help of schema matching.…”
Section: Matchers
confidence: 99%
“…The authors of [4] show how the match computation can be parallelized among several cores on a single node. Parallel evaluation of the Cartesian product of two sources is described in [8].…”
Section: Related Work
confidence: 99%
“…Additionally, while for many regular words there is only one correct spelling, there are often different written forms of proper names, for example 'Gail' and 'Gayle'. The main task of data cleaning and standardisation is the conversion of the raw input data into well defined, consistent forms and the resolution of inconsistencies in the way information is represented or encoded [9,10].…”
Section: Data Linkage Process
confidence: 99%
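One common way to handle written name variants of this kind is phonetic encoding, which collapses different spellings of the same name onto one code. The following hand-rolled Soundex sketch (an illustration, not Febrl's code) shows 'Gail' and 'Gayle' receiving identical codes:

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus three digits from consonant groups."""
    groups = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
              **dict.fromkeys("dt", "3"), "l": "4",
              **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    code = name[0].upper()
    prev = groups.get(name[0], "")
    for ch in name[1:]:
        digit = groups.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate consonant groups
            prev = digit
    return (code + "000")[:4]

# Different written forms of the same name receive identical codes:
# soundex("Gail") == soundex("Gayle") == "G400"
```

Comparing phonetic codes instead of raw strings lets a linkage system treat such variants as candidate matches during cleaning and comparison.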
“…As discussed earlier, this is computationally feasible only for small data sets. In practice, blocking, filtering, indexing, searching, or sorting algorithms [2,9,15,21,23] are used to reduce the number of record pair comparisons as discussed in Section 2.1. The aim of such algorithms is to cheaply remove as many of the obvious non-matches from the set of non-matches U as possible, without removing any record pairs from the set of matches M.…”
Section: Blocking and Complexity Measures
confidence: 99%
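As an illustration of the blocking idea (a sketch with invented records, not Febrl's actual indexing code), standard blocking groups records by a cheap key and generates candidate pairs only within each block:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, blocking_key):
    """Yield only the record pairs that share a blocking-key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"id": 1, "surname": "smith", "postcode": "2600"},
    {"id": 2, "surname": "smyth", "postcode": "2600"},
    {"id": 3, "surname": "jones", "postcode": "2600"},
]
# Block on (first letter of surname, postcode): only 1 candidate pair
# survives out of the 3 pairs a full pairwise comparison would generate.
key = lambda r: (r["surname"][0], r["postcode"])
candidates = list(block_pairs(records, key))
```

The obvious non-match smith/jones is never compared, while the likely match smith/smyth survives because both fall into the same block. Choosing a blocking key that never splits true matches across blocks is exactly the difficulty the quoted passage describes.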