A novel ensemble learning approach to unsupervised record linkage

Jurek, Anna; Hong, Jun; Yuan, Congying; Liu, Weiru

doi:10.1016/j.is.2017.06.006

Cited by 34 publications

(21 citation statements)

References 31 publications

“…Those examples can be generated manually one-by-one, or by leveraging tools like Snorkel. [94] generates an ensemble of automatic self-learning models that use different similarity measures. To enhance the automatic self-learning process, it incorporates attribute weighting into the automatic seed selection for each of the self-learning models.…”

Section: Supervised Learning Adaptive Matching [mentioning]

confidence: 99%

An Overview of End-to-End Entity Resolution for Big Data

Efthymiou et al. 2020

One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when we should satisfy more than one of the Big Data characteristics simultaneously (i.e., Volume and Velocity with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by database, semantic Web, and machine learning communities in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.

“…Table X shows a comparison between the best F1 result obtained by the proposed framework and the best F1 result of other approaches. Junk et al [75] applies different classifiers over four textual datasets not related to the pharmaceutical domain, obtaining a F1 measure of 0,96. The work of Kim & Giles [76] is based on a financial dataset and obtain a F1 measure of 0,9774 with Random Forest in the best scenario.…”

Section: Analysis Of Means Plot Include the Upper Decision Limit (Udl [mentioning]

confidence: 99%

“…The main motivation of this research was the necessity of great pharmaceutical manufacturers to analyse a huge number of products related to their worldwide activities, considering that the same product can be registered several times by different systems using different attributes. The task of finding the records and match the products cannot be done by a human in a reasonable way, because the number of records to be matched is extremely high.…”

[Table fragment spilled into the excerpt: [75] 0.96; Kim & Giles (2016) [76] 0.9744; Proposed SVM 0.85]

Section: Conclusion and Future Lines [mentioning]

confidence: 99%

Automatic Learning Framework for Pharmaceutical Record Matching

et al. 2020

Pharmaceutical manufacturers need to analyse a vast number of products in their daily activities. Many times, the same product can be registered several times by different systems using different attributes, and these companies require accurate and quality information regarding their products since these products are drugs. The central hypothesis of this research work is that machine learning can be applied to this domain to efficiently merge different data sources and match the records related to the same product. No human is able to do this in a reasonable way because the number of records to be matched is extremely high. This paper presents a framework for pharmaceutical record matching based on machine learning techniques in a big data environment. The proposed framework aims to exploit the well-known rules for the matching of records from different databases for training machine learning models. Then the trained models are evaluated by predicting matches with records that do not follow these known rules. Finally, the production environment is simulated by generating a huge amount of combinations of records and predicting the matches. The obtained results show that, despite the good results obtained with the training datasets, in the production environment, the average accuracy of the best model is around 85%. That shows that matches which do not follow the known rules can be predicted and, considering that there is not a human way to process this amount of data, the results are promising.

“…In more recent work the authors proposed to address the problem of unsupervised record linkage using graphical models [14] and multi view ensemble self-learning [15]. Discussion.…”

Section: Related Work [mentioning]

confidence: 99%

It Pays to Be Certain: Unsupervised Record Linkage via Ambiguity Minimization

Jurek, Deepak 2018

Advances in Knowledge Discovery and Data Mining

Self Cite

Record linkage (RL) is a process of identifying records that refer to the same real-world entity. Many existing approaches to RL apply supervised machine learning (ML) techniques to generate a classification model that classifies a pair of records as either linked or non-linked. In such techniques, the labeled data helps guide the choice and relative importance to similarity measures to be employed in RL. Unsupervised RL is therefore a more challenging problem since the quality of similarity measures needs to be estimated in the absence of linkage labels. In this paper we propose a novel optimization approach to unsupervised RL. We define a scoring technique which aggregates similarities between two records along all attributes and all available similarity measures using a weighted sum formulation. The core idea behind our method is embodied in an objective function representing the overall ambiguity of the scoring across a dataset. Our goal is to iteratively optimize the objective function to progressively refine estimates of the scoring weights in the direction of lesser overall ambiguity. We have evaluated our approach on multiple real world datasets which are commonly used in the RL community. Our experimental results show that our proposed approach outperforms state-of-the-art techniques, while being orders of magnitude faster.
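The weighted-sum scoring described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the `SimVector` type, the function names, and the sample attribute and measure labels are assumptions for illustration. In particular, the `ambiguity` function is only one plausible stand-in (it is largest when scores sit near 0.5); the abstract does not define the paper's actual objective function, and this is not it.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: similarity values in [0, 1] for one record pair,
# keyed by (attribute, similarity measure).
SimVector = Dict[Tuple[str, str], float]


def weighted_score(sims: SimVector, weights: Dict[Tuple[str, str], float]) -> float:
    """Weighted sum of similarities, divided by the total weight so the
    result stays in [0, 1] when all weights are non-negative."""
    total = sum(weights[key] for key in sims)
    if total == 0:
        return 0.0
    return sum(weights[key] * value for key, value in sims.items()) / total


def ambiguity(scores: List[float]) -> float:
    """Illustrative ambiguity measure (an assumption, not the paper's
    objective): the mean of s * (1 - s) over all pair scores."""
    if not scores:
        return 0.0
    return sum(s * (1.0 - s) for s in scores) / len(scores)


# Example pair: two similarity measures applied to a single "name" attribute.
pair = {("name", "jaro"): 1.0, ("name", "jaccard"): 0.5}
equal_weights = {("name", "jaro"): 1.0, ("name", "jaccard"): 1.0}
skewed_weights = {("name", "jaro"): 3.0, ("name", "jaccard"): 1.0}
```

Under this stand-in measure, raising the weight of the measure that separates pairs more sharply moves scores away from 0.5 and so lowers the ambiguity value, which is the general direction of refinement the abstract describes.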
