2017
DOI: 10.1002/sim.7287

A scaling approach to record linkage

Abstract: With increasing availability of large datasets derived from administrative and other sources, there is an increasing demand for the successful linking of these to provide rich sources of data for further analysis. Variation in the quality of identifiers used to carry out linkage means that existing approaches are often based upon 'probabilistic' models, which are based on a number of assumptions, and can make heavy computational demands. In this paper, we suggest a new approach to classifying record pairs in l…

citations
Cited by 17 publications
(16 citation statements)
references
References 14 publications
“… 20 Alternative methods also exist. 21 , 22 In a previous study, we used a combination of these techniques to create a mother-baby cohort of records from Hospital Episode Statistics (HES), an administrative data resource that holds detailed information of all admissions to National Health Service (NHS) hospitals in England. 23 The methods are described in full elsewhere, but comprised deterministic and probabilistic linkage of de-identified information in data items contained in both the mother’s delivery record and the baby’s birth record.…”
Section: Evaluating the Impact Of Data Linkage Error
mentioning
confidence: 99%
“…The ‘Bloomsbury Group’ of the Administrative Data Research Centre for England, led by University College London in conjunction with researchers from the London School of Hygiene and Tropical Medicine and Institute for Fiscal Studies, and partners at the Office for National Statistics, has been directly grappling with these methodological issues for the past 4 years. We provide training on the linkage and analysis of administrative data, focusing on issues of data quality and linkage error and resources to help people to understand high value data sets (Herbert et al ., ) and linked data (Gilbert et al ., ) (challenges 1 and 12), constructed data quality metrics and showed how they can be incorporated in data linkage and analysis (Harron et al ., ; Hagger‐Johnson et al ., ) (challenge 3), demonstrated how linkage error can influence statistical conclusions (Harron et al ., ) (challenge 6), developed improved methods for record linkage (Goldstein et al ., ; Harron et al ., ) and techniques for analysis of linked data that better account for limitations of data quality and uncertainty (Goldstein et al ., ; Harron et al ., ) (challenges 12 and 13), are collaborating with population‐based longitudinal studies and the Office for National Statistics on triangulating data across multiple administrative, registry and primary data collections (challenge 14), are developing new methods for anonymization that retain the necessary properties for valid and efficient statistical inference (challenge 15) and tackle all the challenges in our programme of exemplary studies of linked data, which are designed to highlight both the research potential and the limitations of specific administrative data sets. …”
mentioning
confidence: 97%
“…There are a number of methods for deriving such probabilistic match weights or scores but most are based on the Fellegi-Sunter algorithm, which uses the conditional probability of agreement on an identifier, given whether two records belong to the same individual or not (Fellegi and Sunter 1969; Sayers et al 2016). However, this approach relies on a number of assumptions (Goldstein et al 2017). It also involves an initial estimation of conditional probabilities either using training data (where the true match status is known for a sample of records) or using statistical techniques such as the EM algorithm.…”
Section: Methods For Data Linkage
mentioning
confidence: 99%
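The match weights mentioned in the statement above can be illustrated with a minimal sketch. It assumes the common textbook form of the Fellegi-Sunter weight (a base-2 log likelihood ratio per identifier) and conditional independence between identifiers, so that per-identifier weights can be summed. The function names `fs_weight` and `match_score` are hypothetical, and the m- and u-probabilities are invented for illustration; they are not taken from the cited papers.

```python
import math


def fs_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood-ratio weight for a single identifier.

    m = P(identifier agrees | records belong to the same individual)
    u = P(identifier agrees | records belong to different individuals)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(agreement: dict, m_probs: dict, u_probs: dict) -> float:
    """Sum the per-identifier weights for one record pair.

    Summing is only valid under the assumed conditional independence
    of identifiers given the true match status.
    """
    return sum(
        fs_weight(agrees, m_probs[field], u_probs[field])
        for field, agrees in agreement.items()
    )


# Illustrative (made-up) probabilities; in practice these would be
# estimated from training data or with the EM algorithm.
M_PROBS = {"surname": 0.95, "date_of_birth": 0.98, "postcode": 0.80}
U_PROBS = {"surname": 0.01, "date_of_birth": 0.005, "postcode": 0.02}

pair = {"surname": True, "date_of_birth": True, "postcode": False}
print(round(match_score(pair, M_PROBS, U_PROBS), 2))
```

A pair is then classified by comparing its score with one or two thresholds chosen by the analyst; the choice of thresholds is outside the scope of this sketch.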
“…However, suitable training datasets are rarely available to support these methods (Christen and Pudjijono 2009). An alternative unsupervised method, employing a scaling algorithm originating from correspondence analysis, has been developed to overcome this problem but is yet to be implemented outside of simulation studies (Goldstein et al 2017). The scaling algorithm assigns scores to discrete categories or degrees of agreement/disagreement based upon minimisation of a suitable loss function.…”
Section: Methods For Data Linkage
mentioning
confidence: 99%