Linkage-Data Linear Regression

Zhang, Li‐Chun; Tuoto, Tiziana

doi:10.1111/rssa.12630

Cited by 12 publications

(20 citation statements)

References 40 publications


“…Another interesting problem for further research is how best to incorporate “calibrating” population summary information into the robust estimation procedures that we describe in this article. The pseudo‐OLS method of Zhang and Tuoto (2021) is based on the fact that under independence of different population units, the covariance between $\mathbf{x}_i$ and $y_i^{*}$ is λ times the covariance between $\mathbf{x}_i$ and $y_i$, where λ is an overall probability of correct linkage. However, it is unlikely that linkage errors will be homogeneous and extension of this approach to heterogeneous linkage errors, particularly those that vary between blocks, and clustered data seems worthwhile.…”

Section: Discussion (mentioning)

confidence: 99%
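
As a rough numerical illustration of the covariance identity quoted above, the Python sketch below simulates a single block in which a fraction 1 − λ of records receive another unit's y value. It is not the pseudo-OLS estimator of Zhang and Tuoto (2021). It only checks that cov(x, y*) is close to λ times cov(x, y), and that dividing the naive slope by a known λ undoes the attenuation. The function names, sample size, and regression coefficients are assumptions chosen for the example.

```python
import numpy as np


def simulate_linked_data(n=20000, lam=0.8, beta=2.0, seed=0):
    """Simulate (x, y) and a linked y_star in which a share 1 - lam is mislinked.

    Hypothetical setup: one block, scalar x, and homogeneous linkage error.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + beta * x + rng.normal(size=n)

    y_star = y.copy()
    n_wrong = int(round((1.0 - lam) * n))
    wrong = rng.choice(n, size=n_wrong, replace=False)
    # Cyclic shift among the mislinked records, so each of them
    # receives the y value of a different unit.
    y_star[wrong] = y[np.roll(wrong, 1)]
    return x, y, y_star


def cov(a, b):
    """Sample covariance with divisor n."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))


lam = 0.8
x, y, y_star = simulate_linked_data(lam=lam)

ratio = cov(x, y_star) / cov(x, y)           # expected to be close to lam
naive_slope = cov(x, y_star) / cov(x, x)     # attenuated towards zero
corrected_slope = naive_slope / lam          # simple moment correction, lam known

print(f"cov ratio: {ratio:.3f}")
print(f"naive slope: {naive_slope:.3f}, corrected slope: {corrected_slope:.3f}")
```

In practice λ is rarely known exactly and, as the quoted passage notes, is unlikely to be homogeneous across blocks, so this correction should be read only as a demonstration of the identity.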

“…Figure 1 is an illustration of the role of blocks in data linkage, using fictitious individual and income data (data set X) and consumption data (data set Y). This figure is taken from Zhang and Tuoto (2021) and has been slightly modified to insert gender as a blocking variable. In Figure 1, we have five correct links (solid arrows) and two false links (dashed arrows).…”

Section: Regression Using Linked Data (mentioning)

confidence: 99%

“…Lahiri and Larsen (2005) propose a model of a linear regression relationship between variables in linked files and use estimated linkage probabilities based on mixture models to define a bias corrected estimator of the linear regression coefficients. Zhang and Tuoto (2021) propose a pseudo‐OLS method for secondary linear regression analysis, where neither the match‐key variables nor the unlinked records are available to the analyst, and develop a diagnostic test for the assumption of noninformative linkage errors. Chambers (2009) takes an estimating equation approach to eliminating linkage bias in the case of linear regression analysis.…”

Section: Introduction (mentioning)

confidence: 99%


Robust regression using probabilistically linked data

Chambers, Fabrizi, Ranalli, et al. 2022

WIREs Computational Stats


There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.

This article is categorized under: Algorithms and Computational Methods > Maximum Likelihood Methods; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical and Graphical Methods of Data Analysis > Multivariate Analysis

Keywords: efficiency, exchangeable linkage error, finite population inference, linked data, regression, robust estimation

INTRODUCTION

Data linkage is now an inextricable part of how data are obtained for analysis in modern science and public administration. The classical paradigm of first identifying a well-defined target population that can provide the data of interest and then measuring the values of the relevant variables for the individuals making up this population, or from a sample taken from it, is now often replaced by a data integration approach. This first links the records for the same individuals that are stored in the many population registers that are now available and then treats the resulting linked


“…Because the possible matches require manual review which is sometimes not available, Grannis et al. (2003) propose to establish only a single threshold to avoid human review. Although the matching scores and the posterior probabilities produce the same ordering for record pairs (Larsen & Rubin, 2001), the posterior probabilities are preferable in our case because they may be useful for further analyses (Lahiri & Larsen, 2005; Kim & Chambers, 2012; Hof & Zwinderman, 2012; Zhang & Tuoto, 2020).…”

Section: Probabilistic Record Linkage (mentioning)

confidence: 99%
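
The ordering claim in the statement above can be checked with a small Python sketch. It assumes the standard Fellegi-Sunter setup, in which each record pair has a likelihood m under the match class and u under the non-match class, with a prior match proportion p. The matching score is then log(m/u), and the posterior probability is p·m / (p·m + (1 − p)·u), which is an increasing function of the score. The numerical values and function names below are made up for illustration.

```python
import numpy as np


def match_weight(m, u):
    """Matching score: log-likelihood ratio of match versus non-match."""
    return np.log(m / u)


def posterior_match_prob(m, u, p):
    """Posterior probability of a match by Bayes' rule, with prior proportion p."""
    return p * m / (p * m + (1.0 - p) * u)


# Hypothetical pair-level likelihoods for four record pairs.
m = np.array([0.9, 0.5, 0.1, 0.7])
u = np.array([0.05, 0.2, 0.4, 0.1])
p = 0.1  # assumed share of true matches among candidate pairs

weights = match_weight(m, u)
posteriors = posterior_match_prob(m, u, p)

# Both quantities rank the record pairs identically.
same_order = np.array_equal(np.argsort(weights), np.argsort(posteriors))
print(same_order)
```

The posterior has the additional advantage, noted in the quoted passage, of being on a probability scale that later analyses can use directly.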

Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system

Chauvet, Happe, et al. 2023

Computational Statistics & Data Analysis


“…A setting specifically considered in Chambers and da Silva (2020) and Kim and Chambers (2012) is termed exchangeable linkage error (ELE), in which the off‐diagonal elements of $Q_b$ are constant (and hence all diagonal elements are equal to a complementary constant). Novel insights into the ELE setting are recently presented in Zhang and Tuoto (2020).…”

Section: Linear Regression With Linked Datasets (mentioning)

confidence: 99%
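
A minimal Python sketch of such a matrix follows, assuming each row of $Q_b$ sums to one so that it can be read as the expected value of a random permutation matrix within block b. Under that assumption, a diagonal value λ forces every off-diagonal entry to equal (1 − λ)/(n_b − 1). The names `ele_matrix`, `n_b`, and `lam` are hypothetical and do not come from the cited papers.

```python
import numpy as np


def ele_matrix(n_b, lam):
    """Build an n_b x n_b exchangeable-linkage-error matrix.

    Diagonal entries equal lam (probability of a correct link); the
    remaining mass 1 - lam is spread evenly over the other n_b - 1 entries
    of each row, assuming rows sum to one.
    """
    if n_b < 2:
        raise ValueError("a block needs at least two records")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    off_diagonal = (1.0 - lam) / (n_b - 1)
    q = np.full((n_b, n_b), off_diagonal)
    np.fill_diagonal(q, lam)
    return q


Q = ele_matrix(n_b=4, lam=0.85)
print(Q)
print(Q.sum(axis=1))  # each row sums to one under the stated assumption
```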

Regression with linked datasets subject to linkage error

Wang, Ben-David, Diao, et al. 2021

WIREs Computational Stats


Data are often collected from multiple heterogeneous sources and are combined subsequently. In combining data, record linkage is an essential task for linking records in datasets that refer to the same entity. Record linkage is generally not error-free; there is a possibility that records belonging to different entities are linked or that records belonging to the same entity are missed. It is not advisable to simply ignore such errors because they can lead to data contamination and introduce bias in sample selection or estimation, which, in turn, can lead to misleading statistical results and conclusions. For a long while, this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis with linked datasets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so-called "Broken Sample" problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.

This article is categorized under: Statistical Models > Model Selection; Statistical and Graphical Methods of Data Analysis > Robust Methods; Statistical and Graphical Methods of Data Analysis > Multivariate Analysis
