A novel case‐control subsampling approach for rapid model exploration of large clustered binary data

Wright, S.; Ryan, Louise; Pham, Tung

doi:10.1002/sim.7543

Cited by 3 publications

(16 citation statements)

References 26 publications


“…patients for whom Y_{ki} = 0) and n_{1k} cases (i.e. patients for whom Y_{ki} = 1) from the N_{0k} non‐cases and N_{1k} cases respectively in the k-th hospital (Wright et al., ; Haneuse and Rivera‐Rodriguez, ). To realize the gains of an ODS design, we must resolve the fact that the individuals who have ‘complete’ data are no longer representative of the underlying population.…”

Section: Outcome‐dependent Sampling (mentioning)

confidence: 99%
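As a rough illustration of the cluster‐stratified case‐control (CSCC) draw described in the statement above, the following minimal sketch samples up to a fixed number of cases and non‐cases within each cluster. The function name `cscc_sample`, the `(cluster_id, y)` record layout, and the use of simple random sampling without replacement within each stratum are assumptions made for this example, not details taken from the cited papers.

```python
import random
from collections import defaultdict


def cscc_sample(records, n_cases, n_controls, seed=0):
    """Return sorted indices of a cluster-stratified case-control sample.

    records    : sequence of (cluster_id, y) pairs with y in {0, 1}
    n_cases    : target number of cases (y == 1) per cluster
    n_controls : target number of non-cases (y == 0) per cluster
    """
    rng = random.Random(seed)

    # Group record indices by (cluster, outcome) stratum.
    strata = defaultdict(list)
    for idx, (cluster, y) in enumerate(records):
        strata[(cluster, y)].append(idx)

    selected = []
    for (cluster, y), members in sorted(strata.items()):
        quota = n_cases if y == 1 else n_controls
        # Take the whole stratum when it is smaller than the quota.
        k = min(quota, len(members))
        selected.extend(rng.sample(members, k))
    return sorted(selected)


# Toy population: two hospitals with different numbers of cases.
records = [("A", 1)] * 5 + [("A", 0)] * 50 + [("B", 1)] * 2 + [("B", 0)] * 30
sample_idx = cscc_sample(records, n_cases=3, n_controls=10)
```

Because selection depends on the outcome, the sampled records are not representative of the population, which is the issue the quoted statement raises.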

“…As indicated in Section 1, Wright et al. () also considered estimation and inference for a GLMM based on data collected via a CSCC sampling scheme. Briefly, let \xi_k = \log \{ P(S_{ki} = 1 \mid Y_{ki} = 1) / P(S_{ki} = 1 \mid Y_{ki} = 0) \}.…”

Section: Outcome‐dependent Sampling (mentioning)

confidence: 99%
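The quantity \xi_k in the statement above is a log ratio of selection probabilities. A minimal sketch of how it might be computed for one cluster follows, assuming simple random sampling within each outcome stratum so that P(S_{ki} = 1 | Y_{ki} = 1) = n_{1k} / N_{1k} and P(S_{ki} = 1 | Y_{ki} = 0) = n_{0k} / N_{0k}. The function name `cluster_offset` and that sampling assumption are illustrative only.

```python
import math


def cluster_offset(n_cases, N_cases, n_controls, N_controls):
    """Log ratio of case to non-case selection probabilities for one cluster.

    Assumes n_cases of N_cases cases and n_controls of N_controls non-cases
    were drawn by simple random sampling within the cluster.
    """
    counts = (n_cases, N_cases, n_controls, N_controls)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive")
    p_case = n_cases / N_cases            # P(S = 1 | Y = 1)
    p_control = n_controls / N_controls   # P(S = 1 | Y = 0)
    return math.log(p_case / p_control)


# All 100 cases kept, 400 of 9900 non-cases kept.
xi = cluster_offset(100, 100, 400, 9900)
```

When cases and non‐cases are sampled at the same rate, the offset is zero and no correction is needed.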

“…Wright et al. () further considered this design in ‘big data’ settings when interest is in an LNGLMM, but actually fitting the model to the entire data set is computationally infeasible. They proposed that inference proceed by including cluster‐specific offset terms in the regression model, although the derivation of the offset terms hinges on the choice of the logit link function.…”

Section: Introduction (mentioning)

confidence: 99%
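The dependence on the logit link noted in the statement above can be seen from a standard Bayes' rule argument. The sketch below is not reproduced from the cited papers; it assumes selection depends only on the outcome and the cluster, and the symbols p_{ki}, \pi_{1k}, \pi_{0k}, x_{ki}, \beta and b_k are notation introduced here for illustration.

```latex
% Notation (assumed for this sketch):
%   p_{ki}   = P(Y_{ki} = 1 \mid x_{ki}, b_k)   population-level risk
%   \pi_{1k} = P(S_{ki} = 1 \mid Y_{ki} = 1)    selection probability for cases
%   \pi_{0k} = P(S_{ki} = 1 \mid Y_{ki} = 0)    selection probability for non-cases
\begin{align*}
P(Y_{ki} = 1 \mid S_{ki} = 1, x_{ki}, b_k)
  &= \frac{\pi_{1k}\, p_{ki}}{\pi_{1k}\, p_{ki} + \pi_{0k}\,(1 - p_{ki})}, \\
\frac{P(Y_{ki} = 1 \mid S_{ki} = 1, x_{ki}, b_k)}
     {P(Y_{ki} = 0 \mid S_{ki} = 1, x_{ki}, b_k)}
  &= \frac{\pi_{1k}}{\pi_{0k}} \cdot \frac{p_{ki}}{1 - p_{ki}}, \\
\operatorname{logit} P(Y_{ki} = 1 \mid S_{ki} = 1, x_{ki}, b_k)
  &= \operatorname{logit} p_{ki} + \xi_k
   = x_{ki}^{\top}\beta + b_k + \xi_k,
\qquad \xi_k = \log\frac{\pi_{1k}}{\pi_{0k}}.
\end{align*}
```

The selection effect multiplies the odds, so it becomes an additive constant only on the log‐odds scale. Under another link, such as the probit, the same shift does not reduce to a cluster‐specific constant added to the linear predictor.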

“…Second, we find that the offsetted regression approach of Wright et al. () significantly underestimates between‐hospital variability and hence is unsuitable for profiling. We thus adopt the PML approach and extend it to compute hospital‐specific quality measures.…”

Section: Introduction (mentioning)

confidence: 99%


Outcome-Dependent Sampling in Cluster-Correlated Data Settings with Application to Hospital Profiling

McGee, Schildcrout, Normand, et al. 2019

Journal of the Royal Statistical Society Series A: Statistics in Society


Summary: Hospital readmission is a key marker of quality of healthcare and an important policy measure, used by the Centers for Medicare and Medicaid Services to determine, in part, reimbursement rates. Currently, analyses of readmissions are based on a logistic–normal generalized linear mixed model that permits estimation of hospital‐specific measures while adjusting for case mix differences. Recent moves to identify and address healthcare disparities call for expanding case mix adjustment to include measures of socio‐economic status while minimizing additional burden to hospitals associated with collecting data on such measures. Towards resolving this dilemma, we propose that detailed socio‐economic data be collected on a subsample of patients via an outcome‐dependent sampling scheme, specifically the cluster‐stratified case–control design. Estimation and inference, for both the fixed and the random‐effects components, are performed via pseudo‐maximum‐likelihood wherein inverse probability weights are incorporated in the usual integrated likelihood to account for the design. In comprehensive simulations, cluster‐stratified case–control sampling proves to be an efficient design whenever interest lies in fixed or random effects of a generalized linear mixed model and covariates are unobserved or expensive to collect. The methods are motivated by and illustrated with an analysis of N = 889661 Medicare beneficiaries hospitalized between 2011 and 2013 with congestive heart failure at one of K = 3116 hospitals. Results highlight that the framework proposed provides a means of mitigating disparities in terms of which hospitals are indicated as being poor performers, relative to a naive analysis that fails to adjust for missing case mix variables.



“…What was particularly interesting in the Wright et al. () analysis was that they did a clustered analysis, treating donor center as the cluster. With a very simple adjustment via inclusion of an offset, models could be easily fit using R functions glm or gam using only a desktop computer.…”

Section: Extensions (mentioning)

confidence: 99%
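To make the offset adjustment mentioned above concrete, here is a toy sketch in plain Python rather than R. It fits an intercept‐only logistic model with a fixed offset to a case‐control subsample from a single cluster and checks that the population log‐odds is recovered. The counts, the function name `fit_intercept_with_offset`, and the bisection solver are assumptions made for this example; in practice one would pass the offset to a standard regression routine such as the R functions named in the quotation.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_intercept_with_offset(y, offsets, lo=-50.0, hi=50.0, iters=200):
    """Maximum-likelihood intercept for logit P(y = 1) = beta + offset.

    The score sum(y_i - expit(beta + offset_i)) is strictly decreasing in
    beta, so its root is located by bisection on [lo, hi].
    """
    def score(beta):
        return sum(yi - expit(beta + o) for yi, o in zip(y, offsets))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid   # fitted probabilities too small: move beta up
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical cluster: 100 cases and 9900 non-cases in the full data.
N_cases, N_controls = 100, 9900
# Subsample: keep every case and 400 non-cases.
n_cases, n_controls = 100, 400

# Offset: log ratio of the two selection probabilities.
xi = math.log((n_cases / N_cases) / (n_controls / N_controls))

y = [1] * n_cases + [0] * n_controls
beta_hat = fit_intercept_with_offset(y, [xi] * len(y))

# Log-odds of the outcome in the full cluster, for comparison.
population_logit = math.log(N_cases / N_controls)
```

Without the offset, the fitted intercept would equal the subsample log‐odds, log(100/400), which overstates the outcome rate in the full data.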

Statistical strategies for the analysis of massive data sets

Hwang, Ryan 2019

Biometrical J


The advent of the big data age has changed the landscape for statisticians. Public and private organizations alike these days are interested in capturing and analyzing complex customer data in order to improve their service and drive efficiency gains. However, the large volume of data involved often means that standard statistical methods fail and new ways of thinking are needed. Although great gains can be obtained through the use of more advanced computing environments or through developing sophisticated new statistical algorithms that handle data in a more efficient way, there are also many simpler things that can be done to handle large data sets in an efficient and intuitive manner. These include the use of distributed analysis methodologies, clever subsampling, data coarsening, and clever data reductions that exploit concepts such as sufficiency. These kinds of strategies represent exciting opportunities for statisticians to remain front and center in the data science world.


On The Interplay between Exposure Misclassification and Informative Cluster Size

McGee, Kioumourtzoglou, Weisskopf, et al. 2020

Journal of the Royal Statistical Society Series C: Applied Statistics


A recent multigenerational study of diethylstilbestrol and attention deficit hyperactivity disorder exhibited signs of both informative cluster size (the outcome was more prevalent in small families) and exposure misclassification (self‐report of familial diethylstilbestrol exposure was substantially mismeasured). Motivated by this, we study the effect of exposure misclassification when cluster size is potentially informative and, in particular, when misclassification is differential by cluster size. We find that misclassification in an exposure that is related to cluster size induces informativeness when cluster size would otherwise be non‐informative, and that misclassification that is differential by informative cluster size may attenuate, inflate or possibly reverse the sign of estimates. To mitigate these issues, we propose an observed likelihood correction for joint models of cluster size and outcomes, and an expected estimating equations correction. We evaluate these approaches in simulations and in application to the motivating data from the second Nurses' Health Study, NHS II.

