A split-and-conquer approach for analysis of extraordinarily large data
2014
DOI: 10.5705/ss.2013.088

Cited by 134 publications (164 citation statements)
References 0 publications
Citation types: 3 supporting, 161 mentioning, 0 contrasting
“…A partial list of references covering DC algorithms from a statistical perspective is Chen and Xie (2012), Zhang et al (2013), Kleiner et al (2014), Liu and Ihler (2014) and Zhao et al (2014a). The closest works to ours are Zhang et al (2013), Lee et al (2015) and Rosenblatt and Nadler (2016).…”
Section: Introduction (mentioning)
confidence: 93%
“…Although a large number of statistical methods and computational recipes have been developed to address various challenges for big data analytics, such as the subsampling-based methods (Liang et al., 2013; Kleiner et al., 2014; Ma et al., 2015) and divide-and-conquer techniques (Lin and Xi, 2011; Guha et al., 2012; Chen and Xie, 2014; Tang et al., 2019; Zhou and Song, 2017), little is known about statistical inference in streaming data analyses under dynamic data storage and incremental updates. This paper has filled the gap with the proposed renewable estimation and incremental inference.…”
Section: Discussion (mentioning)
confidence: 99%
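The excerpt above contrasts one-shot divide-and-conquer methods with streaming analyses that rely on incremental updates. The following is a minimal sketch of the generic idea of incremental updating via running sufficient statistics for least squares; it is not the renewable estimation method the excerpt refers to, and the batch sizes, dimension, and simulated data are illustrative assumptions.

```python
# Minimal sketch of incremental (streaming) updating for least squares:
# each arriving batch updates the running sufficient statistics X'X and X'y,
# so the estimate is refreshed without revisiting previously stored data.
# All sizes and the simulated data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
p = 5
beta_true = rng.normal(size=p)

xtx = np.zeros((p, p))   # running X'X
xty = np.zeros(p)        # running X'y

for batch in range(10):                      # streaming batches arrive one at a time
    X = rng.normal(size=(200, p))
    y = X @ beta_true + rng.normal(size=200)
    xtx += X.T @ X                           # incremental update of sufficient statistics
    xty += X.T @ y
    beta_hat = np.linalg.solve(xtx, xty)     # current estimate after this batch

print("final estimate:", np.round(beta_hat, 3))
```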
“…Chen and Xie (2014) consider a divide and conquer approach for generalized linear models (GLM) where both the sample size n and the number of covariates p are large, by incorporating variable selection via penalized regression into a subset processing step. More specifically, for p bounded or increasing to infinity slowly (p_n growing not faster than o(e^{n_k}), while the model size may increase at a rate of o(n_k)), they propose to first randomly split the data of size n into K blocks (each of size n_k = O(n/K)).…”
Section: Methods (mentioning)
confidence: 99%
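The excerpt above summarizes the split-and-conquer recipe: randomly split the data into K blocks, fit a penalized GLM with variable selection on each block, and combine the block results. The following is a minimal sketch under that description, assuming simulated data and an L1-penalized logistic regression per block; the combination step here (majority voting on selected variables followed by a plain average of block coefficients) is a simplification, not the paper's weighted combined estimator.

```python
# Minimal split-and-conquer sketch: random split into K blocks, L1-penalized
# logistic regression per block, then majority voting on selected variables
# and averaging of the surviving coefficients. Data and tuning are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p, K = 10_000, 20, 10
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                 # sparse truth: 3 active covariates

X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

blocks = np.array_split(rng.permutation(n), K)   # random split into K blocks
block_coefs = []
for idx in blocks:
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    fit.fit(X[idx], y[idx])                      # penalized fit on one block only
    block_coefs.append(fit.coef_.ravel())
block_coefs = np.array(block_coefs)

# Combine: keep a covariate if most blocks select it, then average coefficients.
selected = (np.abs(block_coefs) > 1e-8).sum(axis=0) > K / 2
combined = np.where(selected, block_coefs.mean(axis=0), 0.0)
print("selected covariates:", np.where(selected)[0])
print("combined estimate:", np.round(combined[selected], 3))
```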
“…Sound statistical procedures that are scalable computationally to massive datasets have been proposed (Jordan, 2013). Examples are subsampling-based approaches (Kleiner et al., 2014; Ma, Mahoney and Yu, 2013; Liang et al., 2013; Maclaurin and Adams, 2014), divide and conquer approaches (Lin and Xi, 2011; Chen and Xie, 2014; Song and Liang, 2014; Neiswanger, Wang and Xing, 2013), and online updating approaches (Schifano et al., 2015). From a computational perspective, much effort has been put into the most active open source statistical environment, R (R Core Team, 2014a).…”
Section: Introduction (mentioning)
confidence: 99%
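The excerpt above groups scalable procedures into subsampling-based, divide-and-conquer, and online updating approaches; the first of these is the only one not sketched earlier. The following is a minimal generic subsampling sketch, not any specific cited method: fit the model on several small random subsamples and use the spread of the subsample estimates to gauge variability. Sizes and simulated data are illustrative assumptions.

```python
# Generic subsampling sketch: repeatedly fit least squares on small random
# subsamples instead of the full data, then summarize the subsample estimates.
# All sizes and the simulated data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100_000, 4
beta_true = np.array([1.0, -0.5, 0.25, 0.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

b = 2_000                                        # subsample size, much smaller than n
estimates = []
for _ in range(50):
    idx = rng.choice(n, size=b, replace=False)   # one random subsample
    coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    estimates.append(coef)
estimates = np.array(estimates)

print("subsample mean estimate:", np.round(estimates.mean(axis=0), 3))
print("subsample spread (std):", np.round(estimates.std(axis=0), 3))
```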