Communication Efficient Distributed Kernel Principal Component Analysis

Balcan, Maria Florina; Liang, Yingyu; Song, Le; Woodruff, David P.; Xie, Bo

doi:10.1145/2939672.2939796

Cited by 35 publications

(64 citation statements)

References 21 publications

“…Second, even the implementation of simple methods is not straightforward when extremely large data sets are involved. In other words, devising and implementing a numerically efficient ‘Big Data PCA’ is a non‐trivial task (Balcan et al., ). At least two steps must be considered: adopting an appropriate machine learning model (e.g.…”

Section: Big Data Analytics (mentioning)

confidence: 97%

Big Data for weed control and crop protection

et al. 2017


Summary: Farmers have access to many data-intensive technologies to help them monitor and control weeds and pests. Data collection, data modelling and analysis, and data sharing have become core challenges in weed control and crop protection. We review the challenges and opportunities of Big Data in agriculture: the nature of data collected, Big Data analytics and tools to present the analyses that allow improved crop management decisions for weed control and crop protection. Big Data storage and querying incurs significant challenges, due to the need to distribute data across several machines, as well as due to constantly growing and evolving data from different sources. Semantic technologies are helpful when data from several sources are combined, which involves the challenge of detecting interactions of potential agronomic importance and establishing relationships between data items in terms of meanings and units. Data ownership is analysed using the ethical matrix method to identify the concerns of farmers, agribusiness owners, consumers and the environment. Big Data analytics models are outlined, together with numerical algorithms for training them. Advances and tools to present processed Big Data in the form of actionable information to farmers are reviewed, and a success story from the Netherlands is highlighted. Finally, it is argued that the potential utility of Big Data for weed control is large, especially for invasive, parasitic and herbicide-resistant weeds. This potential can only be realised when agricultural scientists collaborate with data scientists and when organisational, ethical and legal arrangements of data sharing are established.

“…Besides estimation, other distributed statistical technique may be of interests, such as the distributed principal component analysis (Balcan, Kanchanapally, Liang, & Woodruff, 2014), consensus-based distributed SVMs (Forero, Cano, & Giannakis, 2010), which utilizes ADMM (Boyd et al, 2011), and so on. Distributed version of topics like nonnegative matrix factorization, as a data analysis technique, high-dimensional structured nonparametric model, which is the sparse additive model (Fan, Feng, & Song, 2011;Ravikumar, Lafferty, Liu, & Wasserman, 2009), are also of interest.…”

Section: Related Work and Open Questions (mentioning)

confidence: 99%

Aggregated inference

Huo, Cao. 2018. WIREs Computational Stats


Aggregated inference on distributed data becomes more and more important due to the larger size of data collected in different industries. Modeling and inference are needed in the case where data cannot be obtained at a central location; aggregated statistical inference is a major tool to solve the aforementioned problems. In the literature, problems under the setting of regression model (more generally, M‐estimator) are extensively studied. There are at least two popular techniques for distributed estimation: (a) averaging estimators from local locations and (b) the one‐step approach, which combines the simple averaging estimator with a classical Newton's method (using the local Hessian matrices) to generate a “one‐step” estimator. It is proved that under certain assumptions, the above constructed estimators enjoy the same asymptotic properties as the centralized estimator, which is obtained as if all data were available at a central location. We review the aforementioned two major estimations. It can be seen that, in Big‐Data problems, dividing the data among multiple machines and then using the aggregation technique to solve the estimation problem in parallel can speed up the computation with little compromise of the quality of the estimators. We discuss potential extensions to other models, such as support vector machine, principal component analysis, and so on. Numerical examples are omitted due to the space limitation; they can be easily found in the literature. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Knowledge Discovery; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical Models > Fitting Models; Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms


“…With the consideration of sensor network applications, some DC methods have been proposed, such as a generic algorithm for distributed data clustering in sensor networks and the novel DKM algorithm for clustering observations collected by spatially distributed resource‐aware sensors. Recently, two K‐means‐based models, distributed PCA and K‐means and KPCA+ K‐means clustering, were developed based on the PCA concept and kernel PCA concept. Mashayekhi et al proposed GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed datasets.…”

Section: Data Mining Techniques In Distributed Environment (mentioning)

confidence: 99%

Data mining in distributed environment: a survey

Gan, Lin, Chao, et al. 2017. WIREs Data Min & Knowl



Due to the rapid growth of resource sharing, distributed systems have been developed to make use of shared computing resources. Data mining (DM) provides powerful techniques for finding meaningful and useful information from a very large amount of data, and has a wide range of real‐world applications. However, traditional DM algorithms assume that the data is centrally collected, memory‐resident, and static. It is challenging to manage large‐scale data and process it with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations. It becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms generally face problems and challenges such as memory limits, low processing ability, inadequate hard disk space, and so on. To solve the above problems, DM on distributed computing environment [also called distributed data mining (DDM)] has been emerging as a valuable alternative in many applications. In this study, a survey of state‐of‐the‐art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environment. WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216 This article is categorized under: Application Areas > Business and Industry; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining; Technologies > Computer Architectures for Data Mining
