The clustering problem, in its many variants, has numerous applications in operations research and computer science (e.g., in bioinformatics, image processing, and social network analysis). As data sets have grown rapidly, researchers have focused on designing algorithms for clustering problems in models of computation suited for large-scale computation such as MapReduce, Pregel, and streaming models. The k-machine model (Klauck et al., SODA 2015) is a simple, message-passing model for large-scale distributed graph processing. This paper considers three of the most prominent examples of clustering problems: the uncapacitated facility location problem, the p-median problem, and the p-center problem, and presents O(1)-factor approximation algorithms for these problems running in Õ(n/k) rounds in the k-machine model. These algorithms are optimal up to polylogarithmic factors because this paper also shows Ω(n/k) lower bounds for obtaining polynomial-factor approximation algorithms for these problems. These are the first results for clustering problems in the k-machine model. We assume that the metric provided as input for these clustering problems is only implicitly provided, as an edge-weighted graph. In a nutshell, our main technical contribution is to show that constant-factor approximation algorithms for all three clustering problems can be obtained by learning only a small portion of the input metric.
... problems in the recently proposed k-machine model [21], a synchronous, message-passing model for large-scale distributed computation. This model cleanly abstracts essential features of systems such as Pregel [24] and Giraph (see http://giraph.apache.org/) that have been designed for large-scale graph processing, allowing researchers to prove precise upper and lower bounds. One of the main features of the k-machine model is that the input, consisting of n items, is randomly partitioned across k machines.
Of particular interest are settings in which n is much larger than k. Communication occurs via bandwidth-restricted communication links between every pair of machines, and thus the underlying communication network is a size-k clique. For all three problems, we present constant-factor approximation algorithms that run in Õ(n/k) rounds in the k-machine model. We also show that these algorithms have optimal round complexity, to within polylogarithmic factors, by providing complementary Ω(n/k) lower bounds for polynomial-factor approximation algorithms. These are the first results on clustering problems in the k-machine model.
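The random input partition described above can be illustrated with a minimal sketch. It only simulates the assignment of n items to k machines uniformly at random and reports the per-machine load, which concentrates around n/k when n is much larger than k. The function names `random_partition` and `machine_loads` are hypothetical, and the sketch does not model rounds, bandwidth limits, or any of the paper's algorithms.

```python
import random
from collections import Counter


def random_partition(n, k, seed=None):
    """Assign each of n items to one of k machines uniformly at random.

    Returns a list whose i-th entry is the machine index hosting item i.
    """
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n)]


def machine_loads(assignment, k):
    """Count how many items each of the k machines received."""
    counts = Counter(assignment)
    return [counts.get(machine, 0) for machine in range(k)]


# Illustrative parameters (assumptions, not taken from the paper).
assignment = random_partition(n=10_000, k=10, seed=0)
loads = machine_loads(assignment, k=10)
print(loads)  # each entry should be close to n / k = 1000
```

With n = 10,000 and k = 10 the loads deviate from 1,000 only by a few standard deviations (about 30 items each), which is why the model can treat each machine as holding roughly n/k items.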
Objective: To determine whether Clostridioides difficile infection (CDI) exhibits spatiotemporal interaction and clustering. Design: Retrospective observational study. Setting: The University of Iowa Hospitals and Clinics. Patients: This study included 1,963 CDI cases, January 2005 through December 2011. Methods: We extracted location and time information for each case and ran the Knox, Mantel, and mean and maximum component size tests for time thresholds (T = 7, 14, and 21 days) and distance thresholds (D = 2, 3, 4, and 5 units; 1 unit = 5–6 m). All tests were implemented using Monte Carlo simulations, and random CDI cases were constructed by randomly permuting times of CDI cases 20,000 times. As a counterfactual, we repeated all tests on 790 aspiration pneumonia cases because aspiration pneumonia is a complication without environmental factors. Results: Results from the Knox test and mean component size test rejected the null hypothesis of no spatiotemporal interaction (P < .0001), for all values of T and D. Results from the Mantel test also rejected the hypothesis of no spatiotemporal interaction (P < .0003). The same tests showed no such effects for aspiration pneumonia. Our results from the maximum component size tests showed similar trends, but they were not consistently significant, possibly because CDI outbreaks attributable to the environment were relatively small. Conclusion: Our results clearly show spatiotemporal interaction and clustering among CDI cases and none whatsoever for aspiration pneumonia cases. These results strongly suggest that environmental factors play a role in the onset of some CDI cases. However, our results are not inconsistent with the possibility that many genetically unrelated CDI cases occurred during the study period.
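The Knox test with a Monte Carlo permutation of case times, as named in the Methods above, can be sketched as follows. This is a generic textbook-style version under stated assumptions, not the study's implementation: locations are treated as 2-D coordinates with Euclidean distance, the p-value uses the common (count + 1) / (permutations + 1) estimate, and the toy data and the names `knox_statistic` and `knox_test` are hypothetical.

```python
import math
import random


def knox_statistic(points, times, d_max, t_max):
    """Count case pairs that are close in both space (<= d_max) and time (<= t_max)."""
    count = 0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            close_in_space = math.dist(points[i], points[j]) <= d_max
            close_in_time = abs(times[i] - times[j]) <= t_max
            if close_in_space and close_in_time:
                count += 1
    return count


def knox_test(points, times, d_max, t_max, n_perm=999, seed=None):
    """Monte Carlo Knox test: permute case times while keeping locations fixed.

    Returns the observed statistic and a one-sided permutation p-value.
    """
    rng = random.Random(seed)
    observed = knox_statistic(points, times, d_max, t_max)
    shuffled = list(times)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if knox_statistic(points, shuffled, d_max, t_max) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (assumption): two small groups, each tight in space and in time.
points = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50), (50, 51)]
times = [1, 2, 3, 100, 101, 102]
observed, p_value = knox_test(points, times, d_max=2, t_max=7, n_perm=199, seed=1)
print(observed, p_value)
```

Permuting only the times preserves the spatial distribution of cases and the overall temporal distribution, so a large observed count relative to the permuted counts points to interaction between the two rather than to clustering in either dimension alone.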
We prove three new lower bounds for graph connectivity in the 1-bit broadcast congested clique model, BCC(1). First, in the KT-0 version of BCC(1), in which nodes are aware of neighbors only through port numbers, we show an Ω(log n) round lower bound for Connectivity even for constant-error randomized Monte Carlo algorithms. The deterministic version of this result can be obtained via the well-known "edge-crossing" argument, but the randomized version requires establishing new combinatorial results regarding the indistinguishability graph induced by inputs. In our second result, we show that the Ω(log n) lower bound extends to the KT-1 version of the BCC(1) model, in which nodes are aware of IDs of all neighbors, though our proof works only for deterministic algorithms. Since nodes know IDs of their neighbors in the KT-1 model, it is no longer possible to play "edge-crossing" tricks; instead we present a reduction from the 2-party communication complexity problem Partition, in which Alice and Bob are given two set partitions on [n] and are required to determine if the join of these two set partitions equals the trivial one-part set partition. While our KT-1 Connectivity lower bound holds only for deterministic algorithms, in our third result we extend this Ω(log n) KT-1 lower bound to constant-error Monte Carlo algorithms for the closely related ConnectedComponents problem. We use information-theoretic techniques to obtain this result. All our results hold for the seemingly easy special case of Connectivity in which an algorithm has to distinguish an instance with one cycle from an instance with multiple cycles. Our results showcase three rather different lower bound techniques and lay the groundwork for further improvements in lower bounds for Connectivity in the BCC(1) model.
* A short version of this paper has appeared as a brief announcement in PODC 2019.
1 We use "w.h.p." as short for "with high probability", which refers to a probability that is at least 1 − 1/n^c for c ≥ 1.
arXiv:1905.09016v1 [cs.DC] 22 May 2019
... for Connectivity in the BCC(log n) model, due to Jurdziński and Nowicki [JN17], is deterministic and it runs in O(log n / log log n) rounds. This contrast between BCC(b) and CC(b) is not surprising, given how much larger the overall bandwidth in CC(b) is compared to BCC(b). Becker et al. [Bec+16] show that the pair-wise set disjointness problem can be solved in O(1) rounds in CC(1), but needs Ω(n) rounds in BCC(1). But, despite the fact that Connectivity is such a fundamental problem, no non-trivial lower bound is known for Connectivity in BCC(1). In fact, prior to this paper, we could not even rule out an O(1)-round Connectivity algorithm in BCC(1). Lower bound arguments in "congested" distributed computing models typically use a "bottleneck" technique [CKP17; CK18; DS+11; DKO14; Fis+18; HP15]. At a high level, this technique consists of showing that there is a low bandwidth cut in the communication network across which a high volume of information has to flow in order to solve...
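The Partition problem used in the KT-1 reduction asks whether the join of two set partitions is the trivial one-part partition. A minimal centralized sketch of that predicate is given here, assuming elements are numbered 0 to n − 1 rather than 1 to n and using a union-find structure. The function name `join_is_trivial` is hypothetical, and this is neither a communication protocol nor the reduction from the paper; it only makes the definition concrete.

```python
def join_is_trivial(n, partition_a, partition_b):
    """Return True if the join of two set partitions of {0, ..., n-1} has one block.

    The join is the finest partition that is coarser than both inputs; its
    blocks are obtained by merging any blocks that share an element.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_x] = root_y

    # Merge all elements that share a block in either partition.
    for partition in (partition_a, partition_b):
        for block in partition:
            members = list(block)
            for element in members[1:]:
                union(members[0], element)

    return len({find(x) for x in range(n)}) == 1


# Alice's partition and two candidate partitions for Bob (toy inputs).
alice = [{0, 1}, {2, 3}]
bob_joining = [{1, 2}, {0}, {3}]
bob_not_joining = [{0, 1}, {2}, {3}]
print(join_is_trivial(4, alice, bob_joining))      # True
print(join_is_trivial(4, alice, bob_not_joining))  # False
```

Viewing each block as a set of edges among its elements, the join is trivial exactly when the union of the two edge sets forms a connected graph, which is the link between Partition and Connectivity.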