Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta (2016) framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a 'good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties: to achieve optimal cost, disconnected components must be separated first, and in 'structureless' graphs, i.e., cliques, all clusterings achieve the same cost. We take an axiomatic approach to defining 'good' objective functions for both similarity- and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions (which includes the one introduced by Dasgupta) that have the property that when the input admits a 'natural' ground-truth hierarchical clustering, the ground-truth clustering attains the optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better and faster algorithms for hierarchical clustering. For similarity-based hierarchical clustering, Dasgupta (2016) showed that a simple recursive sparsest-cut-based approach achieves an O(log^{3/2} n)-approximation on worst-case inputs. We give a more refined analysis of the algorithm and show that it in fact achieves an O(√(log n))-approximation. This improves upon the LP-based O(log n)-approximation of Roy and Pokutta (2016). For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor-2 approximation, and provide a simple and better algorithm that gives a factor-3/2 approximation. This aims at explaining the success of this heuristic in practice. Finally, we consider a 'beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function also has desirable properties for these inputs, and we provide a simple algorithm that, for graphs generated according to this model, yields a 1 + o(1) factor approximation.
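As a concrete illustration of the objective discussed above, the following is a minimal Python sketch (not code from the paper) of Dasgupta's similarity-based cost: each pair of leaves pays its similarity weight times the number of leaves under their lowest common ancestor in the candidate tree. The nested-tuple tree encoding and the toy similarity matrix are assumptions made purely for this example.

```python
def dasgupta_cost(tree, sim):
    """Cost of a hierarchical clustering `tree` (nested tuples of leaf ids)
    under similarities sim[(i, j)] for i < j: the sum over pairs of
    w(i, j) * |leaves(lca(i, j))|."""
    total = 0.0

    def visit(node):
        nonlocal total
        if isinstance(node, int):          # a leaf
            return [node]
        child_leaves = [visit(c) for c in node]
        size = sum(len(cl) for cl in child_leaves)
        # Pairs split at this node have their lowest common ancestor here,
        # so each such pair pays w(i, j) * (number of leaves below this node).
        for a in range(len(child_leaves)):
            for b in range(a + 1, len(child_leaves)):
                for i in child_leaves[a]:
                    for j in child_leaves[b]:
                        total += sim[(min(i, j), max(i, j))] * size
        return [x for cl in child_leaves for x in cl]

    visit(tree)
    return total


# Toy input: two tightly similar pairs {0,1} and {2,3}, weak cross-similarity.
sim = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.1, (0, 3): 0.1, (1, 2): 0.1, (1, 3): 0.1}
print(dasgupta_cost(((0, 1), (2, 3)), sim))  # ground-truth-like tree -> 5.6
print(dasgupta_cost(((0, 2), (1, 3)), sim))  # mismatched tree -> 9.2
```

On this toy input the tree that separates the weakly-similar pairs first achieves the lower cost, which is the behaviour an admissible objective is meant to guarantee.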
We consider the problem of deterministic load balancing of tokens in the discrete model. A set of n processors is connected into a d-regular undirected network. In every time step, each processor exchanges some of its tokens with each of its neighbors in the network. The goal is to minimize the discrepancy between the number of tokens on the most-loaded and the least-loaded processor as quickly as possible. Rabani et al. (1998) present a general technique for the analysis of a wide class of discrete load balancing algorithms. Their approach is to bound the deviation between the actual loads of a discrete balancing algorithm and the distribution generated by a related Markov chain. The Markov chain can also be regarded as the underlying model of a continuous diffusion algorithm. Rabani et al. showed that after time T = O(log(Kn)/μ), any algorithm of their class achieves a discrepancy of O(d log n/μ), where μ is the spectral gap of the transition matrix of the graph and K is the initial load discrepancy in the system. In this work we identify some natural additional conditions on deterministic balancing algorithms, resulting in a class of algorithms reaching a smaller discrepancy. This class contains well-known algorithms, e.g., the rotor-router. Specifically, we introduce the notion of cumulatively fair load-balancing algorithms, where in any interval of consecutive time steps the total number of tokens sent out over an edge by a node is the same (up to constants) for all adjacent edges. We prove that algorithms which are cumulatively fair and where every node retains a sufficient part of its load in each step achieve a discrepancy of O(min{d√(log n/μ), d√n}) in time O(T). We also show that in general neither of these assumptions may be omitted without increasing the discrepancy. We then show, by a combinatorial potential reduction argument, that any cumulatively fair scheme satisfying some additional assumptions achieves a discrepancy of O(d) almost as quickly as the continuous diffusion process. This positive result applies to some of the simplest and most natural discrete load balancing schemes.
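To make the class of algorithms concrete, here is a small simulation sketch (an illustration under our own assumptions, not the paper's pseudocode) of a rotor-router-style round on a d-regular graph: each node retains about half of its tokens and sends the remainder to its neighbors in round-robin order, so over any window the numbers of tokens sent across adjacent edges differ only by constants, i.e., the scheme is cumulatively fair in the sense above.

```python
def balancing_round(load, adj, pointer):
    """One synchronous round. load[v] = tokens at v, adj[v] = neighbor list,
    pointer[v] = index of the next neighbor in v's round-robin order."""
    incoming = {v: 0 for v in load}
    for v, tokens in list(load.items()):
        keep = (tokens + 1) // 2              # retain roughly half the load
        for _ in range(tokens - keep):        # send the rest round-robin
            u = adj[v][pointer[v]]
            pointer[v] = (pointer[v] + 1) % len(adj[v])
            incoming[u] += 1
        load[v] = keep
    for v in load:
        load[v] += incoming[v]
    return max(load.values()) - min(load.values())   # current discrepancy


# Toy run on a cycle (a 2-regular graph) with all tokens initially on node 0.
n = 8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
load = {v: 0 for v in range(n)}
load[0] = 64
pointer = {v: 0 for v in range(n)}
for _ in range(30):
    disc = balancing_round(load, adj, pointer)
print("discrepancy after 30 rounds:", disc)
```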
Coalescing random walks is a fundamental stochastic process in which a set of particles perform independent discrete-time random walks on an undirected graph. Whenever two or more particles meet at a given node, they merge and continue as a single random walk. The coalescence time is defined as the expected time until only one particle remains, starting from one particle at every node. Despite recent progress, such as by Cooper, Elsässer, Ono, and Radzik [CEOR13] and Cooper, Frieze, and Radzik [CFR09a], the coalescence time for graphs such as binary trees, d-dimensional tori, hypercubes and, more generally, vertex-transitive graphs remains unresolved. We provide a powerful toolkit that results in tight bounds for various topologies, including the aforementioned ones. The meeting time is defined as the worst-case expected time required for two random walks to arrive at the same node at the same time. As a general result, we establish that for graphs whose meeting time is only marginally larger than the mixing time (a factor of log^2 n), the coalescence time of n random walks equals the meeting time up to constant factors. This upper bound is complemented by the construction of a graph family demonstrating that this result is best possible up to constant factors. For almost-regular graphs, we bound the coalescence time by the hitting time, resolving the discrete-time variant of a conjecture by Aldous for this class of graphs. Finally, we prove that for any graph the coalescence time is bounded by O(n^3) (which is tight for the barbell graph); surprisingly, even this basic question about the coalescence time had not been answered before this work. By duality, our results yield bounds on the voter model and hence on the consensus time in arbitrary undirected graphs. We also establish a new bound on the hitting time and cover time of regular graphs, improving and tightening previous results by Broder and Karlin [BK89], as well as those by Aldous and Fill [AF02].
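The process itself is easy to simulate; the following sketch (illustrative only, and using lazy walks to sidestep parity issues on bipartite graphs, an assumption not made in the text above) starts one particle per node, moves all surviving particles simultaneously, and merges any particles that land on the same node.

```python
import random

def coalescence_time(adj, rng=random):
    """Steps until one particle remains, starting with one particle per node.
    Each particle performs a lazy random walk: stay put with probability 1/2,
    otherwise move to a uniformly random neighbor."""
    positions = set(adj)
    steps = 0
    while len(positions) > 1:
        moved = set()
        for v in positions:
            moved.add(v if rng.random() < 0.5 else rng.choice(adj[v]))
        positions = moved          # particles on the same node merge
        steps += 1
    return steps


# Toy estimate: average coalescence time on a cycle of length 16.
n = 16
cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
print(sum(coalescence_time(cycle) for _ in range(50)) / 50)
```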
A fundamental problem in distributed computing is the distribution of requests to a set of uniform servers without a centralized controller. Classically, such problems are modeled as static balls-into-bins processes, where m balls (tasks) are to be distributed among n bins (servers). In a seminal work, Azar et al. [4] proposed the sequential strategy Greedy[d] for m = n: each ball queries the load of d random bins and is allocated to a least loaded of them. Azar et al. showed that d = 2 yields an exponential improvement compared to d = 1. Berenbrink et al. [7] extended this to m ≫ n, showing that for d = 2 the maximal load difference is independent of m (in contrast to the d = 1 case). We propose a new variant of an infinite balls-into-bins process. In each round an expected number of λn new balls arrive and are distributed (in parallel) to the bins. Subsequently, each non-empty bin deletes one of its balls. This setting models a set of servers processing incoming requests, where clients can query a server's current load but receive no information about parallel requests. We study the Greedy[d] distribution scheme in this setting and show a strong self-stabilizing property: for any arrival rate λ = λ(n) < 1, the system load is time-invariant. Moreover, for any (even super-exponential) round t, the maximum system load is (w.h.p.) O((1/(1−λ)) · log(n/(1−λ))) for d = 1 and O(log(n/(1−λ))) for d = 2. In particular, Greedy[2] has an exponentially smaller system load for high arrival rates.
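The round structure of this process is simple enough to prototype directly; the sketch below (our own illustration, with arrivals drawn as a Binomial(n, λ) batch so that the expected number of new balls is λn) compares the maximum load of Greedy[1] and Greedy[2] at a high arrival rate.

```python
import random

def round_step(bins, lam, d, rng=random):
    """One round: a batch of new balls arrives, each placed by Greedy[d],
    then every non-empty bin deletes one ball."""
    n = len(bins)
    arrivals = sum(1 for _ in range(n) if rng.random() < lam)  # ~Binomial(n, lam)
    for _ in range(arrivals):
        probes = [rng.randrange(n) for _ in range(d)]
        target = min(probes, key=lambda i: bins[i])    # least loaded of d probes
        bins[target] += 1
    for i in range(n):
        if bins[i] > 0:
            bins[i] -= 1                               # each non-empty bin serves one ball
    return max(bins)


# Toy comparison of d = 1 vs. d = 2 at arrival rate lambda = 0.95.
n, lam, rounds = 1000, 0.95, 2000
for d in (1, 2):
    bins, peak = [0] * n, 0
    for _ in range(rounds):
        peak = max(peak, round_step(bins, lam, d))
    print(f"d={d}: maximum load observed = {peak}")
```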