This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly.In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed.The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution.The performance bounds all have a similar character. For example, for the indexing problem with n = |t| and m = |p|, the query time for k substitutions is O(m + (c 1 log n) k k! + # matches), with a data structure of size O(n (c 2 log n) k k! ) and a preprocessing time of O(n (c 2 log n) k k!), where c1, c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.
We present a new data structure that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension. Our data structure has linear size and supports insertions and deletions in O(log n) time, and finds a (1 + )-approximate nearest neighbor in time O(log n) + (1/ ) O(1) . The search and update times hide multiplicative factors that depend on the doubling dimension; the space does not. These performance times are independent of the aspect ratio (or spread) of the points.
For a set S of points in a metric space, a t-spanner is a graph on the points of S such that between any pair of points there is a path in the spanner whose total length is at most t times the actual distance between the points. In this paper, we consider points residing in a metric space of doubling dimension λ, and show how to construct a dynamic (1 + ε)-spanner with constant degree and O(log n) update time (when λ and ε are taken as constants). Our update time is optimal up to a constant.
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error.We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].
The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems. We design for this problem a randomized polynomial-time algorithm that computes a (1 + ε)-approximation to the optimal tour, for any fixed ε > 0, in TSP instances that form an arbitrary metric space with bounded intrinsic dimension.The celebrated results of Arora [Aro98] and Mitchell [Mit99] prove that the above result holds in the special case of TSP in a fixed-dimensional Euclidean space. Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP depends on the dimensionality of the space and not on its specific geometry. This result resolves a problem that has been open since the quasi-polynomial time algorithm of Talwar [Tal04].
We present the first sample compression algorithm for nearest neighbors with non-trivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our performance bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.
It has long been known that d-dimensional Euclidean point sets admit (1+ε)-stretch spanners with lightness W E = ε −O(d) , that is total edge weight at most W E times the weight of the minimum spaning tree of the set [DHN93]. Whether or not a similar result holds for metric spaces with low doubling dimension has remained an important open problem, and has resisted numerous attempts at resolution. In this paper, we resolve the question in the affirmative, and show that doubling spaces admit (1 + ε)-stretch spanners with lightness W D = (ddim /ε) O(ddim) . Important in its own right, our result also implies a much faster polynomial-time approximation scheme for the traveling salesman problem in doubling metric spaces, improving upon the bound presented in [BGK12]. * Ariel, Israel.
We study adaptive data-dependent dimensionality reduction in the context of supervised learning in general metric spaces. Our main statistical contribution is a generalization bound for Lipschitz functions in metric spaces that are doubling, or nearly doubling. On the algorithmic front, we describe an analogue of PCA for metric spaces: namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Our approach thus leverages the dual benefits of low dimensionality: (1) more efficient algorithms, e.g., for proximity search, and (2) more optimistic generalization bounds.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.