Metric search techniques can be usefully characterised by the time at which distance calculations are performed during a query. Most exact search mechanisms use a "just-in-time" approach where distances are calculated as part of a navigational strategy. An alternative is to use a "one-time" approach, where distances to a fixed set of reference objects are calculated at the start of each query. These distances are typically used to re-cast data and queries into a different space where querying is more efficient, allowing an approximate solution to be obtained. In this paper we use a "one-time" approach for an exact search mechanism. A fixed set of reference objects is used to define a large set of regions within the original space, and each query is assessed with respect to the definition of these regions. Data is then accessed if, and only if, it is useful for the calculation of the query solution. As dimensionality increases, the number of defined regions must increase, but the memory required for the exclusion calculation does not. We show that the technique gives excellent performance over the SISAP benchmark data sets, and most interestingly we show how increases in dimensionality may be countered by relatively modest increases in the number of reference objects used.

1 Context

To set a formal context, we are interested in searching a (large) finite set of objects S which is a subset of an infinite set U, where (U, d) is a metric space: that is, an ordered pair (U, d), where U is a domain of objects and d is a total distance function d : U × U → R, satisfying postulates of non-negativity, identity, symmetry, and triangle inequality [20]. The general requirement is to efficiently find members of S which are similar to an arbitrary member of U given as a query, where the distance function d gives the only way by which any two objects may be compared. There are many important practical examples captured by this mathematical framework, see for example [16, 20].
The simplest type of similarity query is the range search query: for some threshold t, based on a query q ∈ U, the solution set is R = {s ∈ S | d(q, s) ≤ t}. The essence of metric search is to spend time pre-processing the finite set S so that solutions to queries can be efficiently calculated using only distances among objects. In all cases therefore, distances between the data and selected

This is a post-peer-review, pre-copyedit version of a paper published in Marchand-Maillet S., Silva Y., Chávez E.
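As a point of reference, the range search query defined above can be written as a plain sequential scan. This is only a minimal sketch: the function names `range_search` and `euclidean`, and the choice of Euclidean distance over 2D tuples, are assumptions made for illustration and do not come from the paper. Any function satisfying the metric postulates could be passed as `d`.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """One possible metric d; assumed here purely for illustration."""
    return math.dist(a, b)


def range_search(data: Sequence[T], q: T, t: float,
                 d: Callable[[T, T], float]) -> List[T]:
    """Return R = {s in S | d(q, s) <= t} by comparing q with every s."""
    return [s for s in data if d(q, s) <= t]


# Hypothetical data set S and query q.
S = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
result = range_search(S, (0.0, 0.0), 1.5, euclidean)
```

The scan computes one distance per data item; pre-processing S is an attempt to avoid most of those calculations.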
We define BitPart (Bitwise representations of binary Partitions), a novel exact search mechanism intended for use in high-dimensional spaces. In outline, a fixed set of reference objects is used to define a large set of regions within the original space, and each data item is characterised according to its containment within these regions. In contrast with other mechanisms, only a subset of this information is selected, according to the query, before a search within the recast space is performed. Partial data representations are accessed only if they are known to be potentially useful towards the calculation of the exact query solution. Our mechanism requires Ω(N log N) space to evaluate a query, where N is the cardinality of the data, and therefore does not scale as well as previously defined mechanisms with low-dimensional data. However, it has recently been shown that, for a nearest neighbour search in high dimensions, a sequential scan of the data is essentially unavoidable. This result has been suspected for a long time, and has been referred to as the curse of dimensionality in this context. In the light of this result, the compromise achieved by this work is to make the best possible use of the available fast memory, and to offer great potential for parallel query evaluation. To our knowledge, it gives the best compromise currently known for performing exact search over data whose dimensionality is too high to allow the useful application of metric indexing, yet is still sufficiently low to give at least some traction from the metric and supermetric properties.
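The outline above can be illustrated with a small sketch. It is not the authors' implementation: it assumes every region is a ball (a reference object plus a radius), stores one bit per data item per region in a Python integer, and the names `build_bitsets` and `bitpart_range_search` are hypothetical. For each query, the distances to the reference objects are computed once. A region's bitset is read only when the triangle inequality shows that the query ball lies wholly inside or wholly outside that region. The surviving candidates are then checked against the original distance, so the result is exact.

```python
import math
import random


def build_bitsets(data, regions, d):
    """Pre-processing: bit i of bitsets[j] is 1 iff data[i] lies in region j."""
    bitsets = []
    for pivot, radius in regions:
        bits = 0
        for i, s in enumerate(data):
            if d(pivot, s) <= radius:
                bits |= 1 << i
        bitsets.append(bits)
    return bitsets


def bitpart_range_search(data, regions, bitsets, q, t, d):
    """Exact range search; uses only the bitsets that can exclude data."""
    n = len(data)
    all_ones = (1 << n) - 1
    candidates = all_ones
    for (pivot, radius), bits in zip(regions, bitsets):
        dq = d(q, pivot)  # "one-time" distance to a reference object
        if dq > radius + t:
            # Query ball is disjoint from the region: solutions lie outside.
            candidates &= ~bits & all_ones
        elif dq <= radius - t:
            # Query ball is contained in the region: solutions lie inside.
            candidates &= bits
        # Otherwise the region straddles the query ball; its bitset is skipped.
    # Verify the remaining candidates with the original distance function.
    return [data[i] for i in range(n)
            if (candidates >> i) & 1 and d(q, data[i]) <= t]


# Hypothetical demo: 200 random 2D points, 8 ball regions of radius 0.3.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
regions = [(random.choice(data), 0.3) for _ in range(8)]
bitsets = build_bitsets(data, regions, math.dist)
query, threshold = (0.5, 0.5), 0.1
found = bitpart_range_search(data, regions, bitsets, query, threshold, math.dist)
```

In this sketch the bitwise AND over whole integers stands in for the column-wise operations over bit vectors. The per-region decisions are independent of one another, which is one way to read the potential for parallel evaluation mentioned above.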