Traditionally, provenance and lineage mainly referred to query results. We take a more holistic approach. We consider a system in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. Therefore, we consider both direct lineage (dependence of a query result on database tuples directly used in solving the query) and distant lineage (dependence on older tuples that caused the existence
One can approximate the lineage of a database (DB) tuple using a small set of low-dimensional vectors. To identify actual lineage tuples using these vector sets, given the set of vectors of a target tuple, one needs to locate "close" sets of vectors associated with the lineage tuples. We first consider a similarity measure between two sets $A$ and $B$ of vectors that balances the average and maximum cosine distance between pairs of vectors, one from set $A$ and one from set $B$. The proposed similarity measure is intuitive and permutation invariant. To realize this measure in practice, we need an approximate search algorithm that, given a set of vectors $A$ and sets of vectors $B_1, \dots, B_n$, quickly locates the $k$ closest sets $B_{i_1}, \dots, B_{i_k}$ that maximize the similarity measure. For the case where all sets are singletons, that is, each is a single vector, there are known efficient approximate search algorithms, e.g., approximate versions of tree search algorithms, locality-sensitive hashing (LSH), vector quantization (VQ), and proximity graph algorithms. We utilize the mathematical properties of the cosine distance measure to transform the set-set search problem into a vector-vector search problem. However, this transformation cannot handle the Euclidean-based version of the similarity measure. For that version, we devise a more elaborate transformation, and we present algorithms for the general case, with sets of differing cardinalities. The underlying idea in both transformations is to encode a set of vectors $A$ via $|A|$ "long" independent representative vectors. We are then able to transform the set-set search problem into the well-studied approximate (ordinary) vector search problem. For both cosine-based and Euclidean-based similarity measures, the proposed approximate search achieves significant performance gains over an optimized, exact search on vector sets.
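To make the set-set search problem concrete, here is a minimal sketch of the kind of exact, brute-force baseline the abstract compares against. It is not the authors' implementation. The abstract does not give the precise formula, so the equal weighting of the maximum and average terms, the use of cosine similarity (rather than distance), and all function names (`set_similarity`, `k_closest_sets`, `unit_sum`, `avg_cosine_via_single_vectors`) are assumptions made for illustration. The last function shows only one simple property of the cosine measure: the average pairwise cosine similarity between two sets equals the dot product of the sums of their unit-normalized vectors, divided by |A||B|, so the average term alone reduces to a single vector-vector comparison. The "long" vector encoding described in the abstract, which also covers the maximum term, is not reproduced here.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def set_similarity(A, B, w_max=0.5, w_avg=0.5):
    """Blend of the maximum and average pairwise cosine similarity.

    The weights are illustrative assumptions, not values from the source.
    The result does not depend on the order of vectors in A or B.
    """
    sims = [cosine(a, b) for a, b in product(A, B)]
    avg = sum(sims) / len(sims)
    return (w_max * max(sims) + w_avg * avg) / (w_max + w_avg)


def k_closest_sets(A, candidates, k):
    """Exact search: indices of the k candidate sets most similar to A."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: set_similarity(A, candidates[i]),
        reverse=True,
    )
    return order[:k]


def unit_sum(S):
    """Sum of the unit-normalized vectors in S (one vector per set)."""
    total = [0.0] * len(S[0])
    for v in S:
        norm = math.hypot(*v)
        for j, x in enumerate(v):
            total[j] += x / norm
    return total


def avg_cosine_via_single_vectors(A, B):
    """Average pairwise cosine similarity computed from one vector per set."""
    sa, sb = unit_sum(A), unit_sum(B)
    return sum(x * y for x, y in zip(sa, sb)) / (len(A) * len(B))


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    candidates = [
        [[1.0, 0.0]],
        [[-1.0, 0.0], [0.0, -1.0]],
        [[1.0, 1.0]],
    ]
    print(k_closest_sets(A, candidates, k=2))
```

The brute-force search evaluates every pair of vectors for every candidate set, which is the cost an approximate vector-vector index is meant to avoid.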