Similarity join on high-dimensional data is a primitive operation. Given a data set and a specific distance measure, it finds all pairs of data items whose distance does not exceed a given threshold. As the scale and dimensionality of data sets increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages for iterative computation, we adopt it as our platform for similarity joins on high-dimensional data sets. To resolve problems in existing work, such as data imbalance, data duplication, and redundant computation, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce dimensionality using a symbolic aggregation method. We then vertically partition the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters prune false positives at an early stage. Finally, the partial results generated by each partition are aggregated and verified to produce the final results. The proposed algorithm significantly improves the efficiency of similarity joins on high-dimensional data. To verify its efficiency and scalability, we implemented it using MapReduce and Spark. We compared our methods with existing work on public data sets, and the experimental results show that the new methods are more efficient and scalable under different running environments.

KEYWORDS
high-dimensional data, piecewise aggregation, similarity join, symbolic aggregation, Spark, vertical partition

INTRODUCTION
In this era of big data, data acquisition occurs ever more quickly, the scale of data is increasing rapidly, and the types of data are complex and diverse. This brings new challenges to data analysis and processing.
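The dimension-reduction step can be illustrated with a minimal sketch of symbolic aggregate approximation (SAX): piecewise aggregate approximation (PAA) averages each equal-width segment of a vector, and each average is then mapped to a symbol via fixed breakpoints. This is an illustrative sketch, not the paper's exact procedure; the breakpoints below are the standard Gaussian breakpoints for an alphabet of size 4, and the function names are ours.

```python
import bisect

# Standard SAX Gaussian breakpoints for an alphabet of size 4 (assumed parameters).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def paa(vector, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    n = len(vector)
    means = []
    for i in range(segments):
        lo, hi = i * n // segments, (i + 1) * n // segments
        means.append(sum(vector[lo:hi]) / (hi - lo))
    return means

def sax(vector, segments):
    """Map each PAA mean to a symbol by locating it among the breakpoints."""
    return "".join(ALPHABET[bisect.bisect_left(BREAKPOINTS, m)]
                   for m in paa(vector, segments))

# An 8-dimensional vector reduced to a 4-symbol word.
word = sax([0.1, 0.3, -1.2, -0.8, 1.5, 1.1, -0.1, 0.2], 4)  # "cadc"
```

Because two vectors with the same symbolic word are likely close, joins can first compare short words and only compute full high-dimensional distances for surviving candidates.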
As a basic operation, the similarity join has been applied widely in many fields, such as friend recommendation,1 pattern recognition,2 clustering,3 image similarity matching,4 outlier detection,5 and spatial databases.6 A similarity join is essentially a pairwise comparison: it is computationally intensive, and for a naive implementation the processing time grows quadratically with data volume. To improve execution efficiency, more effective methods are needed to eliminate unnecessary operations in large-scale data processing. Most traditional algorithms use a spatial index, such as a B+-tree, R-tree, or Z-order curve, to improve the performance of a similarity join, but these algorithms do not scale to large, high-dimensional data sets. The MapReduce7 framework based on Hadoop has emerged as a primary choice for big-data processing: it is a programming model that makes it easy to develop scalable parallel applications for large-scale data. For computationally intensive operations such as the similarity join, scholars have recently proposed parallel kNN-join algorithms using MapReduce, such as H-BNLJ, H-BRJ,8 and PGBJ...
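For concreteness, the quadratic cost that motivates these parallel methods can be seen in a minimal, single-machine sketch of a naive epsilon-join (Euclidean distance; the function name and data are ours, for illustration only):

```python
import math
from itertools import combinations

def similarity_join(points, eps):
    """Naive epsilon-join: compares every pair, i.e. O(n^2) distance computations."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [(p, q) for p, q in combinations(points, 2) if dist(p, q) <= eps]

pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
pairs = similarity_join(pts, 1.0)  # only the first two points lie within eps
```

The filtering and partitioning techniques discussed above aim precisely to avoid evaluating `dist` for the vast majority of these pairs.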