Similarity join on high-dimensional data is a primitive operation. Given a data set and a specific distance measure, it finds all pairs of data items whose distance does not exceed a given threshold. As the scale and dimensionality of data sets increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages for iterative computation, we adopt it as our platform for similarity joins on high-dimensional data sets. To resolve problems in existing work, such as data imbalance, data duplication, and redundant computation, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce dimensionality using a symbolic aggregation method. We then vertically partition the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters prune false positives at an early stage. Finally, the partial results generated by each partition are aggregated and verified to produce the final results. The proposed algorithm significantly improves the efficiency of similarity joins on high-dimensional data. To verify its efficiency and scalability, we implemented it using MapReduce and Spark. We compared our methods with existing work on public data sets, and the experimental results show that the new methods are more efficient and scalable under different running environments.

KEYWORDS
high-dimensional data, piecewise aggregation, similarity join, symbolic aggregation, Spark, vertical partition

INTRODUCTION
In this era of big data, data acquisition occurs ever more quickly, the scale of data is increasing rapidly, and the types of data are complex and diverse. This brings new challenges to data analysis and processing.
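The dimension-reduction step can be illustrated with a minimal sketch of symbolic aggregate approximation (SAX): piecewise aggregate approximation (PAA) averages each equal-width segment of a vector, and each average is then mapped to a symbol via fixed breakpoints. This is an illustrative sketch, not the paper's exact procedure; the breakpoints below are the standard Gaussian breakpoints for an alphabet of size 4, and the function names are ours.

```python
import bisect

# Standard SAX Gaussian breakpoints for an alphabet of size 4 (assumed parameters).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def paa(vector, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    n = len(vector)
    means = []
    for i in range(segments):
        lo, hi = i * n // segments, (i + 1) * n // segments
        means.append(sum(vector[lo:hi]) / (hi - lo))
    return means

def sax(vector, segments):
    """Map each PAA mean to a symbol by locating it among the breakpoints."""
    return "".join(ALPHABET[bisect.bisect_left(BREAKPOINTS, m)]
                   for m in paa(vector, segments))

# An 8-dimensional vector reduced to a 4-symbol word.
word = sax([0.1, 0.3, -1.2, -0.8, 1.5, 1.1, -0.1, 0.2], 4)  # "cadc"
```

Because two vectors with the same symbolic word are likely close, joins can first compare short words and only compute full high-dimensional distances for surviving candidates.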
As a basic operation, the similarity join has been applied widely in many fields, such as friend recommendation,1 pattern recognition,2 clustering,3 image similarity matching,4 outlier detection,5 and spatial databases.6 A similarity join is essentially a pairwise comparison: it is computationally intensive, and for a naive implementation the processing time grows quadratically with data volume. To improve execution efficiency, more effective methods are needed to eliminate unnecessary operations in large-scale data processing. Most traditional algorithms use a spatial index, such as a B+-tree, R-tree, or Z-order curve, to improve the performance of a similarity join, but these algorithms do not scale to large, high-dimensional data sets. The MapReduce7 framework based on Hadoop has emerged as a primary choice for big-data processing: it is a programming model that makes it easy to develop scalable parallel applications for large-scale data. For computationally intensive operations such as the similarity join, scholars have recently proposed parallel kNN-join algorithms using MapReduce, such as H-BNLJ, H-BRJ,8 and PGBJ...
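For concreteness, the quadratic cost that motivates these parallel methods can be seen in a minimal, single-machine sketch of a naive epsilon-join (Euclidean distance; the function name and data are ours, for illustration only):

```python
import math
from itertools import combinations

def similarity_join(points, eps):
    """Naive epsilon-join: compares every pair, i.e. O(n^2) distance computations."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [(p, q) for p, q in combinations(points, 2) if dist(p, q) <= eps]

pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
pairs = similarity_join(pts, 1.0)  # only the first two points lie within eps
```

The filtering and partitioning techniques discussed above aim precisely to avoid evaluating `dist` for the vast majority of these pairs.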