2019
DOI: 10.1002/cpe.5339

Similarity joins for high‐dimensional data using Spark

Abstract: Similarity join on high-dimensional data is a primitive operation. It is used to find all data pairs in a given data set whose distance is no more than a given threshold according to a specific distance measure. As the scale and dimensionality of the data set increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages in iterative computation, we adopted it as our platform to perform similarity joins on high-dimensional data sets. In order t…
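The abstract describes a threshold-based similarity join on Spark. As a point of reference for the operation itself (not the paper's algorithm), the sketch below is a minimal brute-force similarity join in Scala on a Spark RDD: build the Cartesian product of the point set with itself and filter by Euclidean distance. The object name, threshold value, and toy data are illustrative assumptions; the quadratic pairing shown here is exactly the cost that the paper's techniques aim to reduce.

```scala
import org.apache.spark.sql.SparkSession

object NaiveSimilarityJoin {
  // Euclidean distance between two dense vectors stored as arrays.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("naive-similarity-join")
      .master("local[*]") // illustrative; point at a real cluster in practice
      .getOrCreate()
    val sc = spark.sparkContext

    val epsilon = 0.5 // hypothetical distance threshold

    // Toy high-dimensional points keyed by an id.
    val points = sc.parallelize(Seq(
      (1L, Array(0.10, 0.20, 0.30)),
      (2L, Array(0.11, 0.22, 0.33)),
      (3L, Array(5.00, 5.00, 5.00))
    ))

    // Cartesian product followed by a distance filter: O(n^2) candidate pairs.
    val pairs = points.cartesian(points)
      .filter { case ((idA, _), (idB, _)) => idA < idB } // report each pair once
      .filter { case ((_, a), (_, b)) => euclidean(a, b) <= epsilon }
      .map { case ((idA, _), (idB, _)) => (idA, idB) }

    pairs.collect().foreach(println)
    spark.stop()
  }
}
```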

Cited by 7 publications (8 citation statements)
References 23 publications (30 reference statements)
“…Compared with Hadoop-based solutions, Chen et al [32] has faster computing speed and better scalability. Rong et al [33] proposed a new similarity join algorithm called symbolic aggregation and vertical decomposition (SAVD) using Spark.…”
Section: B. Vector Similarity Join
mentioning, confidence: 99%
“…Several works have explored efficient ways to perform set similarity joins. The parallel theta join using MapReduce to join two data sets, as in relational databases, is explored by Okcan and Riedewald and by Zhang et al. Similarity joins on high-dimensional data using Spark are studied by exploiting data representation and vertical partition techniques. Our approach, however, aims to find correlated segments from two time series.…”
Section: Related Work
mentioning, confidence: 99%
“…The parallel theta join using MapReduce to join two data sets, as in relational databases, is explored by Okcan and Riedewald 26 and by Zhang et al. 26,27 Similarity joins on high-dimensional data using Spark are studied by exploiting data representation and vertical partition techniques. 28 Our approach, however, aims to find correlated segments from two time series. Bendre and Manthalkar 29 introduced predictive analytics approaches. The FFT algorithm is widely used in many industries for data transformation.…”
Section: Related Work
mentioning, confidence: 99%
“…They adopted dimension reduction with a symbolic aggregation method and a vertical partition operation to implement join operations for high-dimensional data. Spark caches the intermediate results to reduce data input/output, so the efficiency of iteration is improved. However, how to select the right RDDs to cache the partitions in limited memory is an open issue.…”
Section: Related Work
mentioning, confidence: 99%
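The statement above refers to dimension reduction via symbolic aggregation. As a rough, generic illustration of what symbolic aggregation looks like (a SAX-style sketch with an assumed 4-letter alphabet and standard-normal breakpoints, not the paper's SAVD algorithm), consider:

```scala
// Generic SAX-style symbolic aggregation: z-normalise a vector, average it
// into equal-width segments (PAA), then map each segment mean to a letter
// using breakpoints of a standard normal distribution.
object SymbolicAggregation {
  // Breakpoints that split N(0,1) into four equiprobable regions (assumed here).
  private val breakpoints = Array(-0.6745, 0.0, 0.6745)
  private val alphabet = "abcd"

  def toSymbols(v: Array[Double], segments: Int): String = {
    val mean = v.sum / v.length
    val std  = math.sqrt(v.map(x => (x - mean) * (x - mean)).sum / v.length)
    val norm = if (std == 0) v.map(_ => 0.0) else v.map(x => (x - mean) / std)

    // Piecewise Aggregate Approximation: mean of each equal-width segment.
    val width = math.ceil(norm.length.toDouble / segments).toInt
    val paa = norm.grouped(width).map(seg => seg.sum / seg.length).toArray

    // Each segment mean falls into one of the alphabet buckets.
    paa.map(x => alphabet(breakpoints.count(_ <= x))).mkString
  }

  def main(args: Array[String]): Unit =
    println(toSymbols(Array(0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 0.0, 0.1), 4)) // one letter per segment
}
```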
“…Spark caches the intermediate results to reduce data input/output, so the efficiency of iteration is improved. 29,33 However, how to select the right RDDs to cache the partitions in limited memory is an open issue. The shuffle operations are still required for iterative applications.…”
Section: Related Work Comparisons
mentioning, confidence: 99%
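The caching behaviour described here can be made concrete with a minimal sketch, assuming a toy iterative job (the dataset, storage level, and loop are illustrative, not taken from the cited works): persisting an RDD keeps its partitions in memory after the first action, so later iterations reuse them instead of recomputing the lineage.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedIteration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-iteration")
      .master("local[*]") // illustrative
      .getOrCreate()
    val sc = spark.sparkContext

    // A dataset that several iterations will reuse.
    val base = sc.parallelize(1 to 1000000).map(_.toDouble)

    // Keep the partitions in memory after the first action; without this,
    // every pass would recompute the lineage from the original source.
    base.persist(StorageLevel.MEMORY_ONLY)

    var threshold = 0.0
    for (_ <- 1 to 5) {
      val t = threshold // capture a stable value for the closure
      threshold = base.filter(_ > t).mean() // reads the cached partitions
    }
    println(threshold)

    base.unpersist()
    spark.stop()
  }
}
```

Which RDDs to persist, and at what storage level, when memory is limited is exactly the open issue the statement points to.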