2019
DOI: 10.1002/cpe.5339

Similarity joins for high‐dimensional data using Spark

Abstract: Similarity join on high-dimensional data is a primitive operation. It is used to find all data pairs in a given data set whose distance is no more than a given threshold according to a specific distance measure. As the scale and dimensionality of the data set increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages in iterative computation, we adopted it as our platform to perform similarity joins on high-dimensional data sets. In order t…
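The abstract describes a threshold-based similarity join on Spark. As a point of reference for the operation itself (not the paper's algorithm), the sketch below is a minimal brute-force similarity join in Scala on a Spark RDD: build the Cartesian product of the point set with itself and filter by Euclidean distance. The object name, threshold value, and toy data are illustrative assumptions; the quadratic pairing shown here is exactly the cost that the paper's techniques aim to reduce.

```scala
import org.apache.spark.sql.SparkSession

object NaiveSimilarityJoin {
  // Euclidean distance between two dense vectors stored as arrays.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("naive-similarity-join")
      .master("local[*]") // illustrative; point at a real cluster in practice
      .getOrCreate()
    val sc = spark.sparkContext

    val epsilon = 0.5 // hypothetical distance threshold

    // Toy high-dimensional points keyed by an id.
    val points = sc.parallelize(Seq(
      (1L, Array(0.10, 0.20, 0.30)),
      (2L, Array(0.11, 0.22, 0.33)),
      (3L, Array(5.00, 5.00, 5.00))
    ))

    // Cartesian product followed by a distance filter: O(n^2) candidate pairs.
    val pairs = points.cartesian(points)
      .filter { case ((idA, _), (idB, _)) => idA < idB } // report each pair once
      .filter { case ((_, a), (_, b)) => euclidean(a, b) <= epsilon }
      .map { case ((idA, _), (idB, _)) => (idA, idB) }

    pairs.collect().foreach(println)
    spark.stop()
  }
}
```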

Cited by 7 publications (8 citation statements)
References 23 publications (30 reference statements)
“…Compared with Hadoop-based solutions, Chen et al [32] has faster computing speed and better scalability. Rong et al [33] proposed a new similarity join algorithm called symbolic aggregation and vertical decomposition (SAVD) using Spark.…”
Section: B. Vector Similarity Join
mentioning, confidence: 99%
“…Several works have explored efficient ways to perform set similarity joins. The parallel theta join using MapReduce to join two data sets, as in relational databases, is explored by Okcan and Riedewald and by Zhang et al. Similarity joins on high-dimensional data using Spark are studied by exploiting data representation and vertical partition techniques. Our approach, however, aims to find correlated segments from two time series.…”
Section: Related Work
mentioning, confidence: 99%
“…The parallel theta join using MapReduce to join two data sets, as in relational databases, is explored by Okcan and Riedewald 26 and by Zhang et al. 26,27 Similarity joins on high-dimensional data using Spark are studied by exploiting data representation and vertical partition techniques. 28 Our approach, however, aims to find correlated segments from two time series. Bendre and Manthalkar 29 introduced predictive analytics approaches. The FFT algorithm is widely used in many industries for data transformation.…”
Section: Related Work
mentioning, confidence: 99%
“…They adopted dimension reduction with a symbolic aggregation method and a vertical partition operation to implement join operations for high-dimensional data. Spark caches the intermediate results to reduce data input/output, so the efficiency of iteration is improved. However, how to select the right RDDs to cache the partitions in limited memory is an open issue.…”
Section: Related Work
mentioning, confidence: 99%
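The statement above refers to dimension reduction via symbolic aggregation. As a rough, generic illustration of what symbolic aggregation looks like (a SAX-style sketch with an assumed 4-letter alphabet and standard-normal breakpoints, not the paper's SAVD algorithm), consider:

```scala
// Generic SAX-style symbolic aggregation: z-normalise a vector, average it
// into equal-width segments (PAA), then map each segment mean to a letter
// using breakpoints of a standard normal distribution.
object SymbolicAggregation {
  // Breakpoints that split N(0,1) into four equiprobable regions (assumed here).
  private val breakpoints = Array(-0.6745, 0.0, 0.6745)
  private val alphabet = "abcd"

  def toSymbols(v: Array[Double], segments: Int): String = {
    val mean = v.sum / v.length
    val std  = math.sqrt(v.map(x => (x - mean) * (x - mean)).sum / v.length)
    val norm = if (std == 0) v.map(_ => 0.0) else v.map(x => (x - mean) / std)

    // Piecewise Aggregate Approximation: mean of each equal-width segment.
    val width = math.ceil(norm.length.toDouble / segments).toInt
    val paa = norm.grouped(width).map(seg => seg.sum / seg.length).toArray

    // Each segment mean falls into one of the alphabet buckets.
    paa.map(x => alphabet(breakpoints.count(_ <= x))).mkString
  }

  def main(args: Array[String]): Unit =
    println(toSymbols(Array(0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 0.0, 0.1), 4)) // one letter per segment
}
```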
“…Spark caches the intermediate results to reduce data input/output, so the efficiency of iteration is improved. 29,33 However, how to select the right RDDs to cache the partitions in limited memory is an open issue. The shuffle operations are still required for iterative applications.…”
Section: Related Work Comparisons
mentioning, confidence: 99%
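The caching behaviour described here can be made concrete with a minimal sketch, assuming a toy iterative job (the dataset, storage level, and loop are illustrative, not taken from the cited works): persisting an RDD keeps its partitions in memory after the first action, so later iterations reuse them instead of recomputing the lineage.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedIteration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-iteration")
      .master("local[*]") // illustrative
      .getOrCreate()
    val sc = spark.sparkContext

    // A dataset that several iterations will reuse.
    val base = sc.parallelize(1 to 1000000).map(_.toDouble)

    // Keep the partitions in memory after the first action; without this,
    // every pass would recompute the lineage from the original source.
    base.persist(StorageLevel.MEMORY_ONLY)

    var threshold = 0.0
    for (_ <- 1 to 5) {
      val t = threshold // capture a stable value for the closure
      threshold = base.filter(_ > t).mean() // reads the cached partitions
    }
    println(threshold)

    base.unpersist()
    spark.stop()
  }
}
```

Which RDDs to persist, and at what storage level, when memory is limited is exactly the open issue the statement points to.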