Efficient parallel set-similarity joins using MapReduce

Vernica, Rares; Carey, Michael J.; Li, Chen

doi:10.1145/1807167.1807222

Cited by 390 publications

(327 citation statements)

References 18 publications

Supporting

Mentioning

308

Contrasting

Unclassified

“…Exact set similarity join has been extensively studied in the literature [18,19,20,21,28,31,32,33,34,35,40,44,45]. As we have introduced in Section 3, existing solutions all follow the filtering-verification framework and can be divided into two categories based on the filtering mechanism, namely, prefix-filter based algorithms and partition-filter based algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

Leveraging set relations in exact set similarity join

et al. 2017

Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs through scanning indexes in the filtering phase, and reports those similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find out is quite effective on improving the algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore the set relations in different levels to reduce the overall computational costs. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic to the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear to the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup against the state-of-the-art algorithms.
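
As a rough illustration of the filtering-verification framework the abstract above describes, the following Python sketch builds an inverted index to generate candidate pairs and then verifies each candidate. It is not the paper's algorithm: it assumes Jaccard similarity, a self-join over one collection, a threshold greater than zero, and a plain inverted index with no prefix or partition filter. The names `jaccard` and `similarity_self_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_self_join(sets, threshold):
    """Return sorted (i, j, sim) triples with i < j and sim >= threshold.

    Assumes threshold > 0, so any qualifying pair shares at least one token.
    """
    # Filtering phase: map each token to the ids of the sets containing it.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for token in s:
            index[token].append(i)

    # Every pair of sets that shares a token becomes a candidate.
    # Ids are appended in ascending order, so each pair has i < j.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(ids, 2))

    # Verification phase: compute the exact similarity of each candidate.
    results = []
    for i, j in sorted(candidates):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results
```

With a threshold of 0.6, the sets `{a, b, c}` and `{a, b, c, d}` pass verification (similarity 0.75), while `{a, b, c}` and `{a, b, d}` are generated as candidates but rejected (similarity 0.5).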

“…Other work focus on processing exact set similarity join in a distributed environment using MapReduce [33,35,40], which is not the focus of this paper.…”

Section: Related Work (mentioning)

confidence: 99%

Leveraging set relations in exact set similarity join

et al. 2017

“…This work did not consider the specifics of MR and focused on parallel matching while blocking was not performed in parallel. In addition to our own approaches utilized in Dedoop, there are a few further proposals to employ MR for ER (e.g., [24,25]). These approaches do not support advanced features such as load balancing or redundancy-free multi-pass blocking.…”

Section: Related Work (mentioning)

confidence: 99%

“…We therefore vary the cluster size for entity resolution from 1 up to 100 nodes. We use the CiteseerX dataset 3 from [24] containing 1,385,532 publication records. To test Dedoop's load balancing, we apply both Sorted Neighborhood (SN) in combination with the RepSN algorithm [14] as well as Standard Blocking (SB) in combination with the PairRange algorithm [13].…”

Section: Scalability (mentioning)

confidence: 99%

Parallel Entity Resolution with Dedoop

Kolb, Rahm 2012

Datenbank Spektrum

We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.

“…Many real-world tasks are expressible in this model. Programmers find the system easy to use and hundreds of MapReduce programs have been implemented; and around one thousand MapReduce jobs are executed on Google clusters every day [3,10,23]. For better understanding and simplicity, we tried to keep the MapReduce explanation figures as simple as possible.…”

Section: Introduction (mentioning)

confidence: 99%

Finding Key Persons on Social Media by Using MapReduce Skyline

Zaman, Siddique, Annisa et al. 2017

IJNC

This study considers the problem of selecting a small number of important persons from social media. Skyline query has been utilized for selecting key persons. Based on certain criteria from social media, this query selects persons who are not dominated by any other. Owing to the complex structure of social media, selecting a key person is more complicated and its application is quite different from conventional skyline queries. We need to consider various metrics in the social media. In addition, social media contains massive data, and the data increase is huge. It is a collection of online communication channels dedicated to community-based inputs, interactions, content sharing, and collaboration. We use MapReduce framework to speed up the computation and introduce parallelism in the processing. An extensive set of experiments shows that the analysis of social activities, social relationships, and socially shared contents helps finding a key person. The experimental results also confirm the efficiency and scalability of our algorithm on a synthetic dataset.
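
The dominance test behind a skyline query, as described in the abstract above, can be sketched in a few lines of Python. This is a single-machine illustration only, not the paper's MapReduce algorithm. It assumes each person is a tuple of numeric scores where larger is better, and the names `dominates` and `skyline` are hypothetical.

```python
def dominates(p, q):
    """True if p is at least as good as q on every metric
    and strictly better on at least one (larger is better)."""
    at_least_as_good = all(a >= b for a, b in zip(p, q))
    strictly_better = any(a > b for a, b in zip(p, q))
    return at_least_as_good and strictly_better


def skyline(points):
    """Return the points that no other point dominates, in input order.

    Naive quadratic scan: each point is compared against all others.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For the score tuples `(5, 1)`, `(3, 3)`, `(1, 5)`, `(2, 2)` and `(4, 0)`, the last two are dominated by `(3, 3)` and `(5, 1)` respectively, so the skyline keeps only the first three.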
