A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication

Bianco, Guilherme Dal; Galante, Renata; Gonçalves, Marcos André; Canuto, Sérgio D.; Heuser, Carlos A.

doi:10.1109/tkde.2015.2416734

Cited by 24 publications

(11 citation statements)

References 24 publications


“…A traditional method of producing data blocks is defining one of the attributes as the blocking criterion or key [4]. Only records that have the same blocking key will be inserted into the same block, reducing substantially the number of comparisons.…”

Section: Blocking Methods (mentioning)

confidence: 99%
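The key-based blocking described in the statement above can be illustrated with a minimal sketch. It is not the method of the cited paper or of the citing work: the records, the choice of `zip` as the blocking key, and the helper names `block_by_key` and `candidate_pairs` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key):
    """Group record ids by the value of one attribute (the blocking key)."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[record[key]].append(record_id)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of record ids that fall in the same block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hypothetical toy data; "zip" is an assumed blocking key.
records = {
    1: {"name": "Ana Silva", "zip": "90010"},
    2: {"name": "Anna Silva", "zip": "90010"},
    3: {"name": "Carlos Souza", "zip": "30110"},
    4: {"name": "C. Souza", "zip": "30110"},
    5: {"name": "Maria Lima", "zip": "70040"},
}

blocks = block_by_key(records, "zip")
pairs = list(candidate_pairs(blocks))

# Without blocking, every record is compared with every other one.
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {all_pairs}: {pairs}")
```

With this toy input only records that share a zip code are compared, so the candidate set shrinks from all possible pairs to the pairs inside each block. The trade-off is that a true duplicate whose blocking key differs (for example, a mistyped zip code) is never compared.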

“…However, such a method results in a quadratic cost of processing which would be non-practical in the case of a large dataset. In this context, blocking appears as an alternative to reduce the search space by only processing the records that have some indication of representing a duplicate [4].…”

Section: Related Work (mentioning)

confidence: 99%

“…Thus, it is important for the blocking process that represents the largest processing slice [4], to be efficient enough to not result in processing delays.…”

Section: Related Work (mentioning)

confidence: 99%

“…This initial experimentation illustrates that Redblock is able to promote a deduplication technique with high efficiency, finding all duplicate pairs as well as the classification method. Table 7 states the efficiency of Redblock spouts and bolts regarding time consumed to be executed. More specifically, it shows the average time that a tuple spends to be executed by each step.…”

Section: Effectiveness (mentioning)

confidence: 99%


Redblock: a tool for online deduplication on large datasets

Pimentel, Vicente, Bianco

2017

RBCA


Online data deduplication aims to identify records that represent the same purpose in a continuous data flow environment. It must be able to process varying volumes of information with high effectiveness and no delays. This paper introduces Redblock, a tool for real-time data deduplication that uses a distributed platform for online processing combined with a blocking method based on an inverted index. In the experimental evaluation, Redblock showed promising preliminary results in terms of efficiency and effectiveness on one database.

Keywords: Data Integration, Online Deduplication, Blocking.


“…Instead, we may consider direct matching of place names, which can be done using deep neural networks [3,15,19,30,40,45,49]. However, these models are known to be more suitable for longer sentences with more complex structures, rather than the short simple place names, and they require large amounts of training data, which is extremely hard to acquire for place deduplication [7,9,31]. Finally, by considering additional place attributes like location, ad-hoc models have been developed based on heuristic feature engineering [8,10,22], but their flexibility is limited to take more different place attributes, such as address and category.…”

Section: NLP: Entity Resolution (mentioning)

confidence: 99%

Place Deduplication with Embeddings

Yang, Hoang, Mikolov

2019

The World Wide Web Conference


Thanks to the advancing mobile location services, people nowadays can post about places to share visiting experience on-the-go. A large place graph not only helps users explore interesting destinations, but also provides opportunities for understanding and modeling the real world. To improve coverage and flexibility of the place graph, many platforms import places data from multiple sources, which unfortunately leads to the emergence of numerous duplicated places that severely hinder subsequent location-related services. In this work, we take the anonymous place graph from Facebook as an example to systematically study the problem of place deduplication: We carefully formulate the problem, study its connections to various related tasks that lead to several promising basic models, and arrive at a systematic two-step data-driven pipeline based on place embedding with multiple novel techniques that works significantly better than the state-of-the-art.
