Abstract: Data deduplication is the process of discovering multiple representations of the same entity in an information system. Blocking has been a benchmark technique for avoiding pair-wise record comparisons in data deduplication. Standard blocking (SB) aims to place potential duplicate records in the same block on the basis of a blocking key. Afterwards, detailed comparisons are made only among the records residing in the same block. The selection of the blocking key is a tedious process that involves exponential …
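The cost saving described above can be illustrated with a minimal sketch of standard blocking. The records, the surname attribute used as the blocking key, and all values below are hypothetical; the point is only that comparisons drop from quadratic (all pairs) to within-block pairs:

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (id, surname, city). Surname serves as the blocking key here;
# both the data and the key choice are illustrative assumptions.
records = [
    ("r1", "Smith", "NY"), ("r2", "Smith", "NY"),
    ("r3", "Garcia", "SF"), ("r4", "Garcia", "SF"),
]

# Exhaustive deduplication compares every pair: quadratic in the record count.
all_pairs = list(combinations([r[0] for r in records], 2))

# Standard blocking: group records by the blocking key, then compare
# only within each block.
blocks = defaultdict(list)
for rid, surname, city in records:
    blocks[surname].append(rid)
blocked_pairs = [p for ids in blocks.values() for p in combinations(ids, 2)]

print(len(all_pairs), len(blocked_pairs))  # 6 pairs reduced to 2
```

With four records, blocking cuts six candidate pairs down to two, and the saving grows quadratically with the dataset size; the catch, as the abstract notes, is that a poor blocking key (e.g. a frequently misspelled attribute) scatters true duplicates across different blocks.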
“…However, NLP also negatively affects the efficiency of the blocking task as a whole since it is necessary to consult the word vector for each token. Thus, among the possible strategies to handle noisy data, LSH is the most promising in terms of results [20,32,38].…”
Section: Blocking in the Noisy-data Context (mentioning)
confidence: 99%
“…Following the BLAST idea, the work in [32] applies LSH in order to hash the attribute values and enable the generation of high-quality blocks (i.e., blocks that contain a significant number of entities with high chances of being considered similar/matches), even with the presence of noise in the attribute values. In [38], the Locality-Sensitive Blocking (LSB) strategy is proposed. LSB applies LSH to standard blocking techniques in order to group similar entities without requiring the selection of blocking keys.…”
Section: Blocking in the Noisy-data Context (mentioning)
confidence: 99%
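The LSH idea quoted above — hashing attribute values so that similar entities land in the same block without choosing a blocking key — can be sketched with MinHash signatures over character shingles and banding. This is a generic illustration, not the implementation of [32] or of LSB in [38]; the records, parameters, and helper names are assumptions:

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Character k-grams; small typos leave most shingles intact."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(tokens, num_hashes=16):
    """One minimum per seeded hash function; equal positions approximate
    the Jaccard similarity of the underlying shingle sets."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes)
    ]

def lsh_blocks(records, bands=4, rows=4):
    """Split each signature into bands; records agreeing on any full band
    fall into the same block, so no blocking key needs to be selected."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes=bands * rows)
        for b in range(bands):
            blocks[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rid)
    return [ids for ids in blocks.values() if len(ids) > 1]

records = {
    1: "John Smith, New York",
    2: "Jon Smith, New York",   # typo variant of record 1
    3: "Maria Garcia, Lisbon",
}
candidate_blocks = lsh_blocks(records)
```

Because blocking operates on hashed shingles rather than exact key values, the noisy record 2 retains a high probability of sharing a band with record 1, which is precisely the noise tolerance the cited works exploit.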
“…In contrast to the previously mentioned works, our work not only focuses on noisy data but also allows the proposed incremental blocking technique to handle streaming and noisy data simultaneously. Although the works in [20,32,38] do not explore aspects such as incremental processing or streaming data, the idea behind the application of LSH to minimize the negative effects of noisy data on the blocking techniques can also be applied to the proposed technique to expand its applications. Therefore, this work adapts the application of LSH (in blocking techniques) to the contexts of distributed computing, incremental processing, and streaming data.…”
Section: Blocking in the Noisy-data Context (mentioning)
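The incremental/streaming setting described above can be sketched as a blocker that indexes each entity as it arrives and emits candidates only from the buckets the new entity touches. This toy class is an illustrative assumption, not the paper's algorithm (it uses plain token buckets rather than LSH, and ignores distribution):

```python
from collections import defaultdict

class IncrementalBlocker:
    """Toy schema-agnostic incremental blocker: every attribute value is
    tokenized regardless of schema, and an arriving entity is compared only
    against entities already sharing at least one token bucket."""

    def __init__(self):
        self.buckets = defaultdict(set)   # token -> ids of entities seen so far
        self.entities = {}

    def add(self, eid, attributes):
        """Process one streamed entity; return the ids of its candidate matches."""
        tokens = {t.lower() for v in attributes.values() for t in str(v).split()}
        candidates = set()
        for t in tokens:
            candidates |= self.buckets[t]   # collect co-bucket entities first
            self.buckets[t].add(eid)        # then index the new entity
        self.entities[eid] = attributes
        return candidates
```

For example, after `add(1, {"name": "John Smith"})`, a later `add(2, {"name": "John Doe"})` returns `{1}` because both share the token `john`; no blocking key was selected and no pass over previously indexed entities was repeated, which is the incremental property the quoted passage targets.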
Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.