Themis Palpanas scite author profile

In the context of entity resolution (ER) in highly heterogeneous, noisy, user-generated entity collections, practically all block building methods employ redundancy to achieve high effectiveness. This practice, however, results in a high number of pairwise comparisons, with a negative impact on efficiency. Existing block processing strategies aim at discarding unnecessary comparisons at no cost in effectiveness. In this paper, we systemize blocking methods for clean-clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers: the effectiveness layer encompasses methods for building overlapping blocks with small likelihood of missed matches; the efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons, having a controllable impact on the number of detected duplicates. We map to our framework all relevant existing methods for creating and processing blocks in the context of HHIS, and additionally propose two novel techniques: attribute clustering blocking and comparison scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
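The trade-off described in this abstract, where redundant overlapping blocks raise the number of pairwise comparisons, can be illustrated with a minimal sketch. It assumes simple token blocking over two duplicate-free collections (clean-clean ER); the function names `token_blocks` and `comparison_counts` and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import product


def token_blocks(left, right):
    """Build one block per token, keeping only tokens seen in both collections."""
    index = defaultdict(lambda: (set(), set()))
    for side, collection in enumerate((left, right)):
        for entity_id, text in collection.items():
            for token in text.lower().split():
                index[token][side].add(entity_id)
    # A block is useful for clean-clean ER only if it spans both collections.
    return {tok: blk for tok, blk in index.items() if blk[0] and blk[1]}


def comparison_counts(blocks):
    """Return (comparisons summed over all blocks, number of distinct pairs)."""
    total = sum(len(l) * len(r) for l, r in blocks.values())
    distinct = {pair for l, r in blocks.values() for pair in product(l, r)}
    return total, len(distinct)


# Hypothetical toy records, for illustration only.
left = {"a1": "john smith london", "a2": "maria rossi rome"}
right = {"b1": "j. smith london", "b2": "m. rossi rome"}

blocks = token_blocks(left, right)
total, distinct = comparison_counts(blocks)
print(sorted(blocks), total, distinct)
```

In this toy input each matching pair co-occurs in two blocks, so half of the block-level comparisons are repeats; discarding such repeats is the kind of saving a block processing step aims for.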


Meta-Blocking: Taking Entity Resolution to the Next Level

Papadakis

Koutrika

Palpanas

et al. 2014

IEEE Trans. Knowl. Data Eng.

113

179


Survey on mining subjective data on the web

Tsytsarau

Palpanas

2011

Data Min Knowl Disc

350

178


iSAX 2.0: Indexing and Mining One Billion Time Series

et al. 2010


There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series in the order of hundreds of millions to billions. However, all relevant techniques that have been proposed in the literature so far have not considered any data collections much larger than one million time series. In this paper, we describe iSAX 2.0, a data structure designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of this kind specifically tailored to a time series index. We show how our method allows mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.


Beyond one billion time series: indexing and mining very large time series collections with iSAX2+

et al. 2013


Comparative analysis of approximate blocking techniques for entity resolution

et al. 2016


Entity Resolution is a core task for merging data collections. Due to its quadratic complexity, it typically scales to large volumes of data through blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. There are numerous blocking methods, and the aim of this work is to offer a comprehensive empirical survey, extending the dimensions of comparison beyond what is commonly available in the literature. We consider 17 state-of-the-art blocking methods and use 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency. We also investigate their scalability over a corpus of 7 established synthetic datasets that range from 10,000 to 2 million entities.
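The blocking idea in this abstract (compare only entities that share a block, accepting some missed matches) can be sketched as follows. This is a minimal illustration, not one of the 17 surveyed methods: the three-character prefix key, the helper names, the toy records and the ground-truth matches are all assumptions made for the example. Pair completeness and reduction ratio are computed here as recall over the true matches and as the fraction of brute-force comparisons avoided.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(entities, key_fn):
    """Group entity ids by a blocking key derived from each record."""
    blocks = defaultdict(set)
    for entity_id, record in entities.items():
        blocks[key_fn(record)].add(entity_id)
    return blocks


def candidate_pairs(blocks):
    """Collect the distinct pairs of entities that co-occur in some block."""
    return {
        pair
        for members in blocks.values()
        for pair in combinations(sorted(members), 2)
    }


# Hypothetical toy collection and ground truth, for illustration only.
entities = {
    1: "smith john",
    2: "smith jon",
    3: "smyth john",
    4: "rossi maria",
    5: "rossi mario",
}
true_matches = {(1, 2), (1, 3), (2, 3)}

# Assumed blocking key: the first three characters of the record.
blocks = build_blocks(entities, key_fn=lambda record: record[:3])
candidates = candidate_pairs(blocks)

n = len(entities)
brute_force = n * (n - 1) // 2  # quadratic number of comparisons without blocking
reduction_ratio = 1 - len(candidates) / brute_force
pair_completeness = len(candidates & true_matches) / len(true_matches)
print(sorted(candidates), reduction_ratio, pair_completeness)
```

With this key, the record "smyth john" lands in its own block, so the two true matches involving it are never compared. That is the effectiveness cost paid for cutting the comparisons from ten to two.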


ADS: the adaptive data series index

2016


Online amnesic approximation of streaming time series

Palpanas

Vlachos

Keogh

et al.



Themis Palpanas

A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces
