2012
DOI: 10.1007/s13740-012-0015-8
On Generating Benchmark Data for Entity Matching

Abstract: Entity matching has been a fundamental task in every major integration and data cleaning effort. It aims at identifying whether two different pieces of information refer to the same real-world object. It can also form the basis of entity search, by finding the entities in a repository that best match a user specification. Despite the many different entity matching techniques that have been developed over time, there is still no widely accepted benchmark for evaluating and comparing them. This paper intr…
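The pairwise task described in the abstract, deciding whether two descriptions refer to the same real-world object, can be sketched with a simple token-based similarity test. This is a minimal illustration under invented assumptions (Jaccard similarity, an arbitrary 0.5 threshold, toy record strings); it is not the benchmark generation method the paper introduces.

```python
import re

def tokens(s: str) -> set[str]:
    """Lowercase word tokens of a record's textual description."""
    return set(re.findall(r"\w+", s.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def match(rec1: str, rec2: str, threshold: float = 0.5) -> bool:
    """Declare a match when token overlap reaches the (illustrative) threshold."""
    return jaccard(tokens(rec1), tokens(rec2)) >= threshold

# Two descriptions of the same movie, phrased differently by two sources:
print(match("The Godfather 1972 Coppola", "Godfather (1972), dir. Coppola"))  # prints True
```

Real matchers replace the single similarity and fixed threshold with per-attribute measures and learned decision rules, which is precisely the variation a benchmark must exercise.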

Cited by 26 publications (20 citation statements)
References 66 publications
“…In our experiments, we used the DBpedia (BTC12DBpedia) and Freebase (BTC12Freebase) datasets from BTC12, and the raw infoboxes from DBpedia 3.5 (Infoboxes), i.e., two different versions of DBpedia. We also included a movies dataset 7 , used in [15], extracted from DBpedia movies and IMDB, to validate the correctness of our algorithms.…”
Section: A. Datasets
confidence: 99%
“…However, these algorithms have not yet been experimentally evaluated with Linked Open Data (LOD) datasets exhibiting different characteristics in terms of the underlying number of entity types and size of entity descriptions (in terms of property-value pairs), as well as their structural (i.e., property vocabularies) and semantic (i.e., common property values and URLs) overlap. Existing works in ER benchmarks [7] and evaluation frameworks [11] focus on the similarity of descriptions and how these similarities affect the matching decision of ER, not explicitly on blocking. Their data variations (focusing on highly similar descriptions) are not adequate for evaluating blocking algorithms suited to the Web of data.…”
Section: Introduction
confidence: 99%
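The blocking step this citation statement contrasts with matching can be illustrated with generic token blocking: every record is indexed under each of its tokens, and only pairs sharing at least one block are compared, instead of all n*(n-1)/2 pairs. This is a standard sketch with invented toy records, not the specific algorithms evaluated in the cited works.

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(records: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase token (blocking key) to the ids of records containing it."""
    blocks: dict[str, set[str]] = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            blocks[tok].add(rid)
    return blocks

def candidate_pairs(blocks: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Record pairs co-occurring in at least one block; only these reach the matcher."""
    pairs: set[tuple[str, str]] = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Invented toy records, for illustration only:
records = {
    "d1": "godfather 1972 coppola",
    "d2": "godfather coppola crime",
    "d3": "inception 2010 nolan",
}
print(candidate_pairs(token_blocking(records)))  # {('d1', 'd2')}
```

Benchmarking blocking, as the statement argues, requires datasets where such keys vary in vocabulary and overlap, not only datasets of highly similar descriptions.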
“…Given a set of entity references, such as publication venue titles, entity resolution is the process of identifying which of them correspond to the same real-world entity [20]. In a recent survey on entity resolution (or entity matching), [21] presents an implementation of a framework for evaluating entity matching systems through a systematic generation of synthetic test cases. Other surveys and tutorials on entity resolution can be found in [22], [23], [24], and [25].…”
Section: Related Work
confidence: 99%
“…Unfortunately, for Big Data this is not a feasible solution, since it is not clear at what level the likelihood should be considered high; moreover, different situations may require different likelihood thresholds. The Big Data group platform is able to provide flexible on-the-fly integration [5] that, depending on the items of interest, decides what needs to be integrated and what not. This mode is more suitable for Big Data since no a priori decisions need to be made; yet the level of complexity increases considerably, which makes the task particularly challenging.…”
Section: The Platform Features
confidence: 99%