Scalable Matching and Clustering of Entities with FAMER

Saeedi, Alieh; Nentwig, Markus; Peukert, Eric; Rahm, Erhard

doi:10.7250/csimq.2018-16.04

Cited by 26 publications

(30 citation statements)

References 28 publications

“…In the simplest case, Connected Components [80,153] is applied to compute the transitive closure of the detected matches. This naive approach increases recall, but is rather sensitive to noise.…”

Section: Clustering Methods | mentioning

confidence: 99%

“…The final task in the end-to-end ER workflow is Clustering [80,126,153,154,155], which groups together the identified matches such that all descriptions within a cluster match. Its goal is actually to infer indirect matching relations among the detected pairs of matching descriptions so as to overcome possible limitations of the employed similarity functions.…”

Section: Q3 | mentioning

confidence: 99%

An Overview of End-to-End Entity Resolution for Big Data

Efthymiou et al. 2020

One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when we should satisfy more than one of the Big Data characteristics simultaneously (i.e., Volume and Velocity with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by database, semantic Web, and machine learning communities in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.

“…For Dirty ER, the simplest approach is Connected Components [31,32], which sets a cut-off threshold t and considers as matches all comparisons with a similarity score higher than t; then, it estimates the transitive closure of the matches. For higher robustness to noise, more advanced algorithms build clusters around selected entities that operate as centers.…”

Section: Entity Clustering (EC) | mentioning

confidence: 99%
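
The Connected Components approach described in the statement above can be sketched as follows. This is a minimal illustration, not code from FAMER or JedAI: the function name `connected_components_clusters`, the `(entity_a, entity_b, score)` input format, and the use of a union-find structure to compute the transitive closure are all assumptions made for the example.

```python
from collections import defaultdict


def connected_components_clusters(scored_pairs, threshold):
    """Group entities by the transitive closure of above-threshold matches.

    scored_pairs: iterable of (entity_a, entity_b, similarity) tuples
                  (hypothetical input format chosen for this sketch).
    threshold:    cut-off t; only pairs with similarity > t count as matches.
    """
    parent = {}

    def find(x):
        # Register unseen entities as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b, score in scored_pairs:
        # Register both entities so that unmatched ones end up as singletons.
        find(a)
        find(b)
        if score > threshold:
            union(a, b)

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


pairs = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.3), ("e", "f", 0.95)]
print(connected_components_clusters(pairs, threshold=0.5))
```

With a threshold of 0.5, `a`, `b` and `c` fall into one cluster even though `a` and `c` were never compared directly, which is the indirect matching the transitive closure provides. Lowering the threshold to 0.2 lets the weak `c`-`d` link pull `d` into the same cluster, which illustrates why a single spurious match can merge otherwise separate clusters.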

“…The first category includes the open-source tools that are crafted for structured data, namely Magellan [3], Dedupe [68], DuDe [69], Febrl [65], FRIL [70], OYSTER [71], Record Linkage [72] and FAMER [32]. All of them apply a budget-agnostic, schema-based end-to-end workflow that typically consists of two steps: Blocking and Matching.…”

Section: Related Work | mentioning

confidence: 99%

“…LogMap [78] logic-based constraints to exclude comparisons -ISUB [79] FAMER [32] SB, SN, Q-Grams -Jaro-Winkler, TruncateBegin, TruncateEnd, EditDistance, MongeElkan, Jaccard, DICE, Overlap ExtendedJaccard, Longest Common Substring, Numerical Similarity Max Distance, Numerical Similarity Max Percentage (a) Systems for structured data KnoFuss [73] Literal Blocking -edit-distance (DATE, DiceCoefficient, Jaccard, Jaro, JaroWinkler, Overlap, MongeElkan, SmithWaterman, TokenBased, TokenWise) SERIMI [74] logic-based constraints to exclude comparisons -n-gram based [80] SB -Cosine, Jaccard (c) Systems for both structured and semi-structured data Table 6: Technical features of the main open-source ER systems. LB stands for Learning-based, LF for learning-free, C-C for Clean-Clean ER and D for Dirty ER.…”

Section: Blocking | mentioning

confidence: 99%

Three-dimensional Entity Resolution with JedAI

Papadakis, Mandilaras, Gagliardelli, et al. 2020

Information Systems

Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of state-of-the-art ER techniques that have been proposed and examined independently, targeting parts of the ER end-to-end pipeline. This is a unique approach, as no other ER tool brings together so many established techniques. Instead, most ER tools merely convey a few techniques, those primarily developed by their creators. In addition to democratizing ER techniques, JedAI goes beyond the other ER tools by offering a series of unique characteristics: (i) It allows for building and benchmarking millions of ER pipelines. (ii) It is the only ER system that applies seamlessly to any combination of structured and/or semi-structured data. (iii) It constitutes the only ER system that runs seamlessly both on stand-alone computers and clusters of computers, through the parallel implementation of all algorithms in Apache Spark. (iv) It supports two different end-to-end workflows for carrying out batch ER (i.e., budget-agnostic), a schema-agnostic one based on blocks, and a schema-based one relying on similarity joins. (v) It adapts both end-to-end workflows to budget-aware (i.e., progressive) ER. We present in detail all features of JedAI, stressing the core characteristics that enhance its usability, and boost its versatility and effectiveness. We also compare it to the state-of-the-art in the field, qualitatively and quantitatively, demonstrating its state-of-the-art performance over a variety of large-scale datasets from different domains.

User Profile Linkage Across Multiple Social Platforms

Wang, Chen, et al. 2020

Lecture Notes in Computer Science
