Canonicalization of database records using adaptive similarity measures

Culotta, Aron; Wick, Michael; Hall, Robert; Marzilli, Matthew; McCallum, Andrew

doi:10.1145/1281192.1281217

Cited by 17 publications

(14 citation statements)

References 11 publications

(6 reference statements)

“…The name disambiguation methods proposed in the literature adopt a wide spectrum of solutions (Smalheiser & Torvik, 2009) that range from those based on supervised learning techniques (Han et al, 2004) to those that use some unsupervised or semi‐supervised clustering strategy (Bhattacharya & Getoor, 2006, 2007; Culotta et al, 2007; Han, Xu, et al, 2005; Han et al, 2005; Huang et al, 2006; On et al, 2005; Song et al, 2007; Torvik et al, 2005) or follow a graph‐oriented approach (Malin, 2005; On et al, 2006; On and Lee, 2007). In this section, we present a brief review of some representative name disambiguation methods.…”

Section: Related Work

mentioning

confidence: 99%

“…The complexity of dealing with this problem has led to a myriad of proposals of methods and approaches for its solution (Bhattacharya & Getoor, 2006, 2007; Culotta, Kanani, Hall, Wick, & McCallum, 2007; Han, Giles, Zha, Li, & Tsioutsiouliklis, 2004; Han, Xu, Zha, & Giles, 2005; Han, Zha, & Giles, 2005; Huang, Ertekin, & Giles, 2006; Kang et al, 2009; Lee et al, 2005; Malin, 2005; On, Lee, Kang, & Mitra, 2005; On, Elmacioglu, Lee, Kang, & Pei, 2006; On & Lee, 2007; Soler, 2007; Song, Huang, Councill, Li, & Giles, 2007; Torvik, Weeber, Swanson, & Smalheiser, 2005; Torvik & Smalheiser, 2009; Treeratpituk & Giles, 2009). However, despite all these efforts, there is still a lot of room for improvement of the current solutions.…”

Section: Introduction

mentioning

confidence: 99%

An unsupervised heuristic‐based hierarchical method for name disambiguation in bibliographic citations

Cota

Ferreira

Nascimento

et al. 2010

J. Am. Soc. Inf. Sci.

125

Name ambiguity in the context of bibliographic citations is a difficult problem which, despite the many efforts from the research community, still has a lot of room for improvement. In this article, we present a heuristic-based hierarchical clustering method to deal with this problem. The method successively fuses clusters of citations of similar author names based on several heuristics and similarity measures on the components of the citations (e.g., coauthor names, work title, and publication venue title). During the disambiguation task, the information about fused clusters is aggregated, providing more information for the next round of fusion. In order to demonstrate the effectiveness of our method, we ran a series of experiments on two different collections extracted from real-world digital libraries and compared it, under two metrics, with four representative methods described in the literature. We present comparisons of results using each considered attribute separately (i.e., coauthor names, work title, and publication venue title) with the author name attribute and using all attributes together. These results show that our unsupervised method, when using all attributes, performs competitively against all other methods, under both metrics, losing only in one case against a supervised method, whose result was very close to ours. Moreover, such results are achieved without the burden of any training and without using any privileged information such as knowing a priori the correct number of clusters.
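
The successive fusion described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a single hypothetical heuristic (two clusters fuse when they share a coauthor name), whereas the paper combines several heuristics and similarity measures. The `Cluster` class and `fuse_by_shared_coauthor` function are names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A group of citations believed to belong to one author (hypothetical type)."""
    citation_ids: set[int]
    coauthors: set[str] = field(default_factory=set)


def fuse_by_shared_coauthor(clusters: list[Cluster]) -> list[Cluster]:
    """Repeatedly merge any two clusters that share at least one coauthor name.

    Assumption: shared coauthors are the only fusion heuristic. After each
    merge the coauthor sets are unioned, so a fused cluster carries more
    evidence into the next round.
    """
    # Work on copies so the caller's clusters are not mutated.
    clusters = [Cluster(set(c.citation_ids), set(c.coauthors)) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i].coauthors & clusters[j].coauthors:
                    clusters[i].citation_ids |= clusters[j].citation_ids
                    clusters[i].coauthors |= clusters[j].coauthors
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the aggregated cluster
    return clusters


# Made-up example data: citations 1 and 3 share no coauthor directly,
# but both overlap with citation 2.
example = [
    Cluster({1}, {"B. Lee"}),
    Cluster({2}, {"B. Lee", "C. Kim"}),
    Cluster({3}, {"C. Kim"}),
    Cluster({4}, {"D. Roy"}),
]
result = fuse_by_shared_coauthor(example)
```

In this made-up example, citations 1 and 3 end up in the same cluster only because the cluster formed around citation 2 aggregates both coauthor names, which is the effect the abstract attributes to fusing information across rounds.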

“…This work is extended and applied successfully to record deduplication by Bilenko and Mooney [2]. Recently, Culotta et al [6] describe several methods for canonicalization of database records that are robust to noisy data and customizable to user preferences (e.g., a preference for acronyms versus full words).…”

Section: Canonicalization

mentioning

confidence: 99%

“…Even though we lack labeled data for canonicalization, we set these variables using a centroid-based approach with default settings for string edit parameters (insert, delete and substitute incur a penalty of one, and no penalty is given for copy). This method is shown in Culotta et al [6] to perform reasonably well and to capture many of the desirable properties of a canonical string. Even though we are able to achieve greater expressiveness in our model with cluster-wise first order features and high connectivity, we sacrifice the ability to apply exact inference and learning methods, since we cannot instantiate all of the Y variables.…”

Algorithm fragment that was interleaved with the quoted text:

coreference clustering C schema clustering S
2: while Not Converged do
3: C ⇐ GreedyAgglomerative(make-singletons(C), S)
4: S ⇐ GreedyAgglomerative(make-singletons(S), C)
5: end while
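
The centroid-based approach with unit edit costs mentioned in this statement can be sketched as follows. This is only an illustration under assumptions: it treats the "centroid" as the mention with the smallest total edit distance to the other mentions in its cluster, which may differ from the exact formulation in the cited papers. The function names `edit_distance` and `centroid_canonical` and the sample mentions are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete, substitute cost 1; copy costs 0."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                           # delete ca
                curr[j - 1] + 1,                       # insert cb
                prev[j - 1] + (0 if ca == cb else 1),  # copy or substitute
            ))
        prev = curr
    return prev[-1]


def centroid_canonical(mentions: Sequence[str]) -> str:
    """Pick the mention with the smallest summed edit distance to all mentions.

    Assumption: this medoid-style choice stands in for the centroid-based
    canonicalization referred to in the quoted statement.
    """
    if not mentions:
        raise ValueError("need at least one mention")
    return min(mentions, key=lambda m: sum(edit_distance(m, o) for o in mentions))


# Made-up cluster of coreferent name strings.
mentions = ["A. McCallum", "Andrew McCallum", "Andrew McCalum", "Andrew MacCallum"]
canonical = centroid_canonical(mentions)
```

With these made-up mentions the selected string is "Andrew McCallum", since it is within one edit of both misspelled variants. Ties are broken by input order, which a real system would likely want to handle more deliberately.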

Section: Model

mentioning

confidence: 99%

A unified approach for schema matching, coreference and canonicalization

Wick

Rohanimanesh

Schultz

et al. 2008

Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Self Cite

The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.

“…Perhaps the closest consideration lies in the problem of citation rectification.¹⁶ The citation rectification problem addresses the fact that applicable logical rules can be ill-specified and subject to uncertainty at the signal measurement level. The way in which a name, publication venue, institution, document title, or date is expressed is subject to great stylistic variation and ambiguity in spelling, abbreviation, and omission.…”

mentioning

confidence: 99%

Scientific challenges underlying production document processing

Saund

2011

SPIE Proceedings

The Field of Document Recognition is bipolar. On one end lies the excellent work of academic institutions engaging in original research on scientifically interesting topics. On the other end lies the document recognition industry which services needs for high-volume data capture for transaction and back-office applications. These realms seldom meet, yet the need is great to address technical hurdles for practical problems using modern approaches from the Document Recognition, Computer Vision, and Machine Learning disciplines. We reflect on three categories of problems we have encountered which are both scientifically challenging and of high practical value. These are Doctype Classification, Functional Role Labeling, and Document Sets. Doctype Classification asks, "What is the type of page I am looking at?" Functional Role Labeling asks, "What is the status of text and graphical elements in a model of document structure?" Document Sets asks, "How are pages and their contents related to one another?" Each of these has ad hoc engineering approaches that provide 40-80% solutions, and each of them begs for a deeply grounded formulation both to provide understanding and to attain the remaining 20-60% of practical value. The practical need is not purely technical but also depends on the user experience in application setup and configuration, and in collection and groundtruthing of sample documents. The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.
