Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2007
DOI: 10.1145/1281192.1281217

Canonicalization of database records using adaptive similarity measures

Abstract: It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a singl…

Cited by 17 publications (14 citation statements)
References 11 publications (6 reference statements)
“…The name disambiguation methods proposed in the literature adopt a wide spectrum of solutions (Smalheiser & Torvik, 2009) that range from those based on supervised learning techniques (Han et al, 2004) to those that use some unsupervised or semi‐supervised clustering strategy (Bhattacharya & Getoor, 2006, 2007; Culotta et al, 2007; Han, Xu, et al, 2005; Han et al, 2005; Huang et al, 2006; On et al, 2005; Song et al, 2007; Torvik et al, 2005) or follow a graph‐oriented approach (Malin, 2005; On et al, 2006; On and Lee, 2007). In this section, we present a brief review of some representative name disambiguation methods.…”
Section: Related Work (mentioning)
confidence: 99%
“…The complexity of dealing with this problem has led to a myriad of proposals of methods and approaches for its solution (Bhattacharya & Getoor, 2006, 2007; Culotta, Kanani, Hall, Wick, & McCallum, 2007; Han, Giles, Zha, Li, & Tsioutsiouliklis, 2004; Han, Xu, Zha, & Giles, 2005; Han, Zha, & Giles, 2005; Huang, Ertekin, & Giles, 2006; Kang et al, 2009; Lee et al, 2005; Malin, 2005; On, Lee, Kang, & Mitra, 2005; On, Elmacioglu, Lee, Kang, & Pei, 2006; On & Lee, 2007; Soler, 2007; Song, Huang, Councill, Li, & Giles, 2007; Torvik, Weeber, Swanson, & Smalheiser, 2005; Torvik & Smalheiser, 2009; Treeratpituk & Giles, 2009). However, despite all these efforts, there is still a lot of room for improvement of the current solutions.…”
Section: Introduction (mentioning)
confidence: 99%
“…This work is extended and applied successfully to record deduplication by Bilenko and Mooney [2]. Recently, Culotta et al [6] describe several methods for canonicalization of database records that are robust to noisy data and customizable to user preferences (e.g., a preference for acronyms versus full words).…”
Section: Canonicalization (mentioning)
confidence: 99%
“…Even though we lack labeled data for canonicalization, we set these variables using a centroid-based approach with default settings for string edit parameters (insert, delete and substitute incur a penalty of one, and no penalty is given for copy). This method is shown in Culotta et al [6] to perform reasonably well and to capture many of the desirable properties of a canonical string. Even though we are able to achieve greater expressiveness in our model with cluster-wise first order features and high connectivity, we sacrifice the ability to apply exact inference and learning methods, since we cannot instantiate all of the Y variables.…”
Algorithm listing that the extraction interleaved with the quoted passage:
    coreference clustering C schema clustering S
    2: while Not Converged do
    3:   C ⇐ GreedyAgglomerative(make-singletons(C), S)
    4:   S ⇐ GreedyAgglomerative(make-singletons(S), C)
    5: end while
Section: Model (mentioning)
confidence: 99%
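
The centroid-based approach mentioned in the statement above can be illustrated with a minimal sketch. It assumes that "centroid" means the cluster member with the smallest total edit distance to the other members, and it uses the unit costs quoted in the statement (insert, delete and substitute cost one, copy costs nothing). The function names `edit_distance` and `centroid_string` are hypothetical and are not taken from Culotta et al or from the citing paper.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete, substitute cost 1; copy costs 0."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            substitute_or_copy = prev[j - 1] + (0 if ca == cb else 1)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute_or_copy, delete, insert))
        prev = curr
    return prev[-1]


def centroid_string(cluster: Sequence[str]) -> str:
    """Pick the member with the smallest summed edit distance to all members.

    Assumption: this is one plausible reading of "centroid-based"; ties are
    resolved in favor of the earliest member in the input order.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one string")
    return min(cluster, key=lambda s: sum(edit_distance(s, t) for t in cluster))


if __name__ == "__main__":
    mentions = ["KDD", "KDD 2007", "KDD 07"]
    print(centroid_string(mentions))
```

With the three mentions above, the summed distances are 8 for "KDD", 7 for "KDD 2007" and 5 for "KDD 07", so the sketch selects "KDD 07". Learned or user-tuned edit costs, as hinted at in the citation statements, would replace the fixed unit costs here.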
“…Perhaps the closest consideration lies in the problem of citation rectification [16]. The citation rectification problem addresses the fact that applicable logical rules can be ill-specified and subject to uncertainty at the signal measurement level. The way in which a name, publication venue, institution, document title, or date is expressed is subject to great stylistic variation and ambiguity in spelling, abbreviation, and omission.…”
mentioning
confidence: 99%