A hierarchical naive Bayes mixture model for name disambiguation in author citations

Han, Hao; Xu, Wei; Zha, Hongyuan; Giles, C. Lee

doi:10.1145/1066677.1066920

Cited by 77 publications

(90 citation statements)

References 16 publications


“…They use a mix of techniques. While some use similarity functions [2,7,12,18,21,27,30], others use learning techniques [1,14,16,28,32,35], heuristics [17,19,20,24], classifiers [9,10,34] and clustering methods [11,31].…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…If no tuple is returned found, we attempt to retrieve the coauthor by the full name (lines 6-7), which is returned in case it is found (line 9). If there is no success and if the coauthor's name contains a period and/or a semicolon (which characterizes a citation name) (line 10), the heuristic tries to find the coauthor's CV by using the citation name (lines 11-14). Since this query may retrieve several tuples, we use a similarity function to find the most similar coauthor in the database (lines 15-24).…”

Section: Heuristic Matching Algorithm (mentioning)

confidence: 99%


Detecting referential inconsistencies in electronic CV datasets

Rubim, Braganholo. 2017. J Braz Comput Soc


One way to measure the scientific progress of a country is to evaluate the curriculum vitae (CV) of its researchers. In Brazil, this is not different. The Lattes Platform is an information system whose primary objective is to provide a single repository to store the CV of the Brazilian researchers. This system is increasingly acquiring expressiveness as the main source of information regarding the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gaging the national bibliographic production may be affected by the effect of ambiguities or referential inconsistencies in coauthoring citations. A first step towards solving this problem lies in identifying such inconsistencies. For that, we propose a heuristic-based approach that uses similarity search to match papers from coauthors of CV. We then use this technique to analyze over 2000 curricula of researchers from a given institution recovered from the Lattes Platform. The results indicate 18.98% of the analyzed publications present referential inconsistencies, which is a significant amount for a dataset that is supposed to be correct and trustable.
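The citation statement quoted above describes a three-step lookup: try the full name, fall back to the citation name when the string contains a period or semicolon, and use a similarity function when several tuples come back. The following is only a minimal sketch of that flow under stated assumptions, not the authors' algorithm: the in-memory `RESEARCHERS` list, the surname-prefix candidate query, and the use of `difflib.SequenceMatcher` as the similarity function are all hypothetical stand-ins for details the quotation does not give.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the CV database (names invented for illustration).
RESEARCHERS = [
    {"full_name": "Maria Souza Lima",
     "citation_names": ["LIMA, M. S.", "Lima, M.S."]},
    {"full_name": "Mario Silva Lima",
     "citation_names": ["LIMA, M. S.", "Lima, Mario S."]},
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like_citation_name(name):
    """The quoted heuristic: a period and/or semicolon marks a citation name."""
    return "." in name or ";" in name


def find_coauthor(name):
    # Step 1: try an exact match on the full name.
    for record in RESEARCHERS:
        if record["full_name"].lower() == name.lower():
            return record

    # Step 2: only fall back to citation names when the string looks like one.
    if not looks_like_citation_name(name):
        return None

    # Assumed candidate query: any citation-name variant sharing the surname.
    surname = name.split(",")[0].strip().lower()
    candidates = [
        record for record in RESEARCHERS
        if any(c.lower().startswith(surname) for c in record["citation_names"])
    ]
    if not candidates:
        return None

    # Step 3: several tuples may match, so keep the most similar candidate.
    def best_score(record):
        return max(similarity(name, c) for c in record["citation_names"])

    return max(candidates, key=best_score)
```

Calling `find_coauthor("Lima, Mario S.")` retrieves both Lima records in step 2 and relies on the similarity score in step 3 to choose between them. A real system would query the database rather than scan a list, and would likely need a minimum similarity threshold to avoid forcing a match.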


“…We can apply data mining classification methods, for example Bayes methods [23,18], decision trees [31] or SVM [7,11]. Unsupervised learning methods such as latent Dirichlet allocation [3] or clustering methods can also be used, if there is no training data.…”

Section: Related Work (mentioning)

confidence: 99%

Flexible and Efficient Distributed Resolution of Large Entities

Molnár, Benczúr, Sidló. 2012. Lecture Notes in Computer Science


Abstract. Entity resolution (ER) is a computationally hard problem of data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice these entities may consist of only a few records from different data sources with typos or historical data. In other cases they may contain significantly more records, especially when we search for entities on a higher level of a concept hierarchy than records. In this paper we give theoretical foundation of a variety of practically important match functions. We show that under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record match, enabling flexible match function definition. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.


“…To resolve this problem, some relational information is used to facilitate the disambiguation task. For example, Han et al [12] try to improve disambiguation accuracy by clustering title words and venue words with similar concepts. Song et al [13] introduce the relationships between authors and topics in citations to improve the disambiguation accuracy by extracting the wordbased relationships for each topic.…”

Section: Related Work (mentioning)

confidence: 99%

Author Name Disambiguation for Citations Using Topic and Web Correlation

Yang, Peng, Jiang, et al. 2008. Research and Advanced Technology for Digital Libraries


Abstract. Today, bibliographic digital libraries play an important role in helping members of academic community search for novel research. In particular, author disambiguation for citations is a major problem during the data integration and cleaning process, since author names are usually very ambiguous. For solving this problem, we proposed two kinds of correlations between citations, namely, Topic Correlation and Web Correlation, to exploit relationships between citations, in order to identify whether two citations with the same author name refer to the same individual. The topic correlation measures the similarity between research topics of two citations; while the Web correlation measures the number of co-occurrence in web pages. We employ a pair-wise grouping algorithm to group citations into clusters. The results of experiments show that the disambiguation accuracy has great improvement when using topic correlation and Web correlation, and Web correlation provides stronger evidences about the authors of citations.
