Abstract: Because of name variations, an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for name disambiguation in author citations. This method partitions a collection of citations into clusters, with each cluster containing only citations authore…
“…They use a mix of techniques. While some use similarity functions [2,7,12,18,21,27,30], others use learning techniques [1,14,16,28,32,35], heuristics [17,19,20,24], classifiers [9,10,34] and clustering methods [11,31].…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…If no tuple is returned found, we attempt to retrieve the coauthor by the full name (lines 6-7), which is returned in case it is found (line 9). If there is no success and if the coauthor's name contains a period and/or a semicolon (which characterizes a citation name) (line 10), the heuristic tries to find the coauthor's CV by using the citation name (lines 11-14). Since this query may retrieve several tuples, we use a similarity function to find the most similar coauthor in the database (lines 15-24).…”
One way to measure the scientific progress of a country is to evaluate the curriculum vitae (CV) of its researchers. In Brazil, this is not different. The Lattes Platform is an information system whose primary objective is to provide a single repository to store the CV of the Brazilian researchers. This system is increasingly acquiring expressiveness as the main source of information regarding the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gaging the national bibliographic production may be affected by the effect of ambiguities or referential inconsistencies in coauthoring citations. A first step towards solving this problem lies in identifying such inconsistencies. For that, we propose a heuristic-based approach that uses similarity search to match papers from coauthors of CV. We then use this technique to analyze over 2000 curricula of researchers from a given institution recovered from the Lattes Platform. The results indicate 18.98% of the analyzed publications present referential inconsistencies, which is a significant amount for a dataset that is supposed to be correct and trustable.
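The lookup described in the quoted passage can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the in-memory `CVS` list, the record fields, the `find_coauthor` and `similarity` names, the use of `difflib.SequenceMatcher` as the similarity function, and the `threshold` value are all assumptions introduced here.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the CV database: one record per researcher.
CVS = [
    {"full_name": "Maria Clara Souza", "citation_names": ["SOUZA, M. C.", "SOUZA, M."]},
    {"full_name": "Marcos Souza Lima", "citation_names": ["LIMA, M. S."]},
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_coauthor(name, cvs=CVS, threshold=0.8):
    """Return the CV record that best matches `name`, or None."""
    # Step 1: try an exact match on the full name.
    for cv in cvs:
        if cv["full_name"].lower() == name.lower():
            return cv

    # Step 2: continue only if the name looks like a citation name,
    # i.e. it contains a period and/or a semicolon.
    if "." not in name and ";" not in name:
        return None

    # Step 3: several records may be candidates, so keep the most similar one.
    best, best_score = None, 0.0
    for cv in cvs:
        for citation_name in cv["citation_names"]:
            score = similarity(name, citation_name)
            if score > best_score:
                best, best_score = cv, score

    # The cut-off is an assumption; the quoted passage does not state one.
    return best if best_score >= threshold else None
```

For instance, `find_coauthor("Souza, M.C.")` falls through to the similarity step and selects the first record, while a name without a period or semicolon that has no exact match returns `None`.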
“…We can apply data mining classification methods, for example Bayes methods [23,18], decision trees [31] or SVM [7,11]. Unsupervised learning methods such as latent Dirichlet allocation [3] or clustering methods can also be used, if there is no training data.…”
Abstract. Entity resolution (ER) is a computationally hard problem of data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice these entities may consist of only a few records from different data sources with typos or historical data. In other cases they may contain significantly more records, especially when we search for entities on a higher level of a concept hierarchy than records. In this paper we give a theoretical foundation of a variety of practically important match functions. We show that under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record match, enabling flexible match function definition. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
“…To resolve this problem, some relational information is used to facilitate the disambiguation task. For example, Han et al [12] try to improve disambiguation accuracy by clustering title words and venue words with similar concepts. Song et al [13] introduce the relationships between authors and topics in citations to improve the disambiguation accuracy by extracting the word-based relationships for each topic.…”
Abstract. Today, bibliographic digital libraries play an important role in helping members of academic community search for novel research. In particular, author disambiguation for citations is a major problem during the data integration and cleaning process, since author names are usually very ambiguous. For solving this problem, we proposed two kinds of correlations between citations, namely, Topic Correlation and Web Correlation, to exploit relationships between citations, in order to identify whether two citations with the same author name refer to the same individual. The topic correlation measures the similarity between research topics of two citations; while the Web correlation measures the number of co-occurrence in web pages. We employ a pair-wise grouping algorithm to group citations into clusters. The results of experiments show that the disambiguation accuracy has great improvement when using topic correlation and Web correlation, and Web correlation provides stronger evidences about the authors of citations.
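As a rough illustration of pair-wise grouping driven by two kinds of evidence, the sketch below merges citations whose pair passes either a topic or a Web test. The abstract gives no formulas, so everything concrete here is an assumption: Jaccard overlap of topic words stands in for the topic correlation, a precomputed co-occurrence count stands in for the Web correlation, and the names `jaccard`, `group_citations`, `topic_min`, and `web_min` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two word collections in [0, 1] (assumed topic measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_citations(topics, web_cooccurrence, topic_min=0.5, web_min=1):
    """Group citation ids into clusters by pair-wise evidence.

    topics: dict mapping citation id -> iterable of topic words
    web_cooccurrence: dict mapping frozenset({id_a, id_b}) -> page count
    """
    parent = {c: c for c in topics}  # union-find forest over citation ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(topics, 2):
        topic_corr = jaccard(topics[a], topics[b])
        web_corr = web_cooccurrence.get(frozenset((a, b)), 0)
        # Merge when either signal is strong enough (assumed rule).
        if topic_corr >= topic_min or web_corr >= web_min:
            parent[find(a)] = find(b)

    clusters = {}
    for c in topics:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(clusters.values(), key=sorted)


topics = {
    "c1": ["ranking", "retrieval"],
    "c2": ["retrieval", "ranking", "web"],
    "c3": ["protein", "folding"],
    "c4": ["genome"],
}
web = {frozenset(("c3", "c4")): 2}
clusters = group_citations(topics, web)
```

In this toy input, `c1` and `c2` are grouped by topic overlap, and `c3` and `c4` are grouped only because of the Web co-occurrence count, which mirrors the abstract's point that the two signals can complement each other.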