Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Schulz, Jan

doi:10.1007/s11192-016-1892-7

Cited by 29 publications

(25 citation statements)

References 52 publications


“…Recently, such distortive effects of ambiguous bibliographic data have been discussed for bibliometrics in general as well as network measures (e.g., Schulz, 2016; Strotmann & Zhao, 2012; van den Besselaar & Sandström, 2016). First, scholars should be warned that author name ambiguity can be detrimental to the study of collaboration networks by generating merged and/or split nodal entities.…”

Section: Conclusion and Discussion (mentioning)

confidence: 99%

Scale‐free collaboration networks: An author name disambiguation perspective

Kim

2019

Asso for Info Science & Tech


Several studies have found that collaboration networks are scale-free, proposing that such networks can be modeled by specific network evolution mechanisms like preferential attachment. This study argues that collaboration networks can look more or less scale-free depending on the methods for resolving author name ambiguity in bibliographic data. Analyzing networks constructed from multiple datasets containing 3.4 M to 9.6 M publication records, this study shows that collaboration networks in which author names are disambiguated by the commonly used heuristic, i.e., forename-initial-based name matching, tend to produce degree distributions better fitted to power-law slopes with the typical scaling parameter (2 < α < 3) than networks disambiguated by more accurate algorithm-based methods. Such tendency is observed across collaboration networks generated under various conditions such as cumulative years, 5- and 1-year sliding windows, and random sampling, and through simulation, found to arise due mainly to artefactual entities created by inaccurate disambiguation. This cautionary study calls for special attention from scholars analyzing network data in which entities such as people, organization, and gene can be merged or split by improper disambiguation.
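The forename-initial-based heuristic mentioned in this abstract can be illustrated with a minimal sketch. It assumes author strings in a "Surname, Forename" layout; the function names and the sample names are hypothetical and are not taken from the cited study.

```python
from collections import defaultdict


def initial_key(full_name: str) -> str:
    """Reduce 'Surname, Forename ...' to 'surname f' (surname plus first forename initial)."""
    surname, _, forenames = full_name.partition(",")
    forenames = forenames.strip()
    initial = forenames[0].lower() if forenames else ""
    return f"{surname.strip().lower()} {initial}".strip()


def group_by_initial(names):
    """Group name strings that the heuristic would treat as one author."""
    groups = defaultdict(set)
    for name in names:
        groups[initial_key(name)].add(name)
    return dict(groups)


# Hypothetical author strings, for illustration only.
names = ["Lee, Anna", "Lee, Adam", "Lee, A.", "Park, Bo"]
groups = group_by_initial(names)
```

Here "Lee, Anna" and "Lee, Adam" share the key "lee a", so the heuristic would merge two possibly distinct people into a single node, which is the kind of artefactual entity the abstract warns about.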


“…A challenge is that if we use many features, we cannot distinguish the impact of different positive-negative training data ratios from the impact of feature effectiveness. So, we tried to select a minimum set of features - coauthor names and title words - which are commonly used in most disambiguation studies and have been found to be effective in disambiguating names (Ferreira et al, 2012; Schulz, 2016; Wang et al, 2012). Another reason is that these two features are available across all labeled datasets used in this study, while other features such as affiliation, journal names, and references are recorded in some data but not in another.…”

Section: Machine Learning Settings (mentioning)

confidence: 99%

The impact of imbalanced training data on machine learning for author name disambiguation

Kim

2018

Scientometrics


In supervised machine learning for author name disambiguation, negative training data are often dominantly larger than positive training data. This paper examines how the ratios of negative to positive training data can affect the performance of machine learning algorithms to disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers - Logistic Regression, Naïve Bayes, and Random Forest - are trained through representative features such as coauthor names, and title words extracted from the same training data but with various positive-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance but with a few percent of performance gains and sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive and negative training data. Also, the performance improvement by Random Forest tends to quickly saturate roughly after 1:10 ~ 1:15. These findings imply that contrary to the common practice using all training data, name disambiguation algorithms can be trained using part of negative training data without degrading much disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
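Training on a fixed positive-negative ratio, as this abstract describes, amounts to keeping every positive example and drawing only a capped number of negatives. The following is a rough sketch of that idea using the Python standard library only; the function name, the integer `ratio` parameter, and the toy data are assumptions for illustration and do not reproduce the cited paper's procedure.

```python
import random


def subsample_negatives(pairs, labels, ratio, seed=0):
    """Keep all positive pairs and at most ratio * (number of positives) negative pairs."""
    rng = random.Random(seed)
    pos = [p for p, y in zip(pairs, labels) if y == 1]
    neg = [p for p, y in zip(pairs, labels) if y == 0]
    k = min(len(neg), ratio * len(pos))  # cap at the negatives actually available
    kept_neg = rng.sample(neg, k)        # sample without replacement
    data = [(p, 1) for p in pos] + [(p, 0) for p in kept_neg]
    rng.shuffle(data)
    return data


# Toy data: 10 positive and 110 negative record pairs (hypothetical identifiers).
pairs = list(range(120))
labels = [1] * 10 + [0] * 110
train_1_to_10 = subsample_negatives(pairs, labels, ratio=10)
```

With `ratio=10` the sketch keeps the 10 positives and 100 of the 110 negatives. Any classifier could then be trained on `train_1_to_10` and compared against one trained on the full data.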


“…This challenge also holds at the disciplinary level, as investigated for Chemistry, Physics, Medicine, and Economics and Business by Harzing (2015). Even if the problem would be less pronounced at the specialty level, results may still benefit from more advanced disambiguation methods (Schulz, 2016).…”

Section: Fa-a(1) (mentioning)

confidence: 99%

Bibliometric approximation of a scientific specialty by combining key sources, title words, authors and references

Rons

2018

Journal of Informetrics


Bibliometric methods for the analysis of highly specialized subjects are increasingly investigated and debated. Information and assessments well-focused at the specialty level can help make important decisions in research and innovation policy. This paper presents a novel method to approximate the specialty to which a given publication record belongs. The method partially combines sets of key values for four publication data fields: source, title, authors and references. The approach is founded in concepts defining research disciplines and scholarly communication, and in empirically observed regularities in publication data. The resulting specialty approximation consists of publications associated to the investigated publication record via key values for at least three of the four data fields. This paper describes the method and illustrates it with an application to publication records of individual scientists. The illustration also successfully tests the focus of the specialty approximation in terms of its ability to connect and help identify peers. Potential tracks for further investigation include analyses involving other kinds of specialized publication records, studies for a broader range of specialties, and exploration of the potential for diverse applications in research and research policy context.
