Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference

Sebo, Paul

doi:10.5195/jmla.2021.1252

Cited by 21 publications

(22 citation statements)

References 6 publications


“…In addition, genderize.io has been found by independent researchers to have an error rate comparable to other published gender prediction methods, with an error rate on predicted names below 6% [31, 32]. However, it should be noted that the error rate varies by name origin with the largest decrease in performance on names with an Asian origin [31, 32].…”

Section: Results (mentioning)

confidence: 97%

Analysis of science journalism reveals gender and regional disparities in coverage

Davidson, Greene

2021

Preprint


Scientific journalism is a critical way in which the public can remain informed and benefit from new scientific findings. Such journalism also shapes the public's view of the current state of scientific findings and legitimizes experts. Those covering science can only cite and quote a limited number of sources. Sources may be identified by the journalist's research or by recommendations by other scientists. In both cases, biases may influence who is identified and ultimately included as an expert. We analyzed 22,001 non-research articles published by Nature to quantify possible disparities. Our analysis considered three possible sources of disparity: gender, name origin, and country affiliation. To explore these sources of disparity, we extracted cited authors' names and affiliations, as well as extracted names of quoted speakers. While citations and quotations within a piece do not reflect the entire information-gathering process, they can provide insight into the demographics of visible sources. We then used the extracted names to predict gender and name origin of the cited authors and speakers. In order to appropriately quantify the level of difference, we must identify a suitable reference set for comparison. We chose first and last authors within primary research articles in Nature and a subset of Springer Nature articles in the same time period as our comparator. In our analysis, we found a skew towards male quotation in Nature journalism-related articles, but quotation is trending toward equal representation at a faster rate than first and last authorship in academic publishing. Interestingly, we found that the gender disparity in quotes was column-dependent, with the "Career Features" column reaching gender parity. Our name origin analysis found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin. 
This finding was observed both in extracted quotes and journal citations, but dampened in citations. Finally, we performed an analysis to identify how countries vary in the way that they're described in scientific journalism. We focused on two groups of countries: countries that are often mentioned in articles, but do not often have affiliated authors cited, and countries that have affiliated authors that are often cited, but the country is not typically mentioned. We found that the articles in which the less cited countries occur tend to have more agricultural, extraction-related, and political terms, whereas articles including highly cited countries have broader scientific terms. This discrepancy indicates a possible lack of regional diversity in the reporting of scientific output.


“…Some approaches use other features instead of whole names. For example, Jensen et al. [17] use n-grams of letters within names; Sebo [19] transforms names to conform to the reference data set by removing diacritics and the second part of compound names; and others bring in additional data besides names alone. [20][21][22][23] These approaches can improve accuracy, but ultimately information-theoretic limits prevent these algorithms from outperforming the Bayes error rate of their design (e.g., Leslies in Utah in 2015 may have a different proportion of women than Leslies overall, but the core problem remains unchanged: some will be misgendered).…”

Section: Methods and Limits Of Imputation (mentioning)

confidence: 99%

Name-Based Demographic Inference and the Unequal Distribution of Misrecognition

Lockhart, King, Munsch

2022

Preprint


Academics and companies increasingly draw on large datasets to understand the social world. Name-based demographic ascription tools are widespread for imputing information like gender and race that are often missing from these large datasets, but these approaches have drawn criticism on ethical, empirical, and theoretical grounds. Employing a survey of all authors listed on articles in sociology, economics, and communications journals in the Web of Science between 2015 and 2020, we compared self-identified demographics with name-based imputations of gender and race/ethnicity for 19,924 scholars across four gender ascription tools (genderize.io, M3-inference, R’s `predictrace` and `gender` packages) and four race/ethnicity ascription tools (ethnicolor’s Florida and North Carolina voter models, and R’s `predictrace` and wru packages). We find substantial inequalities in how these tools misgender and misrecognize the race/ethnicity of authors, distributing erroneous ascriptions unevenly along other demographic traits. Because of the empirical and ethical consequences of these errors, scholars need to be cautious with the use of name-based demographic imputation, particularly when studying subgroups. We recommend five principles for the responsible use of name-based demographic ascription.
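The citation statement above describes Sebo's approach as transforming names by removing diacritics and the second part of compound names. A minimal sketch of that kind of preprocessing is shown below. The function name `normalize_first_name` is hypothetical, and treating hyphens and whitespace as the separators of a compound name is an assumption made here for illustration; the source does not specify the exact rules.

```python
import re
import unicodedata


def normalize_first_name(name: str) -> str:
    """Strip diacritics and keep only the first part of a compound first name.

    Hypothetical helper: the separator set (hyphen, whitespace) is an
    assumption, not a rule taken from the cited paper.
    """
    # Decompose accented characters (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name.strip())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Split on hyphens or whitespace and keep the first non-empty part.
    parts = [part for part in re.split(r"[-\s]+", no_marks) if part]
    return parts[0] if parts else ""


print(normalize_first_name("Jean-François"))  # Jean
print(normalize_first_name("José"))           # Jose
print(normalize_first_name("Anne Marie"))     # Anne
```

Unicode decomposition only removes marks that exist as separate combining characters, so letters such as "ø" or "ł" pass through unchanged and would need an explicit mapping if the reference data set expects plain ASCII.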


“…Santamaría and Mihaljević (2018) and Sebo (2021a) extensively review these and other gender detection tools. In Genderize.Io, we used the technique recommended by Sebo (2021b) to improve accuracy. Generally, these tools report the proportion and number of times a name is associated with men or women, alongside the number of examples checked.…”

Section: Methods (mentioning)

confidence: 99%

Gender gap among highly cited researchers, 2014–2021

Meho

2022

Quantitative Science Studies


This study examines the extent to which women are represented among the world’s highly cited researchers (HCRs) and explores their representation over time and across fields, regions, and countries. The study identifies 11,842 HCRs in all fields and uses Gender-API, Genderize.Io, Namsor, and the web to identify their gender. Women’s share of HCRs grew from 13.1% in 2014 to 14.0% in 2021; however, the increase is slower than that of women’s representation among the general population of authors. The data show that women’s share of HCRs would need to increase by 100% in health and social sciences, 200% in agriculture, biology, earth, and environmental sciences, 300% in mathematics and physics, and 500% in chemistry, computer science, and engineering to close the gap with men. Women’s representation among all HCRs in North America, Europe, and Oceania ranges from 15% to 18%, compared to a world average of 13.7%. Among countries with the highest number of HCRs, the gender gap is least evident in Switzerland, Brazil, Norway, the UK, and the US and most noticeable in Asian countries. The study reviews factors that can be seen to influence the gender gap among HCRs and makes recommendations for improvement. Peer Review https://publons.com/publon/10.1162/qss_a_00218
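The citation statement for this entry notes that these tools report the proportion and number of times a name is associated with men or women. One way such output is commonly used is to accept a prediction only when both values clear a threshold. The sketch below assumes a response dictionary with `gender`, `probability`, and `count` fields, which is how the genderize.io response is understood here; the function name `accept_prediction`, the default thresholds, and the sample values are all hypothetical and not taken from any of the cited studies.

```python
from typing import Optional


def accept_prediction(
    response: dict,
    min_probability: float = 0.8,  # hypothetical threshold
    min_count: int = 10,           # hypothetical threshold
) -> Optional[str]:
    """Return the predicted gender, or None if the prediction is too weak."""
    gender = response.get("gender")
    if gender is None:
        # The tool could not assign a gender to this name.
        return None
    if response.get("probability", 0.0) < min_probability:
        return None
    if response.get("count", 0) < min_count:
        return None
    return gender


# Invented sample values, for illustration only.
sample = {"name": "leslie", "gender": "female", "probability": 0.75, "count": 1200}
print(accept_prediction(sample))                       # None
print(accept_prediction(sample, min_probability=0.7))  # female
```

Raising the thresholds trades coverage for accuracy: fewer names receive a gender, but those that do are less likely to be misclassified.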
