The World Wide Web Conference 2019
DOI: 10.1145/3308558.3313684

Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Abstract: Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their nonrepresentativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingu…

Citations: cited by 175 publications (207 citation statements)
References: 52 publications (78 reference statements)
“…Sample demographics. The age and gender distributions of our D and R cohorts align with previous studies [32][33][34] as indicated by the M3 classifier 35 that we used to predict individual's gender (M3 Macro-F1: 0.915) and age (M3 Macro-F1: 0.425) categories. As shown in Table 2, our D cohort has a similar 2:1 female-to-male ratio as observed in clinical depression studies 32,33, indicating that the demographics of our Twitter cohort closely match previous clinical findings that women are twice as likely to be diagnosed with depression compared with men.…”
Section: Results (supporting)
confidence: 81%
“…However, demographic information can be reliably inferred from a variety of account characteristics, such as the individual's name and 'screen name', profile photograph and biographies. To infer the demographic information of all Twitter accounts, we used the M3 system 35 , which is a highly accurate deep learning classifier that was trained on a massive Twitter dataset using profile images, screen names, names and biographies. The classifier is built to classify an account along three categories; (1) gender (male/female, Macro-F1: 0.915), (2) age ('18 and below', '19-29', '30-39' and '40 and up', Macro-F1: 0.425) and (3) organization (individual versus organizational account, Macro-F1: 0.898).…”
Section: Methods (mentioning)
confidence: 99%
“…Neither the reported diagnosis nor the Twitter profiles of the sampled individuals provide demographic information with respect to our D cohort. However, a highly accurate sex classifier 45 (Macro-F1: 0.915) applied to the Twitter profiles of our D cohort (see "Methods"), shows that it has a similar 2:1 female to male ratio as observed in clinical studies 46, indicating that the demographics of our Twitter cohort closely match previous clinical findings. The indicated age distribution of our D cohort (though less reliable, Macro-F1: 0.425), is also in line with clinical studies 46,47 , specifically we find a decreasing number of individuals per age-group as the age of the group increases in our D cohort.…”
Section: Cohort Definition (supporting)
confidence: 69%
“…The frequency of tweets per day between cases and controls in the 365 days preceding an event were not different (Student's t test; mean tweets per day cases: 2.61 ± 9.88, mean tweets per day controls: 3.34 ± 16.08, p = 0.29). Using the M3inference package in python 47 , we estimated the age and sex of each individual where possible. The algorithm combines a convolutional neural network assessment of user profile picture, user written description, and user name to estimate the probability of the user belonging to an age class of ≤18, between 20-29, 30-39, and over 40.…”
Section: Sample Demographics (mentioning)
confidence: 99%