Proceedings of the 2018 EMNLP Workshop W-Nut: The 4th Workshop on Noisy User-Generated Text 2018
DOI: 10.18653/v1/w18-6103

Geocoding Without Geotags: A Text-based Approach for reddit

Abstract: In this paper, we introduce the first geolocation inference approach for reddit, a social media platform where user pseudonymity has thus far made supervised demographic inference difficult to implement and validate. In particular, we design a text-based heuristic schema to generate ground truth location labels for reddit users in the absence of explicitly geotagged data. After evaluating the accuracy of our labeling procedure, we train and test several geolocation inference models across our reddit data set a…

Citations: cited by 16 publications (19 citation statements)
References: 31 publications (37 reference statements)

“…The Hybrid method, subsequently used in Section 4.3.2, achieves 99% test set accuracy and 68% coverage on the full dataset. Harrigian's (2018) method assigns a country estimate to every user with 78% test set accuracy. For gender, accuracy decreases from the Username, Self-reported, and Language use method, while coverage increases 11 .…”
Section: Evaluation of NLP Methods (mentioning)
confidence: 99%
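
The statement above reports accuracy and coverage as separate figures. A minimal sketch of how the two could be computed for a labeling method that may abstain is given here. The function name, the dict-based inputs, and the exact definitions (coverage over all users, accuracy over labeled users with a known true label) are assumptions for illustration, not definitions taken from the citing paper.

```python
def accuracy_and_coverage(predictions, gold):
    """Score a labeler that may abstain.

    predictions: dict mapping user -> predicted label, or None when the
                 method assigns no label (hypothetical input format).
    gold:        dict mapping user -> true label, for the evaluated subset.
    """
    # Users for whom the method produced any label at all.
    labeled = {u: p for u, p in predictions.items() if p is not None}

    # Coverage: share of all users that received a label.
    coverage = len(labeled) / len(predictions) if predictions else 0.0

    # Accuracy: share of correct labels among labeled users with a gold label.
    scored = [u for u in labeled if u in gold]
    correct = sum(labeled[u] == gold[u] for u in scored)
    accuracy = correct / len(scored) if scored else 0.0

    return accuracy, coverage


if __name__ == "__main__":
    preds = {"a": "US", "b": None, "c": "UK", "d": "DE"}
    truth = {"a": "US", "c": "CA"}
    print(accuracy_and_coverage(preds, truth))
```

Under these assumed definitions, a method that labels every user has a coverage of 1.0 by construction, so its accuracy is the only figure that varies.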
“…The only published method for Reddit user localisation to date (Harrigian, 2018) infers a user's country of residence via a dirichlet process mixture model 4 . It uses the distribution of words, posts per subreddit, and posts per hour of the day (timezone proxy) of a user's up to 250 most recent comments.…”
Section: Country of Residence (mentioning)
confidence: 99%
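
The three feature families named in the statement above (word distribution, posts per subreddit, posts per hour of day, over a user's up to 250 most recent comments) can be sketched as simple counts. This covers only the feature-counting step and does not implement the Dirichlet process mixture model. The comment fields `body`, `subreddit`, and `created_utc`, the whitespace tokenization, and the use of UTC hours are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

MAX_COMMENTS = 250  # cap taken from the quoted description


def user_features(comments, max_comments=MAX_COMMENTS):
    """Count words, subreddits, and posting hours for one user.

    comments: list of dicts with the assumed keys 'body' (text),
              'subreddit' (name), and 'created_utc' (Unix timestamp).
    """
    # Keep only the most recent comments, newest first.
    recent = sorted(comments, key=lambda c: c["created_utc"], reverse=True)
    recent = recent[:max_comments]

    words, subreddits, hours = Counter(), Counter(), Counter()
    for comment in recent:
        # Naive lowercase whitespace tokenization (an assumption).
        words.update(comment["body"].lower().split())
        subreddits[comment["subreddit"]] += 1
        # Hour of day in UTC, usable as a rough timezone proxy.
        posted = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
        hours[posted.hour] += 1

    return {"words": words, "subreddits": subreddits, "hours": hours}


if __name__ == "__main__":
    sample = [
        {"body": "Hello hello world", "subreddit": "london", "created_utc": 0},
        {"body": "tea time", "subreddit": "london", "created_utc": 13 * 3600},
    ]
    print(user_features(sample)["hours"])
```

Counts like these would typically be normalized into distributions before being passed to a downstream model.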
“…Predicting latent user demographics on social media is a popular area of research, with researchers developing models to infer categories such as gender [17][18][19], age [20][21][22], race and ethnicity [23][24][25], and political orientation [26][27][28]. Specifically, latent attribute models trained on Reddit data are becoming increasingly popular, with models for inferring geolocation [29], gender [30], and mental health status [31,32], all recently published.…”
Section: Introduction (mentioning)
confidence: 99%
“…Text features and network information can identify user attribute parameters [119]. Location tags for Reddit users can be generated despite the absence of explicit geotagging data [120]. By mining the common characteristics of the same user from different networks, users can be matched in heterogeneous social networks.…”
Section: B. Social Network Analysis (mentioning)
confidence: 99%