2018
DOI: 10.1371/journal.pone.0200162

Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus

Abstract: As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include “every publicly available Reddit comment” which was quickly shared on Bittorrent and the Internet Archive. This data quickly became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We…

Cited by 75 publications (66 citation statements)
References 17 publications
“…A dataset containing the vast majority of the submissions and comments published on Reddit since 2007 is publicly available [7,8]. We gathered the data for the year 2017, which is nearly complete, according to recent estimates [54]. In total, we collected 96,212,869 submissions and 886,886,260 comments from 13,874,369 users.…”
Section: Reddit Comments | Citation type: mentioning
confidence: 99%
“…Despite its recognized quality, the dataset is not flawless either. Gaffney and Matias [10] report several inconsistencies in the data. For example, comment and post data before 2008 appears to be hugely corrupted, having around 80% of posts missing, as well as 90% posts information of one month data at the interface between 2009 and 2010.…”
Section: The Reddit Dataset | Citation type: mentioning
confidence: 99%
“…Using Reddit consists of reading and posting comments, which consist of informal text, primarily in English, each appearing within a particular subreddit, which we treat as a categorical feature providing useful contextual signal in characterizing users. We introduce a new benchmark author identification corpus derived from the API (Gaffney and Matias, 2018) containing Reddit comments [Footnote 2: One way to calculate cos(θ + m) from cos θ is cos θ cos m − sin θ sin m, where sin θ is calculated as √(1 − cos² θ). Note that this calculation discards the sign of θ.]…”
Section: Reddit Benchmark | Citation type: mentioning
confidence: 99%
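
The footnote quoted in the last citation statement describes computing cos(θ + m) from cos θ alone. A minimal Python sketch of that identity follows; the function name `cos_theta_plus_m` and the sample values are illustrative assumptions, not taken from the citing paper. It also shows what "discards the sign of θ" means in practice: because sin θ is taken as the non-negative square root, θ and −θ give the same result.

```python
import math


def cos_theta_plus_m(cos_theta: float, m: float) -> float:
    """Return cos(theta + m) given only cos(theta), assuming sin(theta) >= 0.

    Hypothetical helper for illustration; uses
    cos(theta + m) = cos(theta) cos(m) - sin(theta) sin(m)
    with sin(theta) = sqrt(1 - cos(theta)^2).
    """
    # Clamp at zero to guard against tiny negative values from rounding.
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta * cos_theta))
    return cos_theta * math.cos(m) - sin_theta * math.sin(m)


theta, m = 0.7, 0.2  # arbitrary sample angle and margin, in radians

# For a non-negative angle the result matches cos(theta + m).
pos = cos_theta_plus_m(math.cos(theta), m)

# For -theta the input cos(-theta) equals cos(theta), so the output is
# again cos(theta + m), not cos(-theta + m): the sign of theta is lost.
neg = cos_theta_plus_m(math.cos(-theta), m)
```

Under these assumptions, `pos` and `neg` agree with each other and with `math.cos(theta + m)`, while `math.cos(-theta + m)` differs.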