Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus

Gaffney, Devin; Matias, J. Nathan

doi:10.1371/journal.pone.0200162

Cited by 75 publications

(66 citation statements)

References 17 publications

“…A dataset containing the vast majority of the submissions and comments published on Reddit since 2007 is publicly available [7,8]. We gathered the data for the year 2017, which is nearly complete, according to recent estimates [54]. In total, we collected 96,212,869 submissions and 886,886,260 comments from 13,874,369 users.…”

Section: Reddit Comments (mentioning)

confidence: 99%

Ten Social Dimensions of Conversations and Relationships

Choi

Aiello

Varga

et al. 2020

Proceedings of the Web Conference 2020

Decades of social science research identified ten fundamental dimensions that provide the conceptual building blocks to describe the nature of human relationships. Yet, it is not clear to what extent these concepts are expressed in everyday language and what role they have in shaping observable dynamics of social interactions. After annotating conversational text through crowdsourcing, we trained NLP tools to detect the presence of these types of interaction from conversations, and applied them to 160M messages written by geo-referenced Reddit users, 290k emails from the Enron corpus and 300k lines of dialogue from movie scripts. We show that social dimensions can be predicted purely from conversations with an AUC up to 0.98, and that the combination of the predicted dimensions suggests both the types of relationships people entertain (conflict vs. support) and the types of real-world communities (wealthy vs. deprived) they shape.

“…Despite its recognized quality, the dataset is not flawless either. Gaffney and Matias [10] report several inconsistencies in the data. For example, comment and post data before 2008 appears to be hugely corrupted, having around 80% of posts missing, as well as 90% posts information of one month data at the interface between 2009 and 2010.…”

Section: The Reddit Dataset (mentioning)

confidence: 99%

The Anatomy of Reddit: An Overview of Academic Research

Medvedev

Lambiotte

Delvenne

2019

Springer Proceedings in Complexity

Online forums provide rich environments where users may post questions and comments about different topics. Understanding how people behave in online forums may shed light on the fundamental mechanisms by which collective thinking emerges in a group of individuals, but it has also important practical applications, for instance to improve user experience, increase engagement or automatically identify bullying. Importantly, the datasets generated by the activity of the users are often openly available for researchers, in contrast to other sources of data in computational social science. In this survey, we map the main research directions that arose in recent years and focus primarily on the most popular platform, Reddit. We distinguish and categorise research depending on their focus on the posts or on the users, and point to different types of methodologies to extract information from the structure and dynamics of the system. We emphasize the diversity and richness of the research in terms of questions and methods, and suggest future avenues of research.

“…Using Reddit consists of reading and posting comments, which consist of informal text, primarily in English, each appearing within a particular subreddit, which we treat as a categorical feature providing useful contextual signal in characterizing users. We introduce a new benchmark author identification corpus derived from the API (Gaffney and Matias, 2018) containing Reddit comments [Footnote 2: One way to calculate cos(θ + m) from cos θ is cos θ cos m − sin θ sin m, where sin θ is calculated as √(1 − cos² θ). Note that this calculation discards the sign of θ.]…”

Section: Reddit Benchmark (mentioning)

confidence: 99%

Learning Invariant Representations of Social Media Users

Andrews

Bishop

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

The evolution of social media users' behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naïve approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users' invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.
