2020
DOI: 10.1609/icwsm.v14i1.7354

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

Abstract: This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and…

Cited by 57 publications (38 citation statements)
References 15 publications
“…The data collection process is more challenging than those utilized in recent works on 4chan (Papasavva et al 2020) because many of the alternative boards do not have publicly available APIs (Application Program Interfaces). To overcome this, we obtain the catalog of active threads in each board by systematically browsing the index pages and scraping the identifiers of the most recent posts being discussed, as well as the threads they belonged to.…”
Section: Methods (mentioning)
confidence: 99%
“…(Aliapoulios et al 2021) published a dataset consisting of 183M posts and 13.25M user profiles from Parler, a Twitter alternative. Last, (Papasavva et al 2020) present a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our dataset provides several opportunities to the research community. First, Voat was evidently the place many banned users and communities moved to after being banned from other platforms (Papasavva et al 2020;Chandrasekharan et al 2017). To this end, our dataset can assist researchers that focus on deplatforming and user migration.…”
mentioning
confidence: 99%
“…BERT (Devlin et al 2018), a recent language model from Google, outperforms other traditional techniques like neural networks (Huang, Ou, and Carley 2018) in many NLP tasks including sentiment analysis because it has the ability to capture the context around words. While there are many variations of ABSA with BERT, we choose (Xu et al 2019) as our implementation due to a simplicity while yielding reasonable accuracy when compared to a very complex model like (Rietzler et al 2019).…”
Section: Word (mentioning)
confidence: 99%
“…There are several studies trying to understand general activities in web-based discussion forums. (Hine et al 2017), (Papasavva et al 2020) and (Thukral et al 2018) work on understanding properties, trends and characteristics of forums like ephemerality, heavy-tail and anonymity on posts, threads and users. Some focus on specific tasks in forums (Macdonald et al 2015), (Munger et al 2015) (Shrestha et al 2019) and try to identify main actors like hacker users, depressed users and influential users using a variety of techniques including linguistics, behavioral modeling on user activities and graph-based approaches.…”
Section: Related Work (mentioning)
confidence: 99%