2020
DOI: 10.48550/arxiv.2003.08444
Preprint

NELA-GT-2019: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Abstract: In this paper, we present an updated version of the NELA-GT-2018 dataset (Nørregaard, Horne, and Adalı 2019), entitled NELA-GT-2019. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity. The …

Citations: cited by 6 publications (9 citation statements)
References: 1 publication
“…In terms of content, the data collections most similar to the MIND corpus are the Emergent [12], NELA-GT Series [13,14,15] and FacebookHoax [9] datasets. The Emergent collection comprises news articles covering war conflicts, politics and business/technology, and it was created to study how online media handled with unverified information.…”
Section: Related Work (mentioning)
confidence: 99%
“…Wang (2017); Shu et al (2017) collected manually labeled statements or news articles from fact-checking websites. The NELA datasets (Horne et al, 2018b; Nørregaard et al, 2019; Gruppi et al, 2020) scrape news articles directly from news outlets and use the manually annotated labels from Media Bias/Fact Check (MBFC) as site-level annotations. Social media is also a popular resource for collecting news stories (Nakamura et al, 2020; Santia and Williams, 2018; Mitra and Gilbert, 2015).…”
Section: Related Work (mentioning)
confidence: 99%
“…Thus we focus our analysis on other potentially confounding factors in the dataset. We use the latest aggregated site-level labels provided in NELA-GT-2019 (Gruppi et al, 2020) and report both the article- and site-level accuracy. For article-level accuracy, we assign the site-level label to all articles from that news outlet and calculate per-article accuracy.…”
Section: Logistic Regression (LR) (mentioning)
confidence: 99%
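
The article-level and site-level evaluation described in the citation statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' code: the function names, the dict-based article records, and the toy data are hypothetical, and the use of a majority vote over article predictions to obtain a site-level prediction is an assumption, since the excerpt does not say how site-level accuracy is computed.

```python
from collections import Counter, defaultdict


def project_site_labels(articles, site_labels):
    """Give every article the label of its source outlet (distant supervision)."""
    return [site_labels[a["source"]] for a in articles]


def article_and_site_accuracy(articles, site_labels, predictions):
    """Return (article-level accuracy, site-level accuracy).

    Article level: each prediction is compared with the projected site label.
    Site level (assumption): the majority vote of an outlet's article
    predictions is compared with that outlet's label.
    """
    gold = project_site_labels(articles, site_labels)
    article_acc = sum(p == g for p, g in zip(predictions, gold)) / len(gold)

    votes = defaultdict(Counter)
    for article, pred in zip(articles, predictions):
        votes[article["source"]][pred] += 1

    site_correct = sum(
        counts.most_common(1)[0][0] == site_labels[site]
        for site, counts in votes.items()
    )
    return article_acc, site_correct / len(votes)


# Hypothetical toy data: two outlets with three articles each.
site_labels = {"outlet_a": "reliable", "outlet_b": "unreliable"}
articles = [{"source": "outlet_a"}] * 3 + [{"source": "outlet_b"}] * 3
predictions = ["reliable", "reliable", "unreliable",
               "unreliable", "unreliable", "reliable"]

article_acc, site_acc = article_and_site_accuracy(articles, site_labels, predictions)
print(f"article-level: {article_acc:.3f}, site-level: {site_acc:.3f}")
```

In this toy case four of six articles are classified correctly, while both outlets are classified correctly by majority vote, so the two accuracies differ. That gap is why reporting both levels can be informative.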
“…Most existing datasets for predicting the political ideology at the news article level were created by crawling the RSS feeds of news websites with known political bias (Kulkarni et al, 2018), and then projecting the bias label from a website to all articles crawled from it, which is a form of distant supervision. The crawling could be also done using text search APIs rather than RSS feeds (Horne et al, 2019;Gruppi et al, 2020).…”
Section: Related Work (mentioning)
confidence: 99%