“…In fact, a recent survey on automatic web page classification deemed the task difficult not only because of the complexity and heterogeneity of web content, but also because of its high computational cost, suggesting that machine learning (ML) approaches have much to contribute to it (Hashemi, 2020). While certain notable endeavors have indeed analyzed specific aspects of corpora such as the Common Crawl (Kolias et al., 2014; Caswell et al., 2021) and Wikipedia (Hube, 2017), they have only scratched the surface of what these bodies of text contain. For instance, recent work found that the Common Crawl contained over 300,000 documents from unreliable news sites and banned subreddit pages containing hate speech and racism (Gehman et al., 2020), while complementary research showed that individual training examples can be extracted by querying language models (Carlini et al., 2020); together, these findings illustrate that the presence of questionable content is a significant issue for statistical language models.…”