Exploratory Analysis of a Terabyte Scale Web Corpus
Preprint, 2014
DOI: 10.48550/arxiv.1409.5443

Cited by 2 publications (2 citation statements), published in 2020 and 2021; references 0 publications.

“…In fact, a recent survey on automatic web page classification has deemed the task difficult not only due to the complexity and heterogeneity of web content, but also due to its high computational cost, suggesting that machine learning (ML) approaches have much to contribute to it (Hashemi, 2020). While certain notable endeavors have indeed analyzed specific aspects of corpora such as the Common Crawl (Kolias et al, 2014; Caswell et al, 2021) and Wikipedia (Hube, 2017), they have only scratched the surface of what these bodies of text contain. For instance, recent work has found that the Common Crawl contained over 300,000 documents from unreliable news sites and banned subreddit pages containing hate speech and racism (Gehman et al, 2020), while complementary research has shown that individual training examples can be extracted by querying language models (Carlini et al, 2020), together illustrating that the presence of questionable content is a significant issue for statistical language models.…”
Section: Related Work
confidence: 99%
“…To extract the opinions from the vectors T_k, these vectors are fed into the DOC-ABSADeepL model. To represent each input word by its word embedding, we consider the Fasttext word embeddings [18] trained on Common Crawl [21]. The word embedding dimension is d = 300.…”
Section: Distilling Opinions at Criterion Level: DOC-ABSADeepL Model
confidence: 99%
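The embedding setup described in this citation statement (300-dimensional Fasttext vectors pre-trained on Common Crawl) can be sketched as follows. This is a minimal illustration only, assuming the publicly released cc.en.300.bin model file and the official fasttext Python package; the embed_tokens helper, the tokenization, and the example sentence are hypothetical, and the DOC-ABSADeepL model itself is not reproduced here.

```python
# Minimal sketch: mapping tokens to the 300-d Fasttext vectors pre-trained
# on Common Crawl, as referenced in the citation statement above.
# Assumptions: the official `fasttext` Python package is installed and the
# pre-trained file cc.en.300.bin has been downloaded from
# https://fasttext.cc/docs/en/crawl-vectors.html
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")  # Common Crawl, d = 300

def embed_tokens(tokens):
    """Return a (len(tokens), 300) matrix of word embeddings.

    Fasttext uses subword information, so out-of-vocabulary tokens
    still receive a vector rather than being dropped.
    """
    return np.stack([model.get_word_vector(t) for t in tokens])

# Hypothetical input: one tokenized review sentence.
vectors = embed_tokens(["the", "battery", "life", "is", "excellent"])
print(vectors.shape)  # -> (5, 300)
```

In the cited work, these per-token vectors are then fed into the DOC-ABSADeepL model; that downstream architecture is not shown here.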