2017 IEEE 13th Malaysia International Conference on Communications (MICC)
DOI: 10.1109/micc.2017.8311752
Understanding regional context of World Wide Web using common crawl corpus

Cited by 11 publications (7 citation statements). References 3 publications.
“…To test whether Sim(people, men) > Sim(people, women) at the level of collective concepts, we used word embeddings (13) extracted from the May 2017 Common Crawl corpus [CC-MAIN-2017-22; (41)], which contains a large cross-section of the internet: over 630 billion words from 2.96 billion web pages and 250 TiB of uncompressed content. Although the Common Crawl is not accompanied by documentation about its contents, it likely includes informal text (e.g., blogs and discussion forums) written by many individuals, as well as more formal text written by the media, corporations, and governments, mostly in English (42, 43). Using word embeddings extracted from this massive corpus, we computed the similarity in linguistic context between words (a proxy for the similarity between the concepts denoted) as the cosine of the angle between corresponding embeddings in vector space, or cosine similarity.…”
Section: Results
Mentioning confidence: 99%
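For illustration only, the cosine-similarity comparison described in this excerpt can be sketched as below. The vectors are hypothetical placeholders, not embeddings actually extracted from CC-MAIN-2017-22, and the function names are ours rather than the cited study's.

```python
# Minimal sketch of the Sim(people, men) vs. Sim(people, women) comparison.
# The embeddings below are illustrative placeholders; the cited study used
# vectors trained on the CC-MAIN-2017-22 Common Crawl snapshot.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings (real embeddings are typically ~300-d).
people = np.array([0.2, 0.5, 0.1, 0.7])
men    = np.array([0.3, 0.4, 0.2, 0.6])
women  = np.array([0.1, 0.6, 0.4, 0.3])

print("Sim(people, men)   =", cosine_similarity(people, men))
print("Sim(people, women) =", cosine_similarity(people, women))
```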
“…Recent investigations of the Common Crawl suggest that most of this corpus is written in English and is based on webpages generated within a year or two of their inclusion in the corpus (43). The 25 most prevalent websites in the 2019 version include sites on patent filings, news coverage, and peer-reviewed scientific publications (43), but more informal content such as travel blogs and personal websites is also represented (42).…”
Section: Word Embeddings (Step 2)
Mentioning confidence: 99%
“…The textual content of news and opinion articles from the outlets listed in Figure 1 is available on the outlets' online domains and/or in public cache repositories such as Google Cache, the Internet Archive's Wayback Machine (Notess 2002), and Common Crawl (Mehmood et al. 2017). The textual content included in our analysis is limited to the articles' headlines and main text and does not include other article elements such as figure captions.…”
Section: Methods
Mentioning confidence: 99%
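As an aside, and not part of the cited methodology, archived copies of a page can typically be located by querying Common Crawl's public CDX index. The endpoint, parameters, and field names below are assumptions based on Common Crawl's index service, and example.com is a placeholder URL pattern.

```python
# Hedged sketch: looking up captures of a site's pages in the Common Crawl
# CDX index (endpoint and field names assumed from Common Crawl's public
# index service, not taken from the cited paper).
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2017-22-index"

def find_captures(url_pattern: str, limit: int = 5):
    """Return up to `limit` index records matching the URL pattern."""
    resp = requests.get(
        INDEX,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns one JSON object per line.
    return [json.loads(line) for line in resp.text.splitlines() if line]

for record in find_captures("example.com/news/*"):
    print(record.get("timestamp"), record.get("url"), record.get("mime"))
```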
“…The Swiss-AL corpus contains 8 million texts and 1.55 billion tokens. Similarly, we built an Urdu-language corpus of 1.28 million Urdu webpages from a Common Crawl corpus of 2.87 billion webpages [40], [50].…”
Section: Related Work
Mentioning confidence: 99%
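The excerpt only reports the corpus sizes, not the selection procedure. Purely as a generic illustration of how Urdu pages might be filtered from Common Crawl WET records, here is a sketch assuming the warcio and langdetect packages; the file name and the 2000-character sample are placeholders, and this is not the cited paper's actual pipeline.

```python
# Generic sketch of filtering Urdu pages out of a Common Crawl WET file.
# Assumes the `warcio` and `langdetect` packages; this is NOT the cited
# paper's method, which is not described in the excerpt above.
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def urdu_pages(wet_path: str):
    """Yield (url, text) pairs whose extracted text is detected as Urdu."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET plain-text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            try:
                if detect(text[:2000]) == "ur":  # sample the first 2000 chars
                    yield url, text
            except LangDetectException:
                continue  # too little or undetectable text

for url, _ in urdu_pages("CC-MAIN-sample.warc.wet.gz"):
    print(url)
```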