2017 IEEE 13th Malaysia International Conference on Communications (MICC)
DOI: 10.1109/micc.2017.8311752
Understanding regional context of World Wide Web using common crawl corpus

Cited by 11 publications (7 citation statements). References 3 publications.
“…To test whether Sim(people, men) > Sim(people, women) at the level of collective concepts, we used word embeddings (13) extracted from the May 2017 Common Crawl corpus [CC-MAIN-2017-22; (41)], which contains a large cross-section of the internet: over 630 billion words from 2.96 billion web pages and 250 TiB of uncompressed content. Although the Common Crawl is not accompanied by documentation about its contents, it likely includes informal text (e.g., blogs and discussion forums) written by many individuals, as well as more formal text written by the media, corporations, and governments, mostly in English (42, 43). Using word embeddings extracted from this massive corpus, we computed the similarity in linguistic context between words (a proxy for the similarity between the concepts denoted) as the cosine of the angle between corresponding embeddings in vector space, or cosine similarity.…”
Section: Results
Mentioning confidence: 99%
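For illustration only, the cosine-similarity comparison described in this excerpt can be sketched as below. The vectors are hypothetical placeholders, not embeddings actually extracted from CC-MAIN-2017-22, and the function names are ours rather than the cited study's.

```python
# Minimal sketch of the Sim(people, men) vs. Sim(people, women) comparison.
# The embeddings below are illustrative placeholders; the cited study used
# vectors trained on the CC-MAIN-2017-22 Common Crawl snapshot.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings (real embeddings are typically ~300-d).
people = np.array([0.2, 0.5, 0.1, 0.7])
men    = np.array([0.3, 0.4, 0.2, 0.6])
women  = np.array([0.1, 0.6, 0.4, 0.3])

print("Sim(people, men)   =", cosine_similarity(people, men))
print("Sim(people, women) =", cosine_similarity(people, women))
```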
“…Recent investigations of the Common Crawl suggest that most of this corpus is written in English and is based on webpages generated within a year or two of their inclusion in the corpus (43). The 25 most prevalent websites in the 2019 version include sites on patent filings, news coverage, and peer-reviewed scientific publications (43), but more informal content such as travel blogs and personal websites is also represented (42).…”
Section: Word Embeddings (Step 2)
Mentioning confidence: 99%
“…The textual content of news and opinion articles from the outlets listed in Figure 1 is available on the outlets' online domains and/or in public cache repositories such as Google Cache, the Internet Archive's Wayback Machine (Notess 2002), and Common Crawl (Mehmood et al. 2017). The textual content included in our analysis is limited to the articles' headlines and main text and does not include other article elements such as figure captions.…”
Section: Methods
Mentioning confidence: 99%
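As an aside, and not part of the cited methodology, archived copies of a page can typically be located by querying Common Crawl's public CDX index. The endpoint, parameters, and field names below are assumptions based on Common Crawl's index service, and example.com is a placeholder URL pattern.

```python
# Hedged sketch: looking up captures of a site's pages in the Common Crawl
# CDX index (endpoint and field names assumed from Common Crawl's public
# index service, not taken from the cited paper).
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2017-22-index"

def find_captures(url_pattern: str, limit: int = 5):
    """Return up to `limit` index records matching the URL pattern."""
    resp = requests.get(
        INDEX,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns one JSON object per line.
    return [json.loads(line) for line in resp.text.splitlines() if line]

for record in find_captures("example.com/news/*"):
    print(record.get("timestamp"), record.get("url"), record.get("mime"))
```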
“…The Swiss-AL corpus contains 8 million texts and 1.55 billion tokens. Similarly, we built an Urdu-language corpus of 1.28 million Urdu webpages from a Common Crawl corpus of 2.87 billion webpages [40], [50].…”
Section: Related Work
Mentioning confidence: 99%
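The excerpt only reports the corpus sizes, not the selection procedure. Purely as a generic illustration of how Urdu pages might be filtered from Common Crawl WET records, here is a sketch assuming the warcio and langdetect packages; the file name and the 2000-character sample are placeholders, and this is not the cited paper's actual pipeline.

```python
# Generic sketch of filtering Urdu pages out of a Common Crawl WET file.
# Assumes the `warcio` and `langdetect` packages; this is NOT the cited
# paper's method, which is not described in the excerpt above.
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def urdu_pages(wet_path: str):
    """Yield (url, text) pairs whose extracted text is detected as Urdu."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET plain-text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            try:
                if detect(text[:2000]) == "ur":  # sample the first 2000 chars
                    yield url, text
            except LangDetectException:
                continue  # too little or undetectable text

for url, _ in urdu_pages("CC-MAIN-sample.warc.wet.gz"):
    print(url)
```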