“…In fact, a recent survey on automatic web page classification deemed the task difficult not only because of the complexity and heterogeneity of web content, but also because of its high computational cost, suggesting that machine learning (ML) approaches have much to contribute to it (Hashemi, 2020). While certain notable endeavors have indeed analyzed specific aspects of corpora such as the Common Crawl (Kolias et al., 2014; Caswell et al., 2021) and Wikipedia (Hube, 2017), they have only scratched the surface of what these bodies of text contain. For instance, recent work found that the Common Crawl contained over 300,000 documents from unreliable news sites and banned subreddit pages containing hate speech and racism (Gehman et al., 2020), while complementary research showed that individual training examples can be extracted by querying language models (Carlini et al., 2020); together, these findings illustrate that the presence of questionable content is a significant issue for statistical language models.…”