2021
DOI: 10.1007/s10579-021-09551-7
LanguageCrawl: a generic tool for building language models upon Common Crawl

Abstract: The exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume consisting of hundreds of billions of words is unlikely to be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive—an open repository of web crawl information, which contains petabytes of data. We…


Cited by 9 publications (4 citation statements)
References 17 publications (21 reference statements)
“…This approach resembled the semi-supervised learning (SSL) technique [37,38], in which a large unannotated dataset was assigned labels based on a classifier trained on a much smaller annotated dataset. We took the Polish subset of the Common Crawl archive, called hereinafter pCC , by filtering the whole set with the LanguageCrawl toolkit [39]. It resulted in a few billion web pages with some Polish content.…”
Section: Terabot-Therapeutic Spoken Dialogue System
Citation type: mentioning; confidence: 99%
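The pseudo-labelling step described in the quote above (a classifier trained on a small annotated set assigns labels to a much larger unannotated pool) can be sketched in a few lines. The nearest-centroid classifier and the toy one-dimensional data below are invented purely for illustration; they are not the classifier or data used in the cited work.

```python
# Toy sketch of semi-supervised pseudo-labelling: fit a classifier on a
# small labelled sample, then use it to label a larger unlabelled pool.
# The nearest-centroid model and all values here are hypothetical.

def train_centroids(labelled):
    # labelled: list of (feature_value, label); compute one centroid per label.
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def pseudo_label(centroids, unlabelled):
    # Assign each unlabelled point the label of its nearest centroid.
    return [(x, min(centroids, key=lambda y: abs(x - centroids[y])))
            for x in unlabelled]

labelled = [(0.1, "neg"), (0.2, "neg"), (0.9, "pos"), (1.0, "pos")]
unlabelled = [0.05, 0.3, 0.8, 0.95]  # the "large" unannotated pool

centroids = train_centroids(labelled)
print(pseudo_label(centroids, unlabelled))
# [(0.05, 'neg'), (0.3, 'neg'), (0.8, 'pos'), (0.95, 'pos')]
```

The same pattern scales up: in the cited setting, the "unlabelled pool" is the Polish Common Crawl subset and the classifier is trained on a much smaller annotated corpus.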
“…n-grams are capable of displaying various word clusters ranging in length from 2 words (2-grams) to 4 words (4-grams). n-grams are valuable for various purposes, such as improving the accuracy of speech recognition, spell checking, or machine translation systems (Roziewski & Kozłowski, 2021). Besides presenting a list of word groups, the n-gram table also provides the frequency of occurrence of each word group in the corpus as well as the number of texts containing that word group.…”
Section: The Meaning of Cacat, Difabel and Disabilitas
Citation type: mentioning; confidence: 99%
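The n-gram table described in the quote above (word clusters with their corpus frequency and the number of texts containing them) can be reproduced with the standard library alone. The tiny two-sentence corpus below is invented for illustration; it is not drawn from any corpus mentioned in the cited works.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mini-corpus, standing in for a web-scale one.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
]

counts = Counter()            # total occurrences of each n-gram
texts_containing = Counter()  # number of texts containing each n-gram
for text in corpus:
    grams = ngrams(text.split(), 2)
    counts.update(grams)
    texts_containing.update(set(grams))

print(counts[("the", "quick")])            # 2: corpus frequency
print(texts_containing[("the", "quick")])  # 2: texts containing the 2-gram
```

Changing the `n` argument from 2 to 4 yields the 2-gram through 4-gram tables the quote refers to; `Counter.most_common()` then sorts each table by frequency.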
“…Finally, while larger corpora generally result in better models (Kaplan et al, 2020; Sun et al, 2017), data quality and corpora content also play a major role in the caliber and appropriateness of these models for the various downstream applications (Florez, 2019; Abid et al, 2021; Bhardwaj et al, 2021). To produce high quality and safe neural language models will likely require the community to adopt more mindful data collection practices (Gehman et al, 2020; Bender and Friedman, 2018; Gebru et al, 2018; Jo and Gebru, 2020; Paullada et al, 2020; Bender et al, 2021), establish standardized filtering pipelines for corpora (Roziewski and Stokowiec, 2016; Ortiz Suarez et al, 2019; Wenzek et al, 2020), and develop methods for evaluating the bias in trained models (Schick et al, 2021). We recognize that this is not a straightforward task with a one-size-fits-all solution, but we propose that as much attention should be dedicated to the corpora used for training language models as to the models themselves, and that corpora transparency is a prerequisite for language model accountability.…”
Section: Future Work
Citation type: mentioning; confidence: 99%