Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.170

Changing the World by Changing the Data

Abstract: The NLP community is currently investing a lot more research and resources into the development of deep learning models than into training data. While we have made a lot of progress, it is now clear that our models learn all kinds of spurious patterns, social biases, and annotation artifacts. Algorithmic solutions have so far had limited success. An alternative that is being actively discussed is more careful design of datasets so as to deliver specific signals. This position paper maps out the arguments for and against data curation…

Cited by 37 publications (37 citation statements); references 42 publications (48 reference statements).
“…Web-scale textual or speech data is harder to obtain for Maltese compared to languages such as English or Mandarin, and we expect that similar challenges arise for many other under-resourced languages (Besacier et al., 2014). Furthermore, as recent critiques of large-scale pretraining approaches have emphasised, even where web-scale data is available, there are significant risks arising from its 'unfathomable' nature, not least that it is likely to be extremely noisy, while not guaranteeing representativeness across demographic or ethnic groups, and/or across language varieties (Bender et al., 2021; Rogers, 2021). Lastly, the computational resources needed for such experiments are not available to all research teams.…”
Section: Introduction
confidence: 93%
“…One aim of this work is to document frequently occurring failures in evaluation processes, many of which can be directly addressed or pointed out in reviews. In the future, we also suggest creating model evaluation checklists like those by Rogers et al. (2021) for responsible data use or Dodge et al. (2019) for reporting hyperparameters and compute infrastructure.…”
Section: Model Audits and Evaluation Reports
confidence: 99%
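The excerpt above mentions checklists for reporting hyperparameters and compute infrastructure. As a purely illustrative sketch, not the checklist actually proposed by Rogers et al. (2021) or Dodge et al. (2019), and with all field names below chosen as assumptions, such a report could be captured as a small data structure that flags unreported items:

```python
# Hypothetical sketch of a machine-readable evaluation report; field names
# are illustrative assumptions, not a published checklist.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvaluationReport:
    model_name: str
    dataset_name: str
    dataset_version: str
    metric: str
    score: float
    num_hyperparameter_trials: int           # search budget actually spent
    hyperparameters: dict = field(default_factory=dict)
    compute_infrastructure: str = ""          # e.g. GPU type, count, hours
    data_license_checked: bool = False        # responsible-data-use style item
    known_artifacts_documented: bool = False  # e.g. known annotation artifacts

    def missing_items(self) -> list[str]:
        """Return checklist items that are still unreported."""
        missing = []
        if not self.hyperparameters:
            missing.append("hyperparameters")
        if not self.compute_infrastructure:
            missing.append("compute_infrastructure")
        if not self.data_license_checked:
            missing.append("data_license_checked")
        if not self.known_artifacts_documented:
            missing.append("known_artifacts_documented")
        return missing


report = EvaluationReport(
    model_name="baseline-bilstm",
    dataset_name="example-nli",
    dataset_version="1.0",
    metric="accuracy",
    score=0.82,
    num_hyperparameter_trials=20,
    hyperparameters={"lr": 1e-3, "batch_size": 32},
    compute_infrastructure="1x V100, ~6 GPU hours",
)
print("Unreported items:", report.missing_items())
print(json.dumps(asdict(report), indent=2))
```

The design choice here is simply that an explicit structure makes omissions detectable at review time, which is the spirit of the reporting checklists cited above.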
“…The argument for curating datasets: whether or not datasets should be curated to alter underlying distributions is a foundational issue. Rogers [45] summarizes the arguments for and against curation that followed the publication of work by Bender et al. [7], which argued strongly for curation. One of the core questions is whether we should study the world as it is or the world as we want it to be, where "world" refers to extant sources of data, such as Wikipedia.…”
Section: Synthetic Dataset Creation and Augmentation
confidence: 99%
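For concreteness, one mechanical reading of "curating a dataset to alter its underlying distribution" is downsampling overrepresented groups so that each group is equally frequent. The sketch below is only an assumption-laden illustration of that reading; the grouping key and data are hypothetical, and neither Rogers (2021) nor Bender et al. (2021) prescribe this particular procedure.

```python
# Illustrative sketch: rebalance a corpus by downsampling each group to the
# size of the smallest group. Grouping key and example data are hypothetical.
import random
from collections import defaultdict


def rebalance(examples, group_key, seed=0):
    """Downsample each group to the size of the smallest group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[group_key]].append(ex)
    target = min(len(items) for items in by_group.values())
    curated = []
    for items in by_group.values():
        curated.extend(rng.sample(items, target))
    rng.shuffle(curated)
    return curated


# Toy corpus with a 9:1 skew between two language varieties.
corpus = (
    [{"text": f"sentence {i}", "variety": "en-US"} for i in range(900)]
    + [{"text": f"sentence {i}", "variety": "en-IN"} for i in range(100)]
)
balanced = rebalance(corpus, group_key="variety")
print(len(balanced), "examples after curation")  # 200: 100 per variety
```

Whether such rebalancing reflects "the world as we want it to be" rather than "the world as it is" is exactly the foundational question the excerpt raises.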