On Language Models for Creoles

Lent, Heather; Bugliarello, Emanuele; Lhoneux, Miryam de; Chen, Qiu; Søgaard, Anders

doi:10.18653/v1/2021.conll-1.5

Cited by 5 publications

(8 citation statements)

References 29 publications

(34 reference statements)

“…We additionally contribute to the body of work seeking to characterize and adapt neural model performance on rare or novel examples and classes (Horn & Perona, 2017; Bengio, 2015). In the context of language modeling, Lent et al (2021) explored performance on under-resourced languages, whereas Oren et al (2019) did so on under-represented domains in training corpora. McCoy et al (2021) introduced analyses to assess sequential and syntactic novelty in LMs.…”

Section: Related Work (mentioning)

confidence: 99%

“…Meister & Cotterell (2021), for example, investigated the statistical tendencies of the distribution defined by neural LMs, whereas Kulikov et al (2021) explored whether they adequately capture the modes of the distribution they attempt to model. At the same time, increased focus has been given to performance on rare or novel events in the data distribution, both for models of natural language (McCoy et al, 2021; Lent et al, 2021; Dudy & Bedrick, 2020; Oren et al, 2019) and neural models more generally (see, for example, Sagawa et al, 2020; D'souza et al, 2021; Blevins & Zettlemoyer, 2020; Czarnowska et al, 2019; Horn & Perona, 2017; Ouyang et al, 2016; Bengio, 2015; Zhu et al, 2014). Neither of these branches of work, however, has explored instance-level LM performance on rare sequences in the distribution.…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating Distributional Distortion in Neural Language Modeling

LeBrun, Sordoni, O’Donnell

2022

Preprint

A fundamental characteristic of natural language is the high rate at which speakers produce novel expressions. Because of this novelty, a heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language (Baayen, 2001). Standard language modeling metrics such as perplexity quantify the performance of language models (LM) in aggregate. As a result, we have relatively little understanding of whether neural LMs accurately estimate the probability of sequences in this heavy-tail of rare events. To address this gap, we develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages from which we can exactly compute sequence probabilities. Training LMs on generations from these artificial languages, we compare the sequence-level probability estimates given by LMs to the true probabilities in the target language. Our experiments reveal that LSTM and Transformer language models (i) systematically underestimate the probability of sequences drawn from the target language, and (ii) do so more severely for less-probable sequences. Investigating where this probability mass went, (iii) we find that LMs tend to overestimate the probability of ill-formed (perturbed) sequences. In addition, we find that this underestimation behaviour (iv) is weakened, but not eliminated by greater amounts of training data, and (v) is exacerbated for target distributions with lower entropy.

“…Hagemeijer et al (2014) presents an extensive overview of Creole data resources through 2014 for a wide variety of Creoles, many of which are more traditional corpora, (e.g., transcriptions of conversations made by linguists with formal training, or scans of documents originally written in the Creole language); though these may not have the relevant annotations for common NLP tasks. Lent et al (2021) also provides a thorough overview of existing NLP datasets for Haitian Kreyol, Singaporean Colloquial English (Singlish), and Nigerian Pidgin English. In this work, we set about the task of manually verifying each dataset presented by Hagemeijer et al (2014) and Lent et al (2021), as well as searching for additional resources.…”

Section: Creole Data and Creole NLP (mentioning)

confidence: 99%

“…Lent et al (2021) also provides a thorough overview of existing NLP datasets for Haitian Kreyol, Singaporean Colloquial English (Singlish), and Nigerian Pidgin English. In this work, we set about the task of manually verifying each dataset presented by Hagemeijer et al (2014) and Lent et al (2021), as well as searching for additional resources. We present all "verified" datasets in Table 1.…”

Section: Creole Data and Creole NLP (mentioning)

confidence: 99%

“…Notably, Murawaki (2016) use APiCS features (Michaelis et al, 2013) to encode Creoles, and utilize different approaches for language evolution modeling, to reach the final conclusion that Creoles are not typologically distinct from non-Creole languages. Meanwhile, Lent et al (2021) explored the question of how to effectively build language models for three Creole languages (Haitian Kreyol, Singaporean Colloquial English, and Nigerian Pidgin). Their approach involved experimenting with distributionally robust objectives (Oren et al, 2019), to ascertain whether data from a Creole's "parent" languages could help the language model to be more robust.…”

Section: Creole Data and Creole NLP (mentioning)

confidence: 99%

What a Creole Wants, What a Creole Needs

Lent, Ogueji, Lhoneux, et al.

2022

Preprint

Self Cite

In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma, despite these languages having sizable and vibrant communities. We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another, even when the languages are considered to be very similar to each other, as with Creoles. We discuss the prominent themes arising from these conversations, and ultimately demonstrate that useful language technology cannot be built without involving the relevant community.

Challenges and Strategies in Cross-Cultural NLP

Hershcovich, Frank, Lent, et al.

2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.
