Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.455

Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks…
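The abstract only describes BITE at a high level. As a rough, hypothetical sketch of base-inflection style tokenization (not the authors' implementation), the Python snippet below lemmatizes with spaCy and reinjects each word's Penn Treebank tag as a special inflection symbol; the paper's actual lemmatizer and symbol inventory may differ.

# Hypothetical sketch of base-inflection style encoding; the paper's actual
# implementation, lemmatizer, and symbol inventory may differ.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English pipeline is installed

# Penn Treebank tags used here as stand-ins for the paper's inflection symbols.
INFLECTED_TAGS = {"NNS", "VBD", "VBG", "VBN", "VBZ", "JJR", "JJS", "RBR", "RBS"}

def bite_encode(text):
    """Replace each inflected word with its base form plus an inflection symbol."""
    tokens = []
    for tok in nlp(text):
        if tok.tag_ in INFLECTED_TAGS and tok.lemma_.lower() != tok.text.lower():
            tokens.append(tok.lemma_)        # base form
            tokens.append(f"[{tok.tag_}]")   # grammatical information as a special symbol
        else:
            tokens.append(tok.text)
    return tokens

print(bite_encode("She walked two dogs"))
# e.g. ['She', 'walk', '[VBD]', 'two', 'dog', '[NNS]']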

Citations: Cited by 23 publications (17 citation statements)
References: 46 publications
“…Several recent studies have examined how the performance of PLMs is affected by their input segmentation. Tan et al. (2020) show that tokenizing inflected words into stems and inflection symbols allows BERT to generalize better on non-standard inflections. Bostrom and Durrett (2020) pretrain RoBERTa with different tokenization methods and find that tokenizations which align more closely with morphology perform better on a number of tasks.…”
Section: Related Work (mentioning)
confidence: 96%
“…2, but, within this data, we still observe many hallmark features of Singlish such as discourse markers and vocabulary from relevant languages. Tan et al. (2020) have also released a web crawler that collects posts from a popular Singaporean hardware forum, where discussion is often in Singlish. They use the resulting Singlish corpus as part of their work investigating the role of inflection in NLP for non-standard forms of English.…”
Section: Creoles and Corpora (mentioning)
confidence: 99%
“…In general, different dialects of English do not affect understanding for native English speakers as much as they affect current NLP systems. Tan et al. (2020) address this by proposing a new encoding scheme for word tokenization that better captures these variants. One could also apply OCR correction models that work at the token level to normalize such texts into Standard English.…”
Section: Related Work (mentioning)
confidence: 99%