Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.455

Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks…
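The abstract only describes BITE at a high level. As a rough, hypothetical sketch of base-inflection style tokenization (not the authors' implementation), the Python snippet below lemmatizes with spaCy and reinjects each word's Penn Treebank tag as a special inflection symbol; the paper's actual lemmatizer and symbol inventory may differ.

# Hypothetical sketch of base-inflection style encoding; the paper's actual
# implementation, lemmatizer, and symbol inventory may differ.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English pipeline is installed

# Penn Treebank tags used here as stand-ins for the paper's inflection symbols.
INFLECTED_TAGS = {"NNS", "VBD", "VBG", "VBN", "VBZ", "JJR", "JJS", "RBR", "RBS"}

def bite_encode(text):
    """Replace each inflected word with its base form plus an inflection symbol."""
    tokens = []
    for tok in nlp(text):
        if tok.tag_ in INFLECTED_TAGS and tok.lemma_.lower() != tok.text.lower():
            tokens.append(tok.lemma_)        # base form
            tokens.append(f"[{tok.tag_}]")   # grammatical information as a special symbol
        else:
            tokens.append(tok.text)
    return tokens

print(bite_encode("She walked two dogs"))
# e.g. ['She', 'walk', '[VBD]', 'two', 'dog', '[NNS]']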

Citations: Cited by 23 publications (17 citation statements)
References: 46 publications
“…Several recent studies have examined how the performance of PLMs is affected by their input segmentation. Tan et al. (2020) show that tokenizing inflected words into stems and inflection symbols allows BERT to generalize better on non-standard inflections. Bostrom and Durrett (2020) pretrain RoBERTa with different tokenization methods and find that tokenizations which align more closely with morphology perform better on a number of tasks.…”
Section: Related Work (mentioning)
confidence: 96%
“…2, but, within this data, we still observe many hallmark features of Singlish such as discourse markers and vocabulary from relevant languages. Tan et al. (2020) have also released a web crawler that collects posts from a popular Singaporean hardware forum, where discussion is often in Singlish. They use the resulting Singlish corpus as part of their work investigating the role of inflection in NLP for non-standard forms of English.…”
Section: Creoles and Corpora (mentioning)
confidence: 99%
“…In general, different dialects of English do not affect understanding for native English speakers as much as they affect current NLP systems. Tan et al. (2020) address this by proposing a new encoding scheme for word tokenization that better captures these variants. One could also apply OCR correction models that work at the token level to normalize such texts into Standard English.…”
Section: Related Work (mentioning)
confidence: 99%