Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.4

CharBERT: Character-aware Pre-trained Language Model

Abstract: Most pre-trained language models (PLMs) construct word representations at the subword level with Byte-Pair Encoding (BPE) or its variations, by which OOV (out-of-vocabulary) words are almost entirely avoided. However, these methods split a word into subword units, making the representation incomplete and fragile. In this paper, we propose a character-aware pre-trained language model named CharBERT, improving on previous methods (such as BERT and RoBERTa) to tackle these problems. We first construct the contextual word embedding…
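
As a rough illustration of the subword fragility the abstract refers to, the sketch below (not from the paper; it assumes the `transformers` library with the standard `bert-base-uncased` WordPiece vocabulary, and the example words are arbitrary) tokenizes a word, an inflected form, and a misspelling:

```python
# Minimal sketch of the "incomplete and fragile" subword issue described in
# the abstract. Assumes the `transformers` library is installed and downloads
# the standard "bert-base-uncased" WordPiece vocabulary (a BPE variation).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# correct word, inflected form, and a misspelling (arbitrary example words)
for word in ["tokenization", "tokenizations", "tokeniaztion"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r:16} -> {pieces}")

# Exact splits depend on the learned vocabulary; typically the correct word
# maps to one or two pieces, while a single transposed character shatters the
# misspelling into several short, unrelated fragments.
```

That sensitivity of the subword decomposition to small character-level changes is the fragility CharBERT is designed to mitigate.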

Cited by 41 publications (36 citation statements) | References 27 publications
“…Bostrom and Durrett (2020) pretrain RoBERTa with different tokenization methods and find tokenizations that align more closely with morphology to perform better on a number of tasks. Ma et al. (2020) show that providing BERT with character-level information also leads to enhanced performance. Relatedly, studies from automatic speech recognition have demonstrated that morphological decomposition improves the perplexity of language models (Fang et al., 2015; Jain et al., 2020).…”
Section: Related Work
mentioning, confidence: 99%
“…Several works propose to optimize subword-sensitive word encoding methods for pretrained language models. Ma et al. (2020) use convolutional neural networks (Kim, 2014) on characters to calculate word representations. Zhang and Li (2020) propose to add phrases into the vocabulary for Chinese pretrained language models.…”
Section: Related Work
mentioning, confidence: 99%
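
The statement above describes a Kim (2014)-style character CNN for building word representations. The PyTorch sketch below is a generic illustration of that idea under assumed shapes and hyperparameters; it is not the authors' released code, and CharBERT's own character module differs in detail.

```python
# Minimal sketch (assumed names and dimensions) of a character CNN that maps
# the characters of one word to a fixed-size word vector via convolution and
# max-over-time pooling, as in the citation statement above.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=256, char_dim=16, n_filters=128, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size, padding=1)

    def forward(self, char_ids):             # (batch, word_len) character IDs
        x = self.char_emb(char_ids)          # (batch, word_len, char_dim)
        x = x.transpose(1, 2)                # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))         # (batch, n_filters, word_len)
        return x.max(dim=-1).values          # max-over-time pool -> (batch, n_filters)

encoder = CharCNNWordEncoder()
word = "misspeling"
ids = torch.tensor([[min(ord(c), 255) for c in word]])  # naive character-to-ID mapping
print(encoder(ids).shape)                    # torch.Size([1, 128])
```

Because every character contributes to the pooled vector, a single typo perturbs the representation only slightly instead of changing the subword decomposition entirely.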
“…Bostrom and Durrett (2020) empirically compare several popular word segmentation algorithms for pretrained language models of a single language. Several works propose to use different representation granularities, such as phrase-level segmentation (Zhang and Li, 2020) or character-aware representations (Ma et al., 2020) for pretrained language models of a single high-resource language, such as English or Chinese only. However, it is not a foregone conclusion that methods designed and tested on monolingual models will be immediately applicable to multilingual representations.…”
Section: Introduction
mentioning, confidence: 99%
“…Furthermore, the authors claim that it is more robust to noise and misspellings. In the same vein, Ma et al. (2020a) combined character-aware and subword-based information to improve robustness to spelling errors. This initiated a new wave of tokenizer-free models based on characters or bytes (Tay et al., 2021; Xue et al., 2021; Clark et al., 2021).…”
Section: Tokenization and Character-based Models
mentioning, confidence: 99%
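
For context on the tokenizer-free models mentioned above, the toy sketch below shows the general byte-level input idea: text is mapped directly to UTF-8 byte IDs rather than subword IDs, so a misspelling only perturbs a few positions. The reserved-ID offset is an assumption for illustration and does not reproduce any specific model's exact convention.

```python
# Toy sketch of a byte-level input representation (assumed special-token
# layout, not any specific model's convention).
SPECIAL = {"<pad>": 0, "<eos>": 1, "<unk>": 2}
OFFSET = len(SPECIAL)

def bytes_to_ids(text: str) -> list[int]:
    # Shift raw UTF-8 byte values past the reserved special IDs.
    return [b + OFFSET for b in text.encode("utf-8")]

def ids_to_bytes(ids: list[int]) -> str:
    # Drop special IDs and undo the shift.
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8", errors="replace")

ids = bytes_to_ids("CharBERT")
print(ids)                # [70, 107, 100, 117, 69, 72, 85, 87]
print(ids_to_bytes(ids))  # 'CharBERT'
```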