Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.40

Multi-view Subword Regularization

Abstract: Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations…
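
As a rough illustration of the subword regularization the abstract refers to (Kudo, 2018), the sketch below samples a different segmentation of the same sentence on each pass using the SentencePiece library; the model file path and example text are placeholders, not artifacts of the paper.

```python
import sentencepiece as spm

# Load a trained SentencePiece model; "spm.model" is a placeholder path.
sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "unbelievably good results"

# Deterministic (1-best) segmentation, as typically used at inference time.
print(sp.encode(text, out_type=str))

# Subword regularization (Kudo, 2018): sample alternative segmentations from
# the unigram LM lattice so that training sees multiple tokenizations.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```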

Cited by 18 publications (18 citation statements) · References 31 publications
“…On the other hand, many adaptation techniques have focused on improving representation of the target language by modifying the model's vocabulary or tokenization schemes (Chung et al., 2020; Clark et al., 2021; Wang et al., 2021). This is well-motivated: Artetxe et al. (2020) emphasize representation in the vocabulary as a key factor for effective cross-lingual transfer, while Rust et al. (2021) find that mBERT's tokenization scheme for many languages is subpar.…”
Section: Related Work (mentioning)
confidence: 99%
“…Consequently, models such as multilingual BERT [Devlin et al., 2019] and XLM-R [Conneau et al., 2020] employ the same subword tokenization algorithms as monolingual models, now applied to a massively multilingual corpus. In the multilingual setting, the problems of subword-based tokenization are exacerbated: tokens in languages with little data are over-segmented while high-frequency tokens are under-segmented, which limits cross-lingual transfer [Wang et al., 2021]. This motivates our work as well as recent work on character-level models.…”
Section: Related Work (mentioning)
confidence: 99%
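
To make the over- and under-segmentation issue quoted above concrete, the sketch below compares tokenizer fertility (subword pieces per whitespace-separated word) under mBERT's tokenizer via the HuggingFace transformers library, in the spirit of Rust et al. (2021); the example sentences are illustrative assumptions, not data from any of the cited papers.

```python
from transformers import AutoTokenizer

# mBERT's shared WordPiece tokenizer, trained on ~100 languages.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(sentence: str) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = sentence.split()
    pieces = tok.tokenize(sentence)
    return len(pieces) / len(words)

# Illustrative comparison: a lower-resource language typically shows a
# noticeably higher fertility (more aggressive segmentation) than English.
print(fertility("The committee approved the new regulations yesterday."))
print(fertility("Kamati iliidhinisha kanuni mpya jana."))  # Swahili example
```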
“…To make a model more robust to morphological and compositional generalization, probabilistic segmentation algorithms such as subword regularization [Kudo, 2018] and BPE-dropout [Provilkov et al., 2020] have been proposed, which sample different segmentations during training. Recent methods make models more robust for downstream tasks by enforcing prediction consistency between deterministic and probabilistic segmentations [Wang et al., 2021], or by updating the tokenizer based on the downstream loss under different segmentations [Hiraoka et al., 2020, 2021]. He et al. [2020] proposed DPE (dynamic programming encoding), a subword segmentation algorithm based on dynamic programming.…”
Section: Related Work (mentioning)
confidence: 99%
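
A minimal sketch of the consistency training mentioned in the statement above, assuming a generic classification model in PyTorch: the two "views" are a deterministic and a sampled segmentation of the same inputs, and a symmetric KL term pushes their predictions together. The function name, batch format, and loss weight are placeholders, and this is an approximation of the idea in Wang et al. (2021), not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multi_view_loss(model, det_batch, sampled_batch, labels, consistency_weight=1.0):
    """Supervised loss on two segmentation views of the same sentences,
    plus a consistency term that encourages their predictions to agree.

    det_batch / sampled_batch: token ids from a deterministic and a sampled
    (subword-regularized) segmentation of the same inputs; `model` is any
    classifier returning logits. All names here are placeholders.
    """
    logits_det = model(det_batch)      # [batch, num_labels]
    logits_smp = model(sampled_batch)  # [batch, num_labels]

    # Standard task loss computed on both views.
    task_loss = F.cross_entropy(logits_det, labels) + F.cross_entropy(logits_smp, labels)

    # Symmetric KL divergence between the two predictive distributions.
    log_p_det = F.log_softmax(logits_det, dim=-1)
    log_p_smp = F.log_softmax(logits_smp, dim=-1)
    consistency = 0.5 * (
        F.kl_div(log_p_smp, log_p_det.exp(), reduction="batchmean")
        + F.kl_div(log_p_det, log_p_smp.exp(), reduction="batchmean")
    )
    return task_loss + consistency_weight * consistency
```

The consistency weight trades off how strongly the sampled view is pulled toward the deterministic one; with the weight set to zero the sketch reduces to plain subword regularization applied at fine-tuning time.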