Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.40

Multi-view Subword Regularization

Abstract: Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations…
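
As a rough illustration of the subword regularization the abstract refers to (Kudo, 2018), the sketch below samples a different segmentation of the same sentence on each pass using the SentencePiece library; the model file path and example text are placeholders, not artifacts of the paper.

```python
import sentencepiece as spm

# Load a trained SentencePiece model; "spm.model" is a placeholder path.
sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "unbelievably good results"

# Deterministic (1-best) segmentation, as typically used at inference time.
print(sp.encode(text, out_type=str))

# Subword regularization (Kudo, 2018): sample alternative segmentations from
# the unigram LM lattice so that training sees multiple tokenizations.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```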

Cited by 18 publications (18 citation statements) · References 31 publications
“…On the other hand, many adaptation techniques have focused on improving representation of the target language by modifying the model's vocabulary or tokenization schemes (Chung et al., 2020; Clark et al., 2021; Wang et al., 2021). This is well-motivated: Artetxe et al. (2020) emphasize representation in the vocabulary as a key factor for effective cross-lingual transfer, while Rust et al. (2021) find that mBERT's tokenization scheme for many languages is subpar.…”
Section: Related Work (mentioning)
confidence: 99%
“…Consequently, models such as multilingual BERT [Devlin et al., 2019] and XLM-R [Conneau et al., 2020] employ the same subword tokenization algorithms as monolingual models, now applied to a massively multilingual corpus. In the multilingual setting, the problems of subword-based tokenization are exacerbated: tokens in languages with little data are over-segmented while high-frequency tokens are under-segmented, which limits cross-lingual transfer [Wang et al., 2021]. This motivates our work as well as recent work on character-level models.…”
Section: Related Work (mentioning)
confidence: 99%
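
To make the over- and under-segmentation issue quoted above concrete, the sketch below compares tokenizer fertility (subword pieces per whitespace-separated word) under mBERT's tokenizer via the HuggingFace transformers library, in the spirit of Rust et al. (2021); the example sentences are illustrative assumptions, not data from any of the cited papers.

```python
from transformers import AutoTokenizer

# mBERT's shared WordPiece tokenizer, trained on ~100 languages.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(sentence: str) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = sentence.split()
    pieces = tok.tokenize(sentence)
    return len(pieces) / len(words)

# Illustrative comparison: a lower-resource language typically shows a
# noticeably higher fertility (more aggressive segmentation) than English.
print(fertility("The committee approved the new regulations yesterday."))
print(fertility("Kamati iliidhinisha kanuni mpya jana."))  # Swahili example
```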
“…To make a model more robust to morphological and compositional generalization, probabilistic segmentation algorithms such as subword regularization [Kudo, 2018] and BPE-dropout [Provilkov et al., 2020] have been proposed, which sample different segmentations during training. Recent methods make models more robust for downstream tasks by enforcing prediction consistency between deterministic and probabilistic segmentations [Wang et al., 2021], or by updating the tokenizer based on the downstream loss under different segmentations [Hiraoka et al., 2020, 2021]. He et al. [2020] proposed DPE (dynamic programming encoding), a subword segmentation algorithm based on dynamic programming.…”
Section: Related Work (mentioning)
confidence: 99%
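
A minimal sketch of the consistency training mentioned in the statement above, assuming a generic classification model in PyTorch: the two "views" are a deterministic and a sampled segmentation of the same inputs, and a symmetric KL term pushes their predictions together. The function name, batch format, and loss weight are placeholders, and this is an approximation of the idea in Wang et al. (2021), not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multi_view_loss(model, det_batch, sampled_batch, labels, consistency_weight=1.0):
    """Supervised loss on two segmentation views of the same sentences,
    plus a consistency term that encourages their predictions to agree.

    det_batch / sampled_batch: token ids from a deterministic and a sampled
    (subword-regularized) segmentation of the same inputs; `model` is any
    classifier returning logits. All names here are placeholders.
    """
    logits_det = model(det_batch)      # [batch, num_labels]
    logits_smp = model(sampled_batch)  # [batch, num_labels]

    # Standard task loss computed on both views.
    task_loss = F.cross_entropy(logits_det, labels) + F.cross_entropy(logits_smp, labels)

    # Symmetric KL divergence between the two predictive distributions.
    log_p_det = F.log_softmax(logits_det, dim=-1)
    log_p_smp = F.log_softmax(logits_smp, dim=-1)
    consistency = 0.5 * (
        F.kl_div(log_p_smp, log_p_det.exp(), reduction="batchmean")
        + F.kl_div(log_p_det, log_p_smp.exp(), reduction="batchmean")
    )
    return task_loss + consistency_weight * consistency
```

The consistency weight trades off how strongly the sampled view is pulled toward the deterministic one; with the weight set to zero the sketch reduces to plain subword regularization applied at fine-tuning time.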