Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.423

Tackling the Low-resource Challenge for Canonical Segmentation

Abstract: Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM…

Cited by 9 publications (10 citation statements)
References 40 publications
“…Our study also relates to computational work on derivational morphology (Cotterell and Schütze, 2018; Deutsch et al., 2018; Hofmann et al., 2020a,b,c) and word segmentation (Kann et al., 2016; Ruzsics and Samardžić, 2017; Mager et al., 2019, 2020; Seker and Tsarfaty, 2020; Amrhein and Sennrich, 2021). We are the first to systematically evaluate the segmentations of PLM tokenizers on human-annotated gold data.…”
Section: Related Work (mentioning)
confidence: 93%
“…Morpheme segmentation is a well-established task in computational linguistics (cf. Mager et al (2020)). Recently, two definitions of morpheme segmentations have emerged: "Shallow segmentation" and "canonical segmentation" (Kann et al, 2016).…”
Section: Related Work (mentioning)
confidence: 99%
“…There are both supervised (e.g. pointer generator networks (Mager et al, 2020)) and unsupervised approaches (e.g. the Morfessor family of methods (Creutz and Lagus, 2002;Poon and Domingos, 2009) or Adaptor Grammars (Eskander et al, 2019)), where the former ones have outperformed the latter ones.…”
Section: Morphological Segmentation and Analysis (mentioning)
confidence: 99%
“…
• Morfessor (Poon and Domingos, 2009). We analysed several vocabulary sizes (4k, 8k, 16k, 32k, 64k)
• Pointer Generator Network (PtrNet) from the implementation of Mager et al (2020).…”
Section: Synthesis: Automatic Computation (mentioning)
confidence: 99%