Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.347

Minimally-Supervised Morphological Segmentation using Adaptor Grammars with Linguistic Priors

Abstract: With the increasing interest in low-resource languages, unsupervised morphological segmentation has become an active area of research, where approaches based on Adaptor Grammars achieve state-of-the-art results. We demonstrate the power of harnessing linguistic knowledge as priors within Adaptor Grammars in a minimally-supervised learning fashion. We introduce two types of priors: 1) grammar definition, where we design language-specific grammars; and 2) linguist-provided affixes, collected by an expert in the l…
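
The abstract's second prior, linguist-provided affixes, is commonly injected by seeding the grammar with one character-level rule per known affix. The sketch below is a minimal illustration under that assumption; the rule format imitates common Adaptor Grammar grammar-file conventions, and the affix lists and nonterminal names are hypothetical stand-ins, not the paper's actual seeds.

```python
# Minimal sketch: turn a linguist-provided affix list into character-level
# seed rules for an Adaptor Grammar. The "Suffix --> i n g" rule style
# (one terminal per character) imitates common Adaptor Grammar input
# formats; the affixes and nonterminal names here are hypothetical.
def affix_seed_rules(nonterminal, affixes):
    """Emit one character-level grammar rule per known affix."""
    return [f"{nonterminal} --> {' '.join(affix)}" for affix in affixes]

prefixes = ["un", "re", "dis"]  # stand-ins for expert-provided prefixes
suffixes = ["ing", "ed", "s"]   # stand-ins for expert-provided suffixes

for rule in affix_seed_rules("Prefix", prefixes) + affix_seed_rules("Suffix", suffixes):
    print(rule)  # e.g. "Prefix --> u n", "Suffix --> i n g"
```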

Cited by 3 publications (2 citation statements)
References 13 publications
“…MorphAGram We also include in this study the unsupervised morphology segmenter MorphAGram (Eskander et al., 2020), which is based on Adaptor Grammars. We use the PrStSu+SM grammar, which represents a word as a sequence of prefixes followed by a stem and then a sequence of suffixes, in the unsupervised Standard learning setting to train the segmenters.…”
Section: Segmentation Systems (citation type: mentioning)
Confidence: 99%
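
To make the PrStSu structure concrete, here is a minimal sketch of the context-free skeleton it implies: a word is zero or more prefixes, then a stem, then zero or more suffixes. The rule names are illustrative; MorphAGram's actual PrStSu+SM grammar additionally marks which nonterminals are adapted (cached) and expands each morph down to characters, which is omitted here.

```python
# Illustrative PrStSu-style grammar skeleton (rule names are hypothetical;
# the real MorphAGram grammar also adapts nonterminals and models morphs
# at the character level).
PRSTSU_SKELETON = """\
Word     --> Prefixes Stem Suffixes
Prefixes --> Prefix Prefixes
Prefixes -->
Suffixes --> Suffix Suffixes
Suffixes -->
"""

def render_analysis(prefixes, stem, suffixes, sep="+"):
    """Flatten a PrStSu analysis (prefix*, stem, suffix*) into a string."""
    return sep.join([*prefixes, stem, *suffixes])

print(render_analysis(["un"], "believ", ["able"]))  # un+believ+able
```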
“…We configured several baselines: (1) based on a simple weighted Finite-State Transducer (FST) to maximise the morpheme frequency (Richardson and Tyers, 2021), (2) based on Morfessor version 2.0 (Virpioja et al., 2013) to learn the morpheme boundaries using minimum description length optimization, and (3) based on the Adaptor Grammar approach. We used the MorphAGram toolkit (Eskander et al., 2020) with two settings: the standard setting (AdaGra-Std) and the scholar-seeded setting (AdaGra-SS). We adopted the best learning settings for Innu-Aimun: the best standard PrefixStemSuffix+SuffixMorph grammar and the best scholar-seeded grammar, as explained in Eskander et al. (2019).…”
Section: Training Settings (citation type: mentioning)
Confidence: 99%
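
For the second baseline, Morfessor 2.0 exposes a small Python API. Below is a minimal sketch of unsupervised training and segmentation, assuming `pip install morfessor` and a newline-delimited word list; the file path and the test word are illustrative.

```python
import morfessor

# Load a training corpus of words (the file path is illustrative).
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("training_words.txt"))

# Fit the Morfessor Baseline model, which searches for the morph lexicon
# and segmentations that minimize an MDL-style two-part cost.
model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Viterbi-segment a word into morphs under the learned model.
morphs, cost = model.viterbi_segment("segmentation")
print(morphs)
```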