Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.322
How low is too low? A monolingual take on lemmatisation in Indian languages

Abstract: Lemmatization aims to reduce the sparse-data problem by relating the inflected forms of a word to its dictionary form. Most prior work on ML-based lemmatization has focused on high-resource languages, where datasets (word forms) are readily available. For languages with no available linguistic work, especially on morphology, or languages where the computational realization of linguistic rules is complex and cumbersome, machine-learning-based lemmatizers are the way to go. In this paper, we devote our …
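The core idea in the abstract — mapping inflected forms to a single dictionary form to reduce data sparsity — can be illustrated with a minimal lookup-based sketch. This is a toy example, not the paper's method; the word pairs are invented English placeholders.

```python
# Toy illustration: a lemmatizer collapses many inflected surface forms
# onto one dictionary form, shrinking the effective vocabulary.
# All (inflected, lemma) pairs below are hypothetical examples.
inflection_table = {
    "running": "run", "ran": "run", "runs": "run",
    "better": "good", "best": "good",
    "studies": "study", "studied": "study",
}

def lemmatize(word: str) -> str:
    """Return the dictionary form if known, else the word unchanged."""
    return inflection_table.get(word, word)

tokens = ["runs", "studied", "better", "fast"]
print([lemmatize(t) for t in tokens])  # ['run', 'study', 'good', 'fast']
```

A real system must generalize beyond a fixed table — which is why the paper and the works citing it learn the mapping with machine-learned models instead.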

Cited by 3 publications (3 citation statements). References 22 publications (19 reference statements).
“…Release of the Universal Dependencies (UD) dataset (de Marneffe et al., 2014; Nivre et al., 2017) and the SIGMORPHON 2019 shared task formed the basis of encoder-decoder architectures that solve lemmatization as a string-transduction task (Qi et al., 2018; Kanerva et al., 2018). For the Bangla language, Saunack et al. (2021) employed a similar two-step attention network that took morphological tags and inflected words as input. Islam et al. (2022) used the PoS tag of each word as an additional feature to the encoder-decoder network, achieving 95.75% accuracy on the validation dataset.…”
Section: Related Work (mentioning; confidence: 99%)
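The statement above frames lemmatization as string transduction. One common way to pose this target — a sketch of the general idea, not the cited encoder-decoder architecture itself — is as a character-level edit script that rewrites the inflected form into its lemma; a model can then be trained to predict such scripts. The example uses Python's standard `difflib` to derive the edits.

```python
import difflib

# Sketch (assumption, not the cited models): express lemmatization as the
# character-level edit operations turning an inflected form into its lemma.
def edit_script(inflected: str, lemma: str):
    """Return the non-trivial (op, source_span, target_span) edits."""
    sm = difflib.SequenceMatcher(a=inflected, b=lemma)
    return [
        (tag, inflected[i1:i2], lemma[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]

print(edit_script("studies", "study"))   # [('replace', 'ies', 'y')]
print(edit_script("running", "run"))     # [('delete', 'ning', '')]
```

Edit-script targets are often shorter and more regular than full output strings, which is one reason transduction-style formulations suit morphologically rich, low-resource settings.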
“…Bergmanis and Goldwater (2018) evaluate their models both on the full amount of available data and on 10k samples. Saunack et al. (2021) explore the lower bound on training-data size for Indian languages: they compare a standard setting with low-resource settings of only 500 and 100 training instances, in which they rely on data-augmentation techniques. Saurav et al. (2020) investigate cross-lingual approaches for lemmatizing low-resource Indian languages.…”
Section: Related Work (mentioning; confidence: 99%)
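The low-resource experimental setup described above — comparing full training data against subsets of only 500 and 100 instances — can be sketched as a simple seeded subsampling step. This is a hypothetical illustration of the setup, not the authors' code; the training pairs are invented placeholders, and the augmentation techniques they rely on are not implemented here.

```python
import random

# Hypothetical (word, lemma) training pairs standing in for a real dataset.
full_train = [(f"word{i}", f"lemma{i}") for i in range(5000)]

def low_resource_subset(pairs, n, seed=0):
    """Draw a reproducible n-instance subset to simulate a low-resource setting."""
    rng = random.Random(seed)
    return rng.sample(pairs, n)

for n in (500, 100):
    subset = low_resource_subset(full_train, n)
    print(f"low-resource setting: {len(subset)} training instances")
```

Fixing the seed keeps the comparison across data sizes reproducible, which matters when results at 100 instances can swing heavily with the particular sample drawn.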