Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.322
How low is too low? A monolingual take on lemmatisation in Indian languages

Abstract: Lemmatization aims to reduce the sparse-data problem by relating the inflected forms of a word to its dictionary form. Most prior work on ML-based lemmatization has focused on high-resource languages, where datasets (word forms) are readily available. For languages with no available linguistic work, especially on morphology, or languages where the computational realization of linguistic rules is complex and cumbersome, machine-learning-based lemmatizers are the way to go. In this paper, we devote our …
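The core idea in the abstract — mapping inflected forms to a single dictionary form to reduce data sparsity — can be illustrated with a minimal lookup-based sketch. This is a toy example, not the paper's method; the word pairs are invented English placeholders.

```python
# Toy illustration: a lemmatizer collapses many inflected surface forms
# onto one dictionary form, shrinking the effective vocabulary.
# All (inflected, lemma) pairs below are hypothetical examples.
inflection_table = {
    "running": "run", "ran": "run", "runs": "run",
    "better": "good", "best": "good",
    "studies": "study", "studied": "study",
}

def lemmatize(word: str) -> str:
    """Return the dictionary form if known, else the word unchanged."""
    return inflection_table.get(word, word)

tokens = ["runs", "studied", "better", "fast"]
print([lemmatize(t) for t in tokens])  # ['run', 'study', 'good', 'fast']
```

A real system must generalize beyond a fixed table — which is why the paper and the works citing it learn the mapping with machine-learned models instead.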

Cited by 3 publications (3 citation statements). References 22 publications (19 reference statements).
“…Release of the Universal Dependencies (UD) dataset (de Marneffe et al., 2014; Nivre et al., 2017) and the SIGMORPHON 2019 shared task formed the basis of encoder-decoder architectures that solve lemmatization as a string-transduction task (Qi et al., 2018; Kanerva et al., 2018). For the Bangla language, Saunack et al. (2021) employed a similar two-step attention network that took morphological tags and inflected words as input. Islam et al. (2022) used the PoS tag of each word as an additional feature to the encoder-decoder network, achieving 95.75% accuracy on the validation dataset.…”
Section: Related Work (mentioning; confidence: 99%)
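The statement above frames lemmatization as string transduction. One common way to pose this target — a sketch of the general idea, not the cited encoder-decoder architecture itself — is as a character-level edit script that rewrites the inflected form into its lemma; a model can then be trained to predict such scripts. The example uses Python's standard `difflib` to derive the edits.

```python
import difflib

# Sketch (assumption, not the cited models): express lemmatization as the
# character-level edit operations turning an inflected form into its lemma.
def edit_script(inflected: str, lemma: str):
    """Return the non-trivial (op, source_span, target_span) edits."""
    sm = difflib.SequenceMatcher(a=inflected, b=lemma)
    return [
        (tag, inflected[i1:i2], lemma[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]

print(edit_script("studies", "study"))   # [('replace', 'ies', 'y')]
print(edit_script("running", "run"))     # [('delete', 'ning', '')]
```

Edit-script targets are often shorter and more regular than full output strings, which is one reason transduction-style formulations suit morphologically rich, low-resource settings.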
“…Bergmanis and Goldwater (2018) evaluate their models both on the full amount of available data and on 10k samples. Saunack et al. (2021) explore the lower bound on training-data size for Indian languages: they compare a standard setting with low-resource settings of only 500 and 100 training instances, in which they rely on data-augmentation techniques. Saurav et al. (2020) investigate cross-lingual approaches for lemmatizing low-resource Indian languages.…”
Section: Related Work (mentioning; confidence: 99%)
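The low-resource experimental setup described above — comparing full training data against subsets of only 500 and 100 instances — can be sketched as a simple seeded subsampling step. This is a hypothetical illustration of the setup, not the authors' code; the training pairs are invented placeholders, and the augmentation techniques they rely on are not implemented here.

```python
import random

# Hypothetical (word, lemma) training pairs standing in for a real dataset.
full_train = [(f"word{i}", f"lemma{i}") for i in range(5000)]

def low_resource_subset(pairs, n, seed=0):
    """Draw a reproducible n-instance subset to simulate a low-resource setting."""
    rng = random.Random(seed)
    return rng.sample(pairs, n)

for n in (500, 100):
    subset = low_resource_subset(full_train, n)
    print(f"low-resource setting: {len(subset)} training instances")
```

Fixing the seed keeps the comparison across data sizes reproducible, which matters when results at 100 instances can swing heavily with the particular sample drawn.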