Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-Nut 2021) 2021
DOI: 10.18653/v1/2021.wnut-1.55

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Abstract: Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media, on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. How…
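To illustrate the task the abstract describes, the sketch below normalizes noisy social-media tokens with a simple lookup lexicon. This is a toy illustration only, not the approach of any MultiLexNorm system; the lexicon entries and function name are invented for the example, and real shared-task entries use learned models.

```python
# Toy dictionary-based lexical normalizer (illustrative only).
# Maps non-standard tokens to their standardized forms; unknown
# tokens are passed through unchanged.
NORMALIZATION_LEXICON = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "pls": "please",
    "2morrow": "tomorrow",
}

def normalize(utterance: str) -> str:
    """Replace each whitespace-separated token with its standard form, if known."""
    tokens = utterance.split()
    return " ".join(NORMALIZATION_LEXICON.get(t.lower(), t) for t in tokens)

print(normalize("u r gr8"))          # you are great
print(normalize("see u 2morrow"))    # see you tomorrow
```

A lookup table like this cannot handle ambiguous or unseen variants, which is precisely why the shared task compares learned normalization models across languages.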

Cited by 12 publications (10 citation statements)
References 23 publications
“…Interesting future avenues for research include also studying the impact of lexical normalization on downstream abusive language detection and religious hate speech detection performance, using monolingual ( van der Goot et al, 2020 ; Baldwin et al, 2015 ) or multilingual datasets ( van der Goot et al, 2021a ), as well as exploiting multiple annotations on the Italian portion of the dataset to study intersectionality.…”
Section: Discussion
confidence: 99%
“…However, in this study, a simpler technique was used, which involved term standardisation. Term standardisation assures homogeneity and harmonisation throughout the texts and decreases the needed efforts for further text processing [34,35,36]. In addition, through the standardisation process, synonyms, slang, abbreviations, and other related aspects can be standardised, which potentially enhances the ability of LDA in identifying topic terms as LDA considers the distribution and frequency of words in the documents.…”
Section: B. Topic Modelling Using LDA
confidence: 99%
“…Some multilingual datasets for question answering (TyDiQA; Clark et al, 2020), common sense reasoning (XCOPA; Ponti et al, 2020), abstractive summarization (Hasan et al, 2021), passage ranking (mMARCO; Bonifacio et al, 2021), cross-lingual visual question answering (xGQA; Pfeiffer et al, 2021), language and vision reasoning (MaRVL; Liu et al, 2021), paraphrasing (Para-Cotta; ), dialogue systems (XPersona & BiToD; Lin et al, 2021a,b), lexical normalization (MultiLexNorm; van der Goot et al, 2021), and machine translation (FLORES-101; Guzmán et al, 2019) include Indonesian, but most others do not, and very few include Indonesian local languages. An exception is the weakly supervised named entity recognition dataset, WikiAnn (Pan et al, 2017), which covers several Indonesian local languages, namely Acehnese, Javanese, Minangkabau, and Sundanese.…”
Section: Efforts in Multilingual Research
confidence: 99%