Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-Nut 2021) 2021
DOI: 10.18653/v1/2021.wnut-1.55

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Abstract: Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media, on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. How…
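To illustrate the task the abstract describes, the sketch below normalizes noisy social-media tokens with a simple lookup lexicon. This is a toy illustration only, not the approach of any MultiLexNorm system; the lexicon entries and function name are invented for the example, and real shared-task entries use learned models.

```python
# Toy dictionary-based lexical normalizer (illustrative only).
# Maps non-standard tokens to their standardized forms; unknown
# tokens are passed through unchanged.
NORMALIZATION_LEXICON = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "pls": "please",
    "2morrow": "tomorrow",
}

def normalize(utterance: str) -> str:
    """Replace each whitespace-separated token with its standard form, if known."""
    tokens = utterance.split()
    return " ".join(NORMALIZATION_LEXICON.get(t.lower(), t) for t in tokens)

print(normalize("u r gr8"))          # you are great
print(normalize("see u 2morrow"))    # see you tomorrow
```

A lookup table like this cannot handle ambiguous or unseen variants, which is precisely why the shared task compares learned normalization models across languages.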

Cited by 12 publications (10 citation statements)
References 23 publications
“…Interesting future avenues for research include also studying the impact of lexical normalization on downstream abusive language detection and religious hate speech detection performance, using monolingual ( van der Goot et al, 2020 ; Baldwin et al, 2015 ) or multilingual datasets ( van der Goot et al, 2021a ), as well as exploiting multiple annotations on the Italian portion of the dataset to study intersectionality.…”
Section: Discussion
confidence: 99%
“…However, in this study, a simpler technique was used, which involved term standardisation. Term standardisation assures homogeneity and harmonisation throughout the texts and decreases the needed efforts for further text processing [34,35,36]. In addition, through the standardisation process, synonyms, slang, abbreviations, and other related aspects can be standardised, which potentially enhances the ability of LDA in identifying topic terms as LDA considers the distribution and frequency of words in the documents.…”
Section: B. Topic Modelling Using LDA
confidence: 99%
“…Some multilingual datasets for question answering (TyDiQA; Clark et al, 2020), common sense reasoning (XCOPA; Ponti et al, 2020), abstractive summarization (Hasan et al, 2021), passage ranking (mMARCO; Bonifacio et al, 2021), cross-lingual visual question answering (xGQA; Pfeiffer et al, 2021), language and vision reasoning (MaRVL; Liu et al, 2021), paraphrasing (Para-Cotta; ), dialogue systems (XPersona & BiToD; Lin et al, 2021a,b), lexical normalization (MultiLexNorm; van der Goot et al, 2021), and machine translation (FLORES-101; Guzmán et al, 2019) include Indonesian, but most others do not, and very few include Indonesian local languages. An exception is the weakly supervised named entity recognition dataset, WikiAnn (Pan et al, 2017), which covers several Indonesian local languages, namely Acehnese, Javanese, Minangkabau, and Sundanese.…”
Section: Efforts in Multilingual Research
confidence: 99%