Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.200

Lexical Normalization for Code-switched Data and its Effect on POS Tagging

Abstract: Lexical normalization, the translation of noncanonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-Eng…

Cited by 11 publications (3 citation statements)
References 23 publications
“…Indeed, in our experience speakers write "the way words sound" in their local variants, using just the available characters in their keyboards. Normalizing user-generated texts to a "standard" form (e.g., Baldwin et al., 2015; van der Goot et al., 2020, 2021a) has proven useful for NLP purposes, but it inevitably erases the naturally occurring sociolinguistic variation (Nguyen et al., 2021), homogenizing all variants of a language variety and imposing a "correct" form of writing.…”
Section: Uniform Functions Contexts and Needs
Citation type: mentioning
confidence: 99%
“…The dataset is available through a GitHub repository. Besides these monolingual resources, a normalization dataset for Turkish-German is also available (van der Goot and Çetinoğlu, 2021). This dataset is a revised version of the data from Çetinoğlu and Çöltekin (2016) for normalization by employing token-level alignment layers and adapting existing language ID and POS tags for these new layers.…”
Section: Social Media Text Normalization Corpora
Citation type: mentioning
confidence: 99%
“…Avrupa ve Amerika'da Valentina Day diye geçer. Turkish-German (van der Goot and Çetinoglu, 2021) artik ablamdan bise yuruturum napim :D Artık ablamdan bir şey yürütürüm ne yapayım :D Table 2: Noisy examples from each language and the corresponding canonical forms.…”
Section: Language Example Raw Example Gold
Citation type: unclassified