Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-Nut 2021) 2021
DOI: 10.18653/v1/2021.wnut-1.53
Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

Abstract: Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization: the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level seq…
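The abstract describes lexical normalization as mapping non-standard tokens (slang, abbreviations) to canonical forms. A minimal dictionary-based sketch of the task is shown below; the lookup entries are hypothetical examples, and the paper's actual approach is a learned sentence-level sequence-to-sequence transformer, not a lookup table.

```python
# Toy illustration of lexical normalization: replace known non-standard
# tokens with canonical forms. The table entries are hypothetical; real
# systems (like the paper's) learn this mapping from annotated data.
NORMALIZATION_TABLE = {
    "u": "you",
    "gr8": "great",
    "pls": "please",
    "thx": "thanks",
}

def normalize(sentence: str) -> str:
    """Map each token to its canonical form, leaving unknown tokens as-is."""
    return " ".join(
        NORMALIZATION_TABLE.get(tok.lower(), tok) for tok in sentence.split()
    )

print(normalize("thx u r gr8"))  # -> thanks you r great
```

Note that a dictionary cannot handle context-dependent cases (e.g. "u" as a variable name vs. the pronoun), which is one motivation for the sentence-level seq2seq formulation the paper proposes.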

Cited by 9 publications (11 citation statements). References 21 publications (24 reference statements).
“…However, in this study, a simpler technique was used, which involved term standardisation. Term standardisation assures homogeneity and harmonisation throughout the texts and decreases the needed efforts for further text processing [34,35,36]. In addition, through the standardisation process, synonyms, slang, abbreviations, and other related aspects can be standardised, which potentially enhances the ability of LDA in identifying topic terms as LDA considers the distribution and frequency of words in the documents.…”
Section: B. Topic Modelling Using LDA
confidence: 99%
“…Moreover, the text in the MEMOTION 2.0 dataset is cleaned by human annotators. However, for a large-scale meme dataset used for pretraining, one can employ lexical normalization models [56,57] to automatically correct faulty OCR and transform the text to its canonical form, which was a significant problem in computational pipelines from the first edition of this shared task.…”
Section: Multi-modal Experiments
confidence: 99%
“…With the emergence of general purpose language models, many recent papers present work on using such models for text normalization. BERT (Muller et al., 2019; Plank et al., 2020), BART (Bucur et al., 2021) and RoBERTa (Kubal and Nagvenkar, 2021), for instance, have all been used recently to solve the task.…”
Section: Related Work
confidence: 99%