Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-Nut 2021) 2021
DOI: 10.18653/v1/2021.wnut-1.53
Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

Abstract: Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization: the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level seq…
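The abstract describes lexical normalization as mapping non-standard tokens (slang, abbreviations) to canonical forms. A minimal dictionary-based sketch of the task is shown below; the lookup entries are hypothetical examples, and the paper's actual approach is a learned sentence-level sequence-to-sequence transformer, not a lookup table.

```python
# Toy illustration of lexical normalization: replace known non-standard
# tokens with canonical forms. The table entries are hypothetical; real
# systems (like the paper's) learn this mapping from annotated data.
NORMALIZATION_TABLE = {
    "u": "you",
    "gr8": "great",
    "pls": "please",
    "thx": "thanks",
}

def normalize(sentence: str) -> str:
    """Map each token to its canonical form, leaving unknown tokens as-is."""
    return " ".join(
        NORMALIZATION_TABLE.get(tok.lower(), tok) for tok in sentence.split()
    )

print(normalize("thx u r gr8"))  # -> thanks you r great
```

Note that a dictionary cannot handle context-dependent cases (e.g. "u" as a variable name vs. the pronoun), which is one motivation for the sentence-level seq2seq formulation the paper proposes.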

Cited by 9 publications (11 citation statements). References 21 publications (24 reference statements).
“…However, in this study, a simpler technique was used, which involved term standardisation. Term standardisation assures homogeneity and harmonisation throughout the texts and decreases the needed efforts for further text processing [34,35,36]. In addition, through the standardisation process, synonyms, slang, abbreviations, and other related aspects can be standardised, which potentially enhances the ability of LDA in identifying topic terms as LDA considers the distribution and frequency of words in the documents.…”
Section: B. Topic Modelling Using LDA
confidence: 99%
“…Moreover, the text in the MEMOTION 2.0 dataset is cleaned by human annotators. However, for a large-scale meme dataset used for pretraining, one can employ lexical normalization models [56,57] to automatically correct faulty OCR and transform the text to its canonical form, which was a significant problem in computational pipelines from the first edition of this shared task.…”
Section: Multi-modal Experiments
confidence: 99%
“…With the emergence of general purpose language models, many recent papers present work on using such models for text normalization. BERT (Muller et al., 2019; Plank et al., 2020), BART (Bucur et al., 2021) and RoBERTa (Kubal and Nagvenkar, 2021), for instance, have all been used recently to solve the task.…”
Section: Related Work
confidence: 99%