José Carlos Rosales Núñez scite author profile

José Carlos Rosales Núñez

4 Publications

0 Citation Statements Received

46 Citation Statements Given

Publications

Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models

Núñez, Wisniewski, Seddah

2021

This work explores the capacity of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC), with a strong focus on the limits of such approaches in handling productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact of various UGC phenomena on translation performance, using a small annotated dataset we developed. We then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple yet insightful copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.

Understanding the Impact of UGC Specificities on Translation Quality

Núñez, Seddah, Wisniewski

2021

This work takes a critical look at the evaluation of automatic translation of user-generated content (UGC), the well-known specificities of which raise many challenges for MT. Our analyses show that measuring average-case performance with a standard metric on a UGC test set falls far short of giving a reliable picture of UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.

Phonetic Normalization for Machine Translation of User Generated Content

Núñez, Seddah, Wisniewski

2019

We present an approach to correct noisy User Generated Content (UGC) in French, aiming to produce a pre-processing pipeline that improves Machine Translation for this kind of non-canonical corpus. Our approach leverages the fact that some errors are due to confusion between words with similar pronunciation, which can be corrected using a phonetic lookup table to produce normalization candidates. We rely on a character-based neural phonetizer to produce IPA pronunciations of words, and on a similarity metric based on the IPA representation of words that allows us to identify words with similar pronunciation. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to other phonetizers, our method boosts a Transformer-based machine translation system on UGC.

Understanding the Impact of UGC Specificities on Translation Quality

Núñez, Seddah, Wisniewski

2021

Preprint
