Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1579
Generalized Data Augmentation for Low-Resource Translation

Abstract: Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language (HRL). Specifically, we experiment with a two-step…

Cited by 86 publications (50 citation statements)
References 30 publications
“…The biggest disadvantage of these methods is that they do not preserve the contextual meaning of the sentences, so we present more complex approaches that retain the meaning of the original sentence. Back translation aims to obtain more training samples from existing translators, and many research teams have used it to improve translation models [12][13][14][15][23]. The technique works by translating the original data into another language, then feeding the translated data into an independent translator to translate it back to the original language.…”
Section: Data Augmentation
confidence: 99%
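The round-trip procedure described in the excerpt can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the two translator functions are toy word-level lookup stubs standing in for real MT models, and the language pair and vocabulary are invented for the example.

```python
# Toy stand-ins for two independent translation systems. In practice these
# would be trained MT models; here they are word-level lookup tables.

def translate_en_to_de(sentence):
    # Stub EN->DE "translator": per-word dictionary lookup.
    table = {"the": "die", "cat": "katze", "sleeps": "schlaeft"}
    return " ".join(table.get(w, w) for w in sentence.split())

def translate_de_to_en(sentence):
    # Stub DE->EN "translator"; maps "schlaeft" back to the paraphrase "naps",
    # imitating how a real back-translation round trip produces variation.
    table = {"die": "the", "katze": "cat", "schlaeft": "naps"}
    return " ".join(table.get(w, w) for w in sentence.split())

def round_trip_augment(sentences):
    """Round-trip back translation: translate each sentence into a pivot
    language and back again, keeping the result as an extra training sample
    when it differs from (i.e. paraphrases) the original."""
    augmented = []
    for s in sentences:
        pivot = translate_en_to_de(s)      # original -> pivot language
        back = translate_de_to_en(pivot)   # pivot -> back to original language
        if back != s:                      # keep only genuine paraphrases
            augmented.append(back)
    return augmented

print(round_trip_augment(["the cat sleeps"]))  # ['the cat naps']
```

The key design point is that the two directions are independent systems, so the round trip introduces lexical variation rather than reproducing the input verbatim.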
“…Our approach bears similarities to pseudo-corpus approaches that have been used in machine translation (MT), where low-resource language data are augmented with data generated from a related high-resource language. Among many, for instance, De Gispert and Marino (2006) built a Catalan-English MT by bridging through Spanish, while Xia et al (2019) show that word-level substitutions can convert a high-resource (related) language corpus into a pseudo low-resource one leading to large improvements in MT quality. Such approaches typically operate at the word level, hence they do not need to handle script differences explicitly.…”
Section: Introduction
confidence: 99%
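The word-level substitution idea attributed to Xia et al. (2019) in the excerpt can be sketched like this. The sketch is illustrative only: the Spanish-to-Catalan dictionary, the language pair, and the fallback behaviour are assumptions for the example, not the paper's actual setup.

```python
# Hypothetical induced bilingual dictionary: high-resource related language
# (here Spanish) -> low-resource language (here Catalan).
HRL_TO_LRL = {
    "el": "el",
    "gato": "gat",
    "duerme": "dorm",
}

def substitute(sentence, dictionary):
    """Replace each HRL word with its LRL counterpart when covered by the
    dictionary; out-of-dictionary words are copied through unchanged, a
    plausible fallback for closely related languages sharing cognates."""
    return " ".join(dictionary.get(w, w) for w in sentence.split())

def pseudo_lrl_corpus(hrl_parallel, dictionary):
    """Convert an HRL-English parallel corpus into pseudo LRL-English pairs
    by word-level substitution on the source side, keeping targets intact."""
    return [(substitute(src, dictionary), tgt) for src, tgt in hrl_parallel]

corpus = [("el gato duerme", "the cat sleeps")]
print(pseudo_lrl_corpus(corpus, HRL_TO_LRL))
# [('el gat dorm', 'the cat sleeps')]
```

Because the substitution operates purely at the word level, as the excerpt notes, it sidesteps any need to model script differences explicitly.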
“…Recently, there has been growing interest in low-resource NLP, with work in part-of-speech tagging (Plank and Agić, 2018), parsing (Rasooli and Collins, 2017), machine translation (Xia et al, 2019), and other fields. Low-resource NER has seen work using Wikipedia (Tsai et al, 2016), self attention (Xie et al, 2018), and multilingual contextual representations (Wu and Dredze, 2019).…”
Section: Related Work
confidence: 99%