2020
DOI: 10.1609/aaai.v34i05.6369

Simplify-Then-Translate: Automatic Preprocessing for Black-Box Translation

Abstract: Black-box machine translation systems have proven incredibly useful for a variety of applications yet by design are hard to adapt, tune to a specific domain, or build on top of. In this work, we introduce a method to improve such systems via automatic pre-processing (APP) using sentence simplification. We first propose a method to automatically generate a large in-domain paraphrase corpus through back-translation with a black-box MT system, which is used to train a paraphrase model that “simplifies” the origin…
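The sketch below is a rough, non-authoritative reading of the pipeline the abstract describes: a trained paraphrase model rewrites the source sentence before it reaches the unmodified black-box translator. All names (`simplify`, `translate_black_box`, `simplify_then_translate`) are hypothetical stand-ins, not identifiers from the paper.

```python
# Minimal sketch of simplify-then-translate, assuming only black-box access
# to the MT system. All names here are hypothetical stand-ins.

def translate_black_box(text: str, src: str, tgt: str) -> str:
    """Stand-in for an unmodifiable MT API (e.g., a commercial service)."""
    raise NotImplementedError("call the MT provider here")

def simplify(text: str) -> str:
    """Stand-in for the paraphrase model trained to simplify the source."""
    raise NotImplementedError("run the trained paraphrase model here")

def simplify_then_translate(source: str, src_lang: str, tgt_lang: str) -> str:
    # 1) Automatic pre-processing (APP): rewrite the source into a simpler
    #    paraphrase that the black-box system handles more reliably.
    simplified = simplify(source)
    # 2) Translate the simplified sentence with the untouched black-box system.
    return translate_black_box(simplified, src_lang, tgt_lang)
```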

Cited by 21 publications (14 citation statements)
References 16 publications
“…Similar to machine translation, back-translation is used to improve the performance of neural SS methods (Katsuta and Yamamoto, 2019; Palmero Aprosio et al., 2019; Qiang and Wu, 2021). Mehta et al. (2020) trained a paraphrasing model on a paraphrase corpus generated through back-translation; the model is used to preprocess source sentences of low-resource language pairs before they are fed into the NMT system.…”
Section: Paraphrase Mining (mentioning)
confidence: 99%
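As a companion to the statement above, here is a minimal sketch of how such a paraphrase corpus could be generated by round-tripping in-domain sentences through the black-box system. The pairing direction (original sentence as input, round-trip output as the “simplified” target) is an assumption, and `translate_black_box` is the same hypothetical stub as in the sketch under the abstract.

```python
# Minimal sketch: build (input, target) training pairs for a simplification
# model via round-trip back-translation. Hypothetical names throughout.
from typing import Iterable, List, Tuple

def translate_black_box(text: str, src: str, tgt: str) -> str:
    """Stand-in for the unmodifiable black-box MT API (assumption)."""
    raise NotImplementedError("call the MT provider here")

def build_paraphrase_corpus(sentences: Iterable[str], src_lang: str,
                            pivot_lang: str) -> List[Tuple[str, str]]:
    pairs = []
    for original in sentences:
        # Round trip: source -> pivot language -> back to source language.
        pivot = translate_black_box(original, src_lang, pivot_lang)
        round_trip = translate_black_box(pivot, pivot_lang, src_lang)
        # MT output tends to be simpler and more normalized, so the round-trip
        # sentence serves as the "simplified" target (assumed direction).
        pairs.append((original, round_trip))
    return pairs
```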
“…While the reordering approach has generally proven effective for SMT, its effectiveness for NMT is not obvious; negative effects have even been reported (Zhu, 2015; Du and Way, 2017). In recent years, techniques of automatic text simplification have been applied to improve NMT outputs (Štajner and Popović, 2018; Mehta et al., 2020). The underlying assumption of these studies is that simpler sentences are more machine translatable.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, the feasibility and possibility of pre-editing for neural MT (NMT) has not been examined extensively. While efforts have recently been invested in the implementation of pre-editing strategies for black-box NMT settings, achieving improved MT quality (e.g., Hiraoka and Yamada, 2019; Mehta et al., 2020), the potential gains of pre-editing remain unexplored. Notably, the impact of pre-editing on black-box MT is unpredictable in nature.…”
Section: Introduction (mentioning)
confidence: 99%
“…These experiments relate to a large body of work that considers how preprocessing methods affect the downstream accuracy of various algorithms, ranging from topics in information retrieval (Chaudhari et al., 2015; Patil and Atique, 2013; Beil et al., 2002), text classification and regression (Forman, 2003; Yang and Pedersen, 1997; Vijayarani et al., 2015; Kumar and Harish, 2018; HaCohen-Kerner et al., 2020; Symeonidis et al., 2018; Weller et al., 2020), topic modeling (Blei et al., 2003; Lund et al., 2019; Schofield and Mimno, 2016; Schofield et al., 2017a,b), and even more complex tasks like question answering (Jijkoun et al., 2003; Carvalho et al., 2007) and machine translation (Habash, 2007; Habash and Sadat, 2006; Leusch et al., 2005; Weller et al., 2021; Mehta et al., 2020), to name a few. With the rise of noisy social media, text preprocessing has become important for tasks that use data from sources like Twitter and Reddit (Symeonidis et al., 2018; Singh and Kumari, 2016; Bao et al., 2014; Jianqiang, 2015; Weller and Seppi, 2020; Zirikly et al., 2019; Babanejad et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%