Abstract: Neural Machine Translation (NMT) models have demonstrated state-of-the-art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that contain errors of various types. In particular, in long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), NMT models must handle errors in phoneme substitution, grammatical structure, and sentence boundaries, all…
“…The ASR WER on the test sentences is 9.0%. […] approach in (Li et al., 2021). According to Table 5, our results yielded a BLEU score of 27.1, which is similar to the score of 27.0 reported in Table 4 of that paper, representing their best result from training with synthetic segment breaks.…”
Section: IWSLT Results (supporting)
confidence: 83%
“…Finally, we train on (projected-human-source, projected-gold-translation) pairs. This is similar to how artificial target sentences were constructed by Li et al. (2021), but in our case, the boundaries are determined by automatic punctuation on ASR output, rather than from introducing boundary errors at random.…”
Section: Gold De… (mentioning)
confidence: 72%
“…Since these segments need not match the reference sentence boundaries, especially when punctuation is derived automatically on ASR output, we use our Levenshtein alignment as described in Section 3 to align our translation output with the gold-standard translation's segments before evaluating quality with case-sensitive BLEU (Matusov et al., 2005). All models are trained and tested on lowercased and unpunctuated versions of the source, as doing so is known to improve robustness to ASR output (Li et al., 2021).…”
Section: Data (mentioning)
confidence: 99%
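As a concrete illustration of that evaluation step, a Levenshtein-based re-segmentation in the spirit of Matusov et al. (2005) can be sketched as below. The function names, the backtrace tie-breaking, and the choice of cut point are illustrative assumptions, not the paper's actual implementation:

```python
import itertools

def levenshtein_boundary_map(hyp, ref):
    """Edit-distance alignment of two token lists; returns, for each
    reference prefix length j, the hypothesis prefix length it aligns to."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # hyp token deleted
                           dp[i][j - 1] + 1,          # ref token inserted
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace one optimal path, recording where each ref prefix maps.
    align = [None] * (m + 1)
    i, j = n, m
    while i > 0 or j > 0:
        if align[j] is None:
            align[j] = i
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    align[0] = 0
    return align

def resegment(hyp_tokens, ref_segments):
    """Cut the hypothesis token stream at the positions aligned to the
    reference segment boundaries, so segment-level BLEU can be computed."""
    ref_tokens = list(itertools.chain.from_iterable(ref_segments))
    align = levenshtein_boundary_map(hyp_tokens, ref_tokens)
    cuts, end = [0], 0
    for seg in ref_segments:
        end += len(seg)
        cuts.append(align[end])
    return [hyp_tokens[a:b] for a, b in zip(cuts, cuts[1:])]
```

After re-segmentation, each hypothesis segment lines up with one reference segment, so standard segment-level BLEU applies even when the system's output segmentation disagrees with the reference.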
“…We consider a long-form scenario where sentence boundaries for the input audio are not given at test time. As such, the method of Li et al. (2021) to make MT robust to segment boundary errors is very relevant. They introduce artificial sentence boundary errors in their training bitext.…”
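The artificial boundary-error idea described in that snippet could be implemented along these lines. The shifting scheme, the proportional target shift, and all parameter names are illustrative assumptions rather than the exact recipe of Li et al. (2021):

```python
import random

def perturb_boundaries(src_sents, tgt_sents, shift_prob=0.5, max_shift=2, seed=0):
    """Shift sentence boundaries in parallel text to simulate
    segmentation errors. With probability shift_prob, the last k source
    tokens of a sentence are moved across the boundary into the next
    sentence, and the target boundary is moved by a length-proportional
    number of tokens so the pairs stay roughly parallel."""
    rng = random.Random(seed)
    src = [s.split() for s in src_sents]
    tgt = [t.split() for t in tgt_sents]
    for i in range(len(src) - 1):
        if rng.random() >= shift_prob:
            continue
        k = rng.randint(1, max_shift)
        if len(src[i]) <= k:
            continue
        r = max(1, round(k * len(tgt[i]) / len(src[i])))  # proportional target shift
        if len(tgt[i]) <= r:
            continue
        src[i], src[i + 1] = src[i][:-k], src[i][-k:] + src[i + 1]
        tgt[i], tgt[i + 1] = tgt[i][:-r], tgt[i][-r:] + tgt[i + 1]
    return [" ".join(s) for s in src], [" ".join(t) for t in tgt]
```

Training on such perturbed pairs exposes the model to mis-segmented input, which is the mismatch it will face when segment boundaries come from an automatic system at test time.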
Traditional translation systems trained on written documents perform well for text-based translation but not as well for speech-based applications. We aim to adapt translation models to speech by introducing actual lexical errors from ASR and segmentation errors from automatic punctuation into our translation training data. We introduce an inverted projection approach that projects automatically detected system segments onto human transcripts and then re-segments the gold translations to align with the projected human transcripts. We demonstrate that this overcomes the train-test mismatch present in other training approaches. The new projection approach achieves gains of over 1 BLEU point over a baseline that is exposed to the human transcripts and segmentations, and these gains hold for both IWSLT data and YouTube data.
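The boundary-projection step of the inverted projection approach can be sketched as follows: automatic segment boundaries detected on the ASR transcript are mapped onto the human transcript via a token-level alignment. Here `difflib`'s matching blocks stand in for the Levenshtein alignment; the function names and the fallback rule for boundaries that land in unaligned (mis-recognized) regions are illustrative assumptions:

```python
import difflib

def project_boundaries(asr_tokens, human_tokens, asr_boundaries):
    """Map boundary positions (token indices) in the ASR transcript to
    the corresponding positions in the human transcript."""
    sm = difflib.SequenceMatcher(a=asr_tokens, b=human_tokens, autojunk=False)
    # Build a monotone map from ASR token index to human token index
    # using the matching blocks of the alignment.
    index_map = {}
    for block in sm.get_matching_blocks():
        for offset in range(block.size + 1):
            index_map[block.a + offset] = block.b + offset
    projected = []
    for b in asr_boundaries:
        # Fall back to the nearest earlier mapped index when b sits
        # inside an unaligned (mis-recognized) region.
        while b not in index_map and b > 0:
            b -= 1
        projected.append(index_map.get(b, 0))
    return projected

def segment(tokens, boundaries):
    """Split a token list at the given boundary indices."""
    cuts = [0] + sorted(boundaries) + [len(tokens)]
    return [tokens[i:j] for i, j in zip(cuts, cuts[1:])]
```

The projected boundaries re-segment the human transcript to match the system segmentation; the gold translations can then be re-segmented against those projected source segments to produce the (projected-human-source, projected-gold-translation) training pairs mentioned above.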
“…Peng et al. (2020) propose dictionary-based DA (DDA) for cross-domain NMT by synthesizing a domain-specific dictionary and automatically generating a pseudo in-domain parallel corpus. Li et al. (2020a) present a DA method using sentence boundary segmentation to improve the robustness of NMT on ASR transcripts. Nishimura et al. (2018) introduce DA methods for multi-source NMT that fill in incomplete portions of multi-source training data.…”
Section: Appendices A: Useful Blog Posts and Code Repositories (mentioning)
Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP.
The Bahnar, a minority ethnic group in Vietnam with ancient roots, hold a language of deep cultural and historical significance. The government is prioritizing the preservation and dissemination of Bahnar language through online availability and cross-generational communication. Recent AI advances, including Neural Machine Translation (NMT), have transformed translation with improved accuracy and fluency, fostering language revitalization through learning, communication, and documentation. In particular, NMT enhances accessibility for Bahnar language speakers, making information and content more available.
However, translating Vietnamese to Bahnar faces practical hurdles due to resource limitations, as Bahnar is an extremely low-resource language. These challenges encompass data scarcity, vocabulary constraints, and a lack of fine-tuning data. To address them, we propose transfer learning from selected pre-trained models to optimize translation quality and computational efficiency, capitalizing on linguistic similarities between Vietnamese and Bahnar. Concurrently, we apply tailored augmentation strategies to adapt machine translation to the Vietnamese-Bahnar context. Our approach is validated by superior results on bilingual Vietnamese-Bahnar datasets compared to baseline models. By tackling these translation challenges, we help revitalize the Bahnar language, ensuring information flows freely and the language thrives.