2020
DOI: 10.48550/arxiv.2004.10171
Preprint

Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation

Cited by 6 publications (4 citation statements)
References 26 publications
“…The explicated Knowledge Distillation framework has shown its efficiency in a tremendous number of tasks, such as Neural Machine Translation (Tan et al. 2019; Wang et al. 2021; Li and Li 2021; Sun et al. 2020), Question Answering (Hu et al. 2018; Arora, Khapra, and Ramaswamy 2019; Yang et al. 2020b), Image Classification (Yang et al. 2020a; Chen, Chang, and Lee 2018; Fu et al. 2020), etc. Nonetheless, its application for Neural Cross-Lingual Summarization has received little interest.…”
Section: Background: Neural Cross-Lingual Summarization
Citation type: mentioning (confidence: 99%)
“…While the focus was originally on single-label image classification, KD has also been extended to the multi-label setting (Liu et al., 2018b). In NLP, KD has usually been applied in supervised settings (Kim and Rush, 2016; Huang et al., 2018; Yang et al., 2020), but also in some unsupervised tasks (usually using an unsupervised teacher for a supervised student) (Sun et al., 2020). Xu et al. (2018) use word embeddings jointly learned with a topic model in a procedure they term distillation, but do not follow the method from Hinton et al. (2015) that we employ (instead opting for joint-learning).…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
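For readers unfamiliar with the Hinton et al. (2015) formulation referenced in the excerpt above, the following is a minimal sketch of the soft-label distillation objective. The distillation_loss function, the temperature of 2.0, and the mixing weight alpha are illustrative assumptions for this sketch, not details taken from the cited papers.

# Minimal sketch of the soft-label knowledge distillation objective from
# Hinton et al. (2015). Function name, temperature, and alpha are
# illustrative assumptions, not drawn from the cited papers.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label KL term (teacher -> student) with the usual
    hard-label cross-entropy, weighted by alpha."""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 as in Hinton et al.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the gold labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example usage with random logits over a 10-class output space.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))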
“…• We do not have any parallel data among any of the language pairs, as considered in (Liu et al., 2020; Sun et al., 2020).…”
Section: Terminology
Citation type: mentioning (confidence: 99%)