Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019
DOI: 10.18653/v1/w19-4309
Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

Abstract: We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation. We train a multilayer perceptron on top of the sentence embeddings to extract good bilingual sentence pairs from nonparallel or noisy parallel data. Our …
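The abstract's second stage, an MLP scoring sentence pairs over the shared bilingual embedding space, can be sketched as follows. This is a hypothetical illustration, not the authors' code: the pair features, dimensions, and layer sizes are all assumptions.

```python
# Hypothetical sketch: an MLP that scores a (source, target) sentence-embedding
# pair as parallel vs. non-parallel, as the abstract describes. Feature
# construction and sizes are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    def __init__(self, emb_dim: int = 512, hidden: int = 256):
        super().__init__()
        # Common pair features: both embeddings plus elementwise interactions
        # (an assumption; the paper may use a different feature set).
        self.mlp = nn.Sequential(
            nn.Linear(4 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([src, tgt, src * tgt, (src - tgt).abs()], dim=-1)
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)  # P(pair is parallel)

# Usage: score candidate pairs and keep those above a threshold.
clf = PairClassifier()
src_emb = torch.randn(8, 512)  # embeddings from the shared bilingual space
tgt_emb = torch.randn(8, 512)
keep = clf(src_emb, tgt_emb) > 0.5
```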

Cited by 3 publications (5 citation statements) · References 29 publications
“…Recently, bilingual sentence or word embeddings have been used to calculate the similarity of a sentence pair [1,3,4,7,8,13,14,23,25,27,32]. [18] built a bilingual representation of a sentence by averaging pre-trained bilingual word embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
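The averaging approach attributed to [18] in this snippet can be illustrated with a short sketch; the `bilingual_vecs` lookup table and the 300-dimensional vectors are assumed stand-ins for real pre-trained bilingual word embeddings.

```python
# Hypothetical sketch of the averaging approach: a sentence embedding as the
# mean of pre-trained bilingual word vectors. The lookup table below is a
# random stand-in, not a real pre-trained model.
import numpy as np

bilingual_vecs = {"hello": np.random.rand(300), "world": np.random.rand(300)}

def sentence_embedding(tokens: list[str]) -> np.ndarray:
    vecs = [bilingual_vecs[t] for t in tokens if t in bilingual_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

print(sentence_embedding(["hello", "world"]).shape)  # (300,)
```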
“…This problem may also appear in other distant language pairs, such as English-Japanese, English-Korean, and so on. To address this problem, [23] learned bilingual sentence embeddings from a combination of parallel and monolingual data. They then connected autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation.…”
Section: Introduction (mentioning)
confidence: 99%
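A self-contained sketch of the training signal this snippet describes, under several assumptions (toy GRU encoders, a single shared decoder, random token batches): because one decoder consumes embeddings from both encoders, the translation and autoencoding losses pull the two embedding spaces together without a pivot language or an extra transformation. This illustrates the idea, not the authors' implementation.

```python
# Minimal sketch (assumptions throughout: toy vocab/hidden sizes, GRU modules)
# of tying two encoders to one space via a shared decoder.
import torch
import torch.nn as nn

V, E, H = 1000, 64, 128  # toy vocab / embedding / hidden sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb, self.rnn = nn.Embedding(V, E), nn.GRU(E, H, batch_first=True)
    def forward(self, x):                  # x: (batch, seq)
        _, h = self.rnn(self.emb(x))
        return h[-1]                       # fixed-size sentence embedding

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, E)
        self.rnn = nn.GRU(E, H, batch_first=True)
        self.out = nn.Linear(H, V)
    def nll(self, z, y):                   # condition on sentence embedding z
        o, _ = self.rnn(self.emb(y[:, :-1]), z.unsqueeze(0))
        return nn.functional.cross_entropy(
            self.out(o).reshape(-1, V), y[:, 1:].reshape(-1))

enc_src, enc_tgt, dec = Encoder(), Encoder(), Decoder()  # decoder is shared
src = torch.randint(0, V, (4, 10))   # parallel source batch
tgt = torch.randint(0, V, (4, 10))   # parallel target batch
mono = torch.randint(0, V, (4, 10))  # monolingual target batch
# NMT loss (src embedding -> tgt) + autoencoding loss (tgt embedding -> tgt):
loss = dec.nll(enc_src(src), tgt) + dec.nll(enc_tgt(mono), mono)
loss.backward()  # gradients push both encoders toward one shared space
```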
“…Currently, most sentence similarity studies focus on monolingual settings and have achieved good results, and cross-lingual sentence similarity has also performed well on some high-resource languages [14]. In Xinjiang, by contrast, cross-lingual sentence similarity research on low-resource languages, mainly Uyghur, still needs more attention.…”
Section: Introduction (mentioning)
confidence: 99%
“…Cross-lingual sentence representation models (Schwenk and Douze, 2017; España-Bonet et al., 2017; Yu et al., 2018; Devlin et al., 2019; Chidambaram et al., 2019; Artetxe and Schwenk, 2019b; Kim et al., 2019; Sabet et al., 2019; Conneau and Lample, 2019; Feng et al., 2020; Li and Mak, 2020 [https://github.com/Mao-KU/lightweight-crosslingual-sent2vec]) learn language-agnostic representations facilitating tasks like cross-lingual sentence retrieval (XSR) and cross-lingual knowledge transfer on downstream tasks without the need for training a new monolingual representation model from scratch. Thus, such models benefit from an increased amount of data during training and lead to improved performance for low-resource languages.…”
Section: Introduction (mentioning)
confidence: 99%
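Cross-lingual sentence retrieval (XSR), mentioned in this snippet, reduces to nearest-neighbor search in the shared embedding space. A minimal illustration with random stand-in matrices in place of real model output:

```python
# Illustrative XSR step under assumed inputs: given language-agnostic
# embeddings, retrieve for each source sentence its nearest target sentence
# by cosine similarity.
import numpy as np

src = np.random.rand(5, 512)   # source-language sentence embeddings
tgt = np.random.rand(7, 512)   # target-language sentence embeddings

src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
nearest = (src_n @ tgt_n.T).argmax(axis=1)  # best target index per source
print(nearest)
```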