Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.152

Parallel Sentence Mining by Constrained Decoding

Abstract: We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to ot…
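As a rough illustration of the method described in the abstract, the sketch below builds a prefix trie over the tokenized target-side corpus and decodes while only allowing continuations that stay on a trie path, so every finished hypothesis is an actual target sentence whose model score can act as a similarity signal. This is a minimal sketch, not the authors' implementation: decoding is greedy rather than the paper's modified beam search, and `score_fn` is a hypothetical stand-in for the NMT model's next-token log-probabilities.

```python
# Minimal sketch (not the authors' code) of trie-constrained decoding for
# parallel sentence mining: the decoder may only emit tokens that keep the
# hypothesis on a path of the target-corpus prefix trie.
import math

END = "</s>"  # sentence-final marker

def build_trie(sentences):
    """Nested-dict prefix trie over tokenized target sentences."""
    trie = {}
    for tokens in sentences:
        node = trie
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return trie

def constrained_greedy_decode(score_fn, trie, max_len=50):
    """Greedy decoding restricted to trie paths.

    `score_fn(prefix, candidates)` is a hypothetical hook returning
    {token: log_prob} from the translation model for the allowed tokens.
    Returns (tokens, total_log_prob): the matched target sentence and a
    model score usable as a parallelism signal.
    """
    node, prefix, total = trie, [], 0.0
    for _ in range(max_len):
        allowed = list(node.keys())
        if not allowed:
            break
        scores = score_fn(prefix, allowed)
        tok = max(allowed, key=lambda t: scores.get(t, -math.inf))
        total += scores.get(tok, -math.inf)
        if tok == END:
            return prefix, total
        prefix.append(tok)
        node = node[tok]
    return prefix, total

if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
    trie = build_trie(corpus)

    def toy_score_fn(prefix, candidates):
        # Stand-in for NMT log-probabilities: mildly prefer "cat"/"sat".
        prefs = {"cat": -0.1, "sat": -0.1, END: -0.2}
        return {c: prefs.get(c, -1.0) for c in candidates}

    print(constrained_greedy_decode(toy_score_fn, trie))
```

Under the abstract's formulation, the model is conditioned on a source-side sentence and the score of the best constrained hypothesis then serves as the sentence-similarity measure.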

Cited by 14 publications (8 citation statements) | References 21 publications
“…They obtained more than 10 million parallel sentences in 112 languages. [27] leveraged a machine translation model to derive sentence representations. Moreover, [14] demonstrated that a shared word-embedding space works well for cross-lingual NLP applications via transfer learning.…”
Section: Supervised Parallel Sentences Mining
confidence: 99%
“…To exploit the event schema knowledge, we propose to employ a trie-based constrained decoding algorithm (Chen et al., 2020a; Cao et al., 2021) for event generation. During constrained decoding, the event schema knowledge is injected as the prompt of the decoder and ensures the generation of valid event structures.…”
Section: Constrained Decoding
confidence: 99%
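The mechanism described in the statement above (restricting the decoder so that only schema-valid outputs can be produced) boils down to one per-step operation: given the tokens generated so far, return the set of tokens allowed next. The hedged sketch below illustrates that operation only; the trie, token strings, and templates are invented for the example and are not taken from the cited papers.

```python
# Hedged illustration of a per-step trie constraint: walk the trie along
# the generated prefix and expose only the children of the reached node.

def allowed_next(trie, prefix):
    """Return the set of tokens permitted after `prefix`."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()  # prefix has left the space of valid outputs
        node = node[tok]
    return set(node.keys())

# Tiny trie over two made-up linearized "event record" templates.
templates = [
    ["<event>", "Attack", "<arg>", "attacker", "</event>"],
    ["<event>", "Transfer", "<arg>", "giver", "</event>"],
]
trie = {}
for seq in templates:
    node = trie
    for tok in seq:
        node = node.setdefault(tok, {})

print(allowed_next(trie, ["<event>"]))            # {'Attack', 'Transfer'}
print(allowed_next(trie, ["<event>", "Attack"]))  # {'<arg>'}
```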
“…Like TEXT2EVENT in this paper, TANL (Paolini et al., 2021) and GRIT (Du et al., 2021) also employ neural generation models for event extraction, but they focus on sequence generation rather than structure generation. Different from previous works that extract text spans via labeling (Straková et al., 2019) or a copy/pointer mechanism (Zeng et al., 2018; Du et al., 2021), TEXT2EVENT directly generates event schemas and text spans to form event records via constrained decoding (Cao et al., 2021; Chen et al., 2020a), which allows TEXT2EVENT to handle various event types and transfer to new types easily.…”
Section: Related Work
confidence: 99%
“…Agrawal et al (2021) investigate alternative techniques to estimate direct translation probability for reference-free quality estimation. In the context of parallel corpus filtering (Junczys-Dowmunt, 2018), Chen et al (2020) propose trie-constrained decoding to improve the efficiency of pairwise comparisons. Future work could apply their method to the other translation-based measures.…”
Section: Related Work
confidence: 99%