Multiword expressions (MWEs) are known as a "pain in the neck" for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one's heart or to turn off, have rarely been modelled. This is notably due to their syntactic variability, which hinders treating them as "words with spaces". We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.
The paper presents an approach to morphological compound splitting that takes the degree of compositionality into account. We apply our approach to German noun compounds and particle verbs within a German-English SMT system, and study the effect of splitting only compositional compounds as opposed to aggressive splitting. A qualitative study explores the translational behaviour of non-compositional compounds.
Compounding in morphologically rich languages is a highly productive process which often causes SMT approaches to fail because of unseen words. We present an approach for translation into a compounding language that splits compounds into simple words for training and, due to an underspecified representation, allows for free merging of simple words into compounds after translation. In contrast to previous approaches, we use features projected from the source language to predict compound mergings. We integrate our approach into end-to-end SMT and show that many compounds matching the reference translation are produced which did not appear in the training data. Additional manual evaluations support the usefulness of generalizing compound formation in SMT.
Support-verb constructions (i.e., multiword expressions combining a semantically light verb with a predicative noun) are problematic for standard statistical machine translation systems, because SMT systems cannot distinguish between literal and idiomatic uses of the verb. We work on the German to English translation direction, for which the identification of support-verb constructions is challenging due to the relatively free word order of German. We show that we achieve improved translation quality for verb-object support-verb constructions by marking the verbs when they occur in such constructions. Additional evaluations revealed that our systems produce more correct verb translations than a contrastive baseline system without verb markup.
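The verb markup idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon `SVC_PAIRS`, the function `mark_support_verbs` and the `_SVC` suffix are all hypothetical, and the sketch assumes lemmas are already available and that a support-verb construction is signalled simply by a listed verb lemma and noun lemma occurring anywhere in the same sentence (which sidesteps German word order, but ignores syntax entirely).

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical toy lexicon of (verb lemma, noun lemma) support-verb pairs.
# A real system would derive such pairs from a parsed corpus or a lexicon.
SVC_PAIRS: Set[Tuple[str, str]] = {
    ("treffen", "Entscheidung"),  # "eine Entscheidung treffen" = to take a decision
    ("stellen", "Frage"),         # "eine Frage stellen" = to ask a question
}


def mark_support_verbs(
    tokens: List[str],
    lemmas: List[str],
    pairs: Iterable[Tuple[str, str]] = SVC_PAIRS,
    suffix: str = "_SVC",
) -> List[str]:
    """Append `suffix` to every verb token whose lemma forms a listed pair
    with a noun lemma found anywhere in the same sentence.

    Assumption: sentence-level co-occurrence stands in for a real
    verb-object dependency check.
    """
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must have the same length")
    pair_list = list(pairs)
    lemma_set = set(lemmas)
    marked = []
    for token, lemma in zip(tokens, lemmas):
        is_support = any(v == lemma and n in lemma_set for v, n in pair_list)
        marked.append(token + suffix if is_support else token)
    return marked
```

Under these assumptions, the verb in "Sie trifft eine Entscheidung" would be marked, while the same verb in "Sie trifft einen Freund" would be left untouched, so a translation model trained on the marked text sees the two uses as distinct tokens.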
This paper summarises the contributions of the teams at the University of Helsinki, Uppsala University and the University of Turku to the news translation tasks for translating from and to Finnish. Our models address the problem of treating morphology and data coverage in various ways. We introduce a new efficient tool for word alignment and discuss factorisations, gappy language models and reinflection techniques for generating proper Finnish output. The results demonstrate once again that training data is the most effective way to increase translation performance.