This year the workshop included a shared task to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best performing approaches. 13 runs were submitted in time to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The datasets are available on the workshop Web page at https://comparable.limsi.fr/bucc2017/bucc2017-task.html.
Abstract

Despite numerous studies devoted to mining parallel material from bilingual data, we have yet to see the resulting technologies wholeheartedly adopted by professional translators and terminologists alike. I argue that this state of affairs is mainly due to two factors: the emphasis published authors put on models (even though data is just as important), and the conspicuous lack of concern for actual end-users.
Introduction

Parallel corpora (collections of documents that are translations of one another) are the bread and butter of machine translation (MT). Solutions have been proposed for mining parallel texts found on the Web (Chen and Nie, 2000; Resnik and Smith, 2003), and for aligning sentences in parallel documents (Gale and Church, 1993), leading to so-called "bitexts". It then becomes possible to align words in parallel sentence pairs in an unsupervised way (Brown et al., 1993).

Because parallel data is relatively rare, researchers have turned to exploiting comparable corpora, e.g. news articles in different languages covering the same event. Sharoff et al. (2013) thoroughly examine this topic. It is noteworthy that researchers know quite well how to identify parallel sentences in a comparable corpus, and can then use "tried and true" procedures for extracting bilingual lexicons from such a resource (Rapp, 1995; Fung, 1995; Mikolov et al., 2013).

Being able to benefit from both parallel and comparable data is quite an accomplishment from a scientific point of view, and progress is still being made on the task. In contrast, and frustratingly, the technologies that professional translators are adopting continue to rely mainly on sentence-based translation memories. I do not mean to say that other technologies are not being used. For instance, translation agencies are increasingly integrating machine translation into their workflow, but this is mostly driven by cost reduction, not by a genuine interest in MT on the part of translators, who remain unconvinced.

I submit that this limited adoption of new resources and technologies is due to the conjunction of two factors: the overall lack of concern for actual users, and the clear preference of the research community for the study of models at the cost of research on data. Of course, improvements on models have the potential to impact users.
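To make the bilingual lexicon extraction line of work mentioned above concrete, here is a minimal sketch in the spirit of the embedding-mapping approach of Mikolov et al. (2013): fit a linear map from a source-language embedding space to a target-language one using a small seed dictionary, then translate a word by nearest-neighbour search in the mapped space. All words and vectors below are toy data invented for the example; real systems use embeddings trained on large monolingual corpora and seed dictionaries of thousands of pairs.

```python
import numpy as np

# Toy monolingual embedding spaces (hypothetical 3-d vectors).
src = {"chien":  np.array([0.9, 0.1, 0.0]),
       "chat":   np.array([0.1, 0.9, 0.0]),
       "maison": np.array([0.0, 0.1, 0.9])}
tgt = {"dog":    np.array([0.8, 0.2, 0.1]),
       "cat":    np.array([0.2, 0.8, 0.1]),
       "house":  np.array([0.1, 0.2, 0.8])}

# Small seed dictionary of known translation pairs.
seed = [("chien", "dog"), ("chat", "cat"), ("maison", "house")]

# Fit a linear map W by least squares so that src_vector @ W ~ tgt_vector.
X = np.stack([src[s] for s, _ in seed])
Y = np.stack([tgt[t] for _, t in seed])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word):
    """Map a source word into the target space and return the
    nearest target word by cosine similarity."""
    v = src[word] @ W
    return max(tgt, key=lambda t: np.dot(v, tgt[t]) /
               (np.linalg.norm(v) * np.linalg.norm(tgt[t])))

print(translate("chien"))  # recovers "dog" on this toy data
```

On such a tiny exact-fit example the mapping is trivially perfect; the point is only to show the shape of the technique, not its real-world accuracy, which depends heavily on the quality of the monolingual data, precisely the kind of data question the rest of this paper argues is under-studied.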
Notably, recent studies (Bentivogli et al., 2016; Isabelle et al., 2017) confirm that neural MT (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 201...