This year the workshop included a shared task to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best performing approaches. 13 runs were submitted in time to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The datasets are available on the workshop Web page at https://comparable.limsi.fr/bucc2017/bucc2017-task.html.
Abstract

Despite numerous studies devoted to mining parallel material from bilingual data, we have yet to see the resulting technologies wholeheartedly adopted by professional translators and terminologists alike. I argue that this state of affairs is mainly due to two factors: the emphasis published authors put on models (even though data is just as important), and the conspicuous lack of concern for actual end-users.
Introduction

Parallel corpora (collections of documents that are translations of one another) are the bread and butter of machine translation (MT). Solutions have been proposed for mining parallel texts found on the Web (Chen and Nie, 2000; Resnik and Smith, 2003), and for aligning sentences in parallel documents (Gale and Church, 1993), leading to so-called "bitexts". It then becomes possible to align words in parallel sentence pairs in an unsupervised way (Brown et al., 1993).

Because parallel data is relatively rare, researchers have turned to exploiting comparable corpora, e.g. news articles in different languages covering the same event. Sharoff et al. (2013) thoroughly examine this topic. It is noteworthy that researchers know quite well how to identify parallel sentences in a comparable corpus, and can then use "tried and true" procedures for extracting bilingual lexicons from such a resource (Rapp, 1995; Fung, 1995; Mikolov et al., 2013).

Being able to benefit from both parallel and comparable data is quite an accomplishment from a scientific point of view, and progress is still being made on the task. In contrast, and frustratingly, the technologies that professional translators are adopting continue to rely mainly on sentence-based translation memories. I do not mean to say that other technologies are not being used. For instance, translation agencies are increasingly integrating machine translation into their workflow, but this is mostly driven by cost reduction, not by a genuine interest in MT on the part of translators, who remain unconvinced.

I submit that this limited adoption of new resources and technologies is due to the conjunction of two factors: the overall lack of concern for actual users, and the clear preference of the research community for the study of models at the cost of research on data. Of course, improvements on models have the potential to impact users.
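To make the bilingual lexicon extraction line of work mentioned above concrete, here is a minimal sketch in the spirit of the embedding-mapping approach of Mikolov et al. (2013): fit a linear map from a source-language embedding space to a target-language one using a small seed dictionary, then translate a word by nearest-neighbour search in the mapped space. All words and vectors below are toy data invented for the example; real systems use embeddings trained on large monolingual corpora and seed dictionaries of thousands of pairs.

```python
import numpy as np

# Toy monolingual embedding spaces (hypothetical 3-d vectors).
src = {"chien":  np.array([0.9, 0.1, 0.0]),
       "chat":   np.array([0.1, 0.9, 0.0]),
       "maison": np.array([0.0, 0.1, 0.9])}
tgt = {"dog":    np.array([0.8, 0.2, 0.1]),
       "cat":    np.array([0.2, 0.8, 0.1]),
       "house":  np.array([0.1, 0.2, 0.8])}

# Small seed dictionary of known translation pairs.
seed = [("chien", "dog"), ("chat", "cat"), ("maison", "house")]

# Fit a linear map W by least squares so that src_vector @ W ~ tgt_vector.
X = np.stack([src[s] for s, _ in seed])
Y = np.stack([tgt[t] for _, t in seed])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word):
    """Map a source word into the target space and return the
    nearest target word by cosine similarity."""
    v = src[word] @ W
    return max(tgt, key=lambda t: np.dot(v, tgt[t]) /
               (np.linalg.norm(v) * np.linalg.norm(tgt[t])))

print(translate("chien"))  # recovers "dog" on this toy data
```

On such a tiny exact-fit example the mapping is trivially perfect; the point is only to show the shape of the technique, not its real-world accuracy, which depends heavily on the quality of the monolingual data, precisely the kind of data question the rest of this paper argues is under-studied.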
Notably, recent studies (Bentivogli et al., 2016; Isabelle et al., 2017) confirm that neural MT (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 201...