Proceedings of the Eighth Workshop on Building and Using Comparable Corpora 2015
DOI: 10.18653/v1/w15-3411

BUCC Shared Task: Cross-Language Document Similarity

Abstract: We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with interlanguage links, with further filtering to ensure comparability of the paired articles. The eleven system runs we received have been evaluated using the TREC evaluation metrics.

Cited by 12 publications (15 citation statements)
References 3 publications (4 reference statements)
“…In this section, we present some alignment methods of the state-of-the-art, and we experiment them in Section 6. We present the hapax-based method and a dictionary-based (DB) method that have been used in BUCC 2015 shared task campaign (Sharoff et al, 2015). This campaign asked the competitors to retrieve comparable documents from a Wikipedia corpus.…”
Section: Experimented Alignment Methods (mentioning)
confidence: 99%
“…There are two kinds of methods are tested in this paper to reach the objective of aligning documents; the first kind is based on lexical information and the second relies on latent information. We compare these methods on three different multilingual corpora; one is provided by the evaluation campaign: BUCC 2015 (Sharoff, Zweigenbaum, and Rapp, 2015). This corpus is extracted from Wikipedia, the experiments consist in aligning French and English documents.…”
Section: Introduction (mentioning)
confidence: 99%
“…In essence, it uses a bilingual dictionary for converting the word feature vectors between the languages and for estimating their overlap. The other systems are discussed in detail in the proceedings of BUCC'15 (Morin et al 2015; Zafarian et al 2015), and full evaluation results are available there as well (Sharoff, Zweigenbaum and Rapp 2015). The lina system (Morin et al 2015) is based on matching hapax legomena, i.e.…”
Section: Comparison Of Methods Used By Participating Systems (mentioning)
confidence: 99%
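The dictionary-based idea mentioned in the statement above can be illustrated with a minimal sketch. This is not the participating system's implementation: the toy dictionary `fr_en`, the sample documents, the helper names `translate_vector` and `cosine`, and the choice of cosine similarity as the overlap measure are all assumptions made for illustration.

```python
from collections import Counter
import math


def translate_vector(counts, dictionary):
    """Map source-language word counts into the target language.

    Words missing from the (hypothetical) dictionary are dropped;
    a word with several translations contributes to each of them.
    """
    translated = Counter()
    for word, count in counts.items():
        for target in dictionary.get(word, ()):
            translated[target] += count
    return translated


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy French-to-English dictionary and documents (illustrative only).
fr_en = {"chat": ["cat"], "noir": ["black"], "maison": ["house", "home"]}
fr_doc = Counter("le chat noir dort dans la maison".split())
en_docs = {
    "cats": Counter("the black cat sleeps in the house".split()),
    "cars": Counter("the red car is parked in the street".split()),
}

# Translate the French vector, then rank English documents by overlap.
translated = translate_vector(fr_doc, fr_en)
ranked = sorted(
    en_docs, key=lambda name: cosine(translated, en_docs[name]), reverse=True
)
print(ranked)
```

In this toy setting the English document sharing translated words with the French one is ranked first. A real system would presumably need a much larger dictionary and some form of term weighting, neither of which is described in the quoted statement.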
“…Therefore, we endeavored to design and organize shared tasks as companions of the BUCC workshop series on Building and Using Comparable Corpora. The First BUCC Shared Task (Sharoff et al, 2015) tackled the detection of comparable documents across languages. The Second BUCC Shared Task, presented here, addresses the detection of parallel sentences across languages in nonaligned, monolingual corpora.…”
Section: Introduction (mentioning)
confidence: 99%