2009
DOI: 10.1007/978-3-642-03784-9_11

A Two-Level Structure for Compressing Aligned Bitexts

Abstract: A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they are used as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies, and on the use of biwords, a pair of associ…

Citations: cited by 9 publications (16 citation statements)
References: 14 publications
“…The Hutter Prize [12], a competition to compress a 100 m-word extract of English Wikipedia, was designed to further encourage research in text compression. Bilingual and multilingual text compression is a less-studied field [1, 13–18]. These papers provide different algorithms for compressing text in multilingual format, but they do not demonstrate how humans perform on this task.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Another quite realistic assumption is that a source sequence already referred to once during the process will be rarely used again. The search for the next unreferenced sequence is always very local, meaning that it is performed in time O(1).…”
Section: Indices Of
Citation type: mentioning
confidence: 99%
“…The idea of using alignment was raised in [6], and a detailed algorithm was presented in [7]. Different, though basically similar algorithms have recently been suggested in [13,1]. However, all these algorithms relate only to bilingual parallel texts; therefore, should three or more parallel texts be compressed, the algorithms would be applied to each source-target pair of texts independently.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…A biword-based scheme, called 2lcab, is proposed in [1]. 2lcab builds a two-level dictionary in which word and biword representations are stored.…”
Section: The Spanish-english Bitext (La Casa Donde Vivimos the House
Citation type: mentioning
confidence: 99%
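
The statement above says only that 2lcab keeps word and biword representations in a two-level dictionary. As an illustration of that general idea, and not the actual 2lcab layout, here is a minimal Python sketch under stated assumptions: level one is a global vocabulary of source words, level two is a per-source-word list of the target words it was aligned with, and each biword is coded as a pair of indices. All names (`build_two_level`, `encode`, `decode`) and the sample alignment are hypothetical.

```python
from collections import Counter, defaultdict

def build_two_level(biwords):
    """Build a two-level dictionary from (source_word, target_word) pairs.

    Assumed layout (not taken from the paper):
      level 1: source words, ranked by frequency
      level 2: for each source word, its aligned target words, ranked by frequency
    """
    src_freq = Counter(s for s, _ in biwords)
    src_vocab = [w for w, _ in src_freq.most_common()]
    src_id = {w: i for i, w in enumerate(src_vocab)}

    pair_freq = defaultdict(Counter)
    for s, t in biwords:
        pair_freq[s][t] += 1
    trans = {s: [t for t, _ in c.most_common()] for s, c in pair_freq.items()}
    trans_id = {s: {t: j for j, t in enumerate(ts)} for s, ts in trans.items()}
    return src_vocab, trans, src_id, trans_id

def encode(biwords, src_id, trans_id):
    """Map each biword to (source word id, index in that word's translation list)."""
    return [(src_id[s], trans_id[s][t]) for s, t in biwords]

def decode(codes, src_vocab, trans):
    """Invert encode() using the two dictionary levels."""
    return [(src_vocab[i], trans[src_vocab[i]][j]) for i, j in codes]

# Hypothetical word alignment for a tiny Spanish-English bitext.
sample = [("la", "the"), ("casa", "house"), ("donde", "where"),
          ("vivimos", "we live"), ("la", "the"), ("casa", "home")]
src_vocab, trans, src_id, trans_id = build_two_level(sample)
codes = encode(sample, src_id, trans_id)
restored = decode(codes, src_vocab, trans)
```

Because a source word usually has only a few distinct translations, the second index of each pair tends to stay small, which is what would make such codes cheap to store with a variable-length coder. The coder itself is outside this sketch.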
“…Table 2 shows compression and decompression times, respectively; word alignment times are not taken into account. Due to lack of space we only show 2lcab and trc 1 as booster times; however, all the times obtained by using 1v and 2v as boosters are similar to those of trc 1. Results show that the techniques we propose are very fast at compression and, mainly, at decompression.…”
Section: First-order Model On Translations Relationships
Citation type: mentioning
confidence: 99%