2012
DOI: 10.1613/jair.3500

Generalized Biwords for Bitext Compression and Translation Spotting

Abstract: Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords (pairs of parallel words with a high probability of co-occurrence) that can be used as an intermediate representation in the compression process. However, the simple biword approach described in the literature can …

Cited by 10 publications (11 citation statements)
References 52 publications

“…Portions of this large corpus have been used in previous compression work (Sánchez-Martínez et al, 2012). The Spanish side is in UTF-8.…”
Section: Data (mentioning)
confidence: 99%
“…Sánchez-Martínez et al (2012) improve the interleaving scheme and include offsets to enable decompression to reconstruct the original word order. They also compare several character-based and word-based compression schemes for biword sequences.…”
Section: Bilingual Compression: Prior Work (mentioning)
confidence: 99%
“…The Hutter Prize [12], a competition to compress a 100 m-word extract of English Wikipedia, was designed to further encourage research in text compression. Bilingual and multilingual text compression is a less-studied field [1, 13-18]. These papers provide different algorithms for compressing text in multilingual format, but they do not demonstrate how humans perform on this task.…”
Section: Related Work (mentioning)
confidence: 99%
“…The most recent work is Sanchez-Martinez et al (2012) who propose to use "biwords" to compress parallel data sequentially. Similar as in Conley and Klein (2008), translational relations and Huffman coding are employed to take advantage of the improved entropy properties of the encoded data.…”
Section: Compression Of Parallel Corpora (mentioning)
confidence: 99%