Generalized Biwords for Bitext Compression and Translation Spotting

Sánchez-Martínez, Felipe; Carrasco, Rafael C.; Martínez‐Prieto, Miguel A.; Adiego, Joaquín

doi:10.1613/jair.3500

Cited by 10 publications

(11 citation statements)

References 52 publications

“…Portions of this large corpus have been used in previous compression work (Sánchez-Martínez et al, 2012). The Spanish side is in UTF-8.…”

Section: Data (mentioning)

confidence: 99%

“…Sánchez-Martínez et al (2012) improve the interleaving scheme and include offsets to enable decompression to reconstruct the original word order. They also compare several character-based and word-based compression schemes for biword sequences.…”

Section: Bilingual Compression: Prior Work (mentioning)

confidence: 99%

“…If we are predicting the jth English word, and we know that it translates f_i ("aligns to f_i"), and if f_i has only a handful of translations, then we may be able to specify e_j with just a few bits. We may therefore suppose that a set of Viterbi word alignments may be useful for compression (Conley and Klein, 2008; Sánchez-Martínez et al, 2012). We consider unidirectional alignments that link each target position j to a single source position i (including the null word at i = 0).…”

Section: Word Alignment (mentioning)

confidence: 99%
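
The intuition in the Word Alignment statement, that knowing the aligned source word shrinks the set of candidate target words, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy translation table, the function names, and the use of a fixed-length code over the candidates. It is not the method of any of the cited papers, which use probabilistic models rather than uniform codes.

```python
import math

# Hypothetical toy translation table (source word -> candidate target words).
# These entries are invented for illustration only.
TRANSLATIONS = {
    "casa": ["house", "home"],
    "verde": ["green"],
    "banco": ["bank", "bench", "shoal", "pew"],
}


def bits_given_alignment(source_word, table):
    """Bits needed by a fixed-length code over the translations of source_word."""
    n = len(table[source_word])
    # A single candidate needs no bits at all: the target word is determined.
    return math.ceil(math.log2(n)) if n > 1 else 0


def bits_without_alignment(table):
    """Bits needed by a fixed-length code over the whole target vocabulary."""
    vocab = {target for candidates in table.values() for target in candidates}
    return math.ceil(math.log2(len(vocab)))


if __name__ == "__main__":
    print("no alignment:", bits_without_alignment(TRANSLATIONS), "bits")
    for word in TRANSLATIONS:
        print(word, "->", bits_given_alignment(word, TRANSLATIONS), "bits")
```

With this toy table, every aligned word costs fewer bits than a code over the full target vocabulary, which is the effect the quoted passage describes.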

How Much Information Does a Human Translator Add to the Original?

Zoph, Ghazvininejad, Knight

2015

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

“…The Hutter Prize [12], a competition to compress a 100 m-word extract of English Wikipedia, was designed to further encourage research in text compression. Bilingual and multilingual text compression is a less-studied field [1, 13–18]. These papers provide different algorithms for compressing text in multilingual format, but they do not demonstrate how humans perform on this task.…”

Section: Related Work (mentioning)

confidence: 99%

Humans Outperform Machines at the Bilingual Shannon Game

Ghazvininejad, Knight

2016

Entropy

“…The most recent work is Sanchez-Martinez et al (2012) who propose to use "biwords" to compress parallel data sequentially. Similar as in Conley and Klein (2008), translational relations and Huffman coding are employed to take advantage of the improved entropy properties of the encoded data.…”

Section: Compression Of Parallel Corpora (mentioning)

confidence: 99%
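
As a rough illustration of the idea mentioned in the statement above, treating aligned word pairs ("biwords") as single symbols and entropy-coding them, the following sketch computes Huffman code lengths over a toy biword sequence. The biword list, the function name `huffman_code_lengths`, and the choice of plain static Huffman coding are assumptions made for this example; the sketch does not reproduce the encoder of Sánchez-Martínez et al., which also handles offsets and unaligned words.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still gets a 1-bit code.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tiebreak, {symbol: depth}); the unique integer
    # tiebreak guarantees the dictionaries are never compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Hypothetical biword sequence: (source word, target word) pairs taken from
# an invented word-aligned Spanish-English toy corpus.
biwords = [
    ("la", "the"), ("casa", "house"), ("la", "the"),
    ("casa", "home"), ("la", "the"), ("verde", "green"),
]

lengths = huffman_code_lengths(biwords)
total_bits = sum(lengths[b] for b in biwords)

if __name__ == "__main__":
    for biword, n_bits in sorted(lengths.items(), key=lambda kv: kv[1]):
        print(biword, n_bits)
    print("total bits:", total_bits)
```

Frequent biwords such as ("la", "the") receive the shortest codes, so a source word whose translation is highly predictable adds little to the encoded size.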