Evaluation of Alignment Methods for HTML Parallel Text

Sánchez-Villamil, Enrique; Santos-Antón, Susana; Ortiz-Rojas, Sergio; Forcada, Mikel L.

doi:10.1007/11816508_29

Cited by 5 publications

(7 citation statements)

References 6 publications


“…In terms of alignment quality, there is a complete study (Sanchez-Villamil et al, 2006) with results about this issue. The metrics used to evaluate Bitextor have been precision and recall.…”

Section: Results (mentioning)

confidence: 99%


Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor

Esplà-Gomis

Forcada

2010

The Prague Bulletin of Mathematical Linguistics


Nowadays, many websites on the Internet are multilingual and may be considered sources of parallel corpora. In this paper we will describe the free/open-source tool Bitextor, created to harvest aligned bitexts from these multilingual websites, which may be used to train corpus-based machine translation systems. This tool uses the work developed in previous approaches with modifications and improvements in order to obtain a tool as adaptable as possible to make it easier to process any kind of websites and work with any pairs of languages. Content-based and URL-based heuristics and algorithms applied to identify and align the parallel web pages in a website will be described and, finally, some results will be presented to show the functionality of the application and set the future work lines for this project.

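The citation statement above says that alignment quality was measured with precision and recall. As a minimal sketch, assuming an alignment is represented as a set of (source index, target index) links and that a hand-checked gold set is available, the two metrics could be computed as follows. The function name `precision_recall` and the sample links are hypothetical and are not taken from Bitextor or from the cited study.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of alignment links.

    Each link is a hashable pair such as (source_index, target_index).
    This representation is an assumption made for illustration only.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links proposed by the aligner that are in the gold set
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four gold links, three predicted links, two of them correct.
gold_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted_links = {(0, 0), (1, 1), (2, 3)}
precision, recall = precision_recall(predicted_links, gold_links)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision penalizes links the aligner proposes that are not in the gold set, while recall penalizes gold links the aligner misses.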

“…To assist in this task, another free/open-source application has been used: the TagAligner tool (Sanchez-Villamil et al, 2006), which both uses the tag structure in XML files and the length of the sentences in a pair of documents to align them (Brown et al, 1991; Gale and Church, 1994).…”

Section: Introduction (mentioning)

confidence: 99%

Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor

Esplà-Gomis

Forcada

2010

The Prague Bulletin of Mathematical Linguistics


“…The WWW can be regarded as a huge corpus containing millions of texts of variable quality (Kilgarriff and Grefenstette, 2003) and, thus, a collection of bitexts (a bitext is composed of versions in two different languages of a given text) can be built by finding pairs of documents in the web which are mutual translations. In order to identify automatically pairs of Uniform Resource Locators (URL) whose contents are bitext candidates, some approaches (Resnik and Smith, 2003) take into account the similitude of the URLs, the textual content of the pages and, to some extent, the structure of the text provided by the HTML tags (Sánchez-Villamil et al, 2006). Here, we will explore a complementary approach: after a collection of texts in the source language is compiled, documents which are a possible translation of the source texts are sought for.…”

Section: Introduction (mentioning)

confidence: 99%

Document Translation Retrieval Based on Statistical Machine Translation Techniques

Sánchez-Martínez

Carrasco

2011

Applied Artificial Intelligence


We compare different strategies to apply statistical machine translation techniques in order to retrieve documents which are a plausible translation of a given source document. Finding the translated version of a document is a relevant task, for example, when building a corpus of parallel texts that can help to create and to evaluate new machine translation systems. In contrast to the traditional settings in cross-language information retrieval tasks, in this case both the source and the target text are long and, thus, the procedure used to select what words or phrases will be included in the query has a key effect on the retrieval performance. In the statistical approach explored here, both the probability of the translation and the relevance of the terms are taken into account in order to build an effective query.


“…The relatively recent exploration of the web as a bilingual or multi-lingual corpus was made possible by the rapid growth in the number of web pages, and the availability of vast quantities of web-based translation texts involving many language pairs. Till now, the focus of most of the investigations in this field has been on the discovery and pairing of bilingual sites, domains, HTML documents and pages, although new research is emerging in processing and preparing HTML pages for the actual extraction of translation pairs when bilingual web pages are downloaded (Sanchez-Villamil et al, 2006). At the same time, extracting translations, whether from unannotated data resources or from meta-information-rich content, inevitably involves methods of aligning bilingual texts.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic extraction of translations from web-based bilingual materials

Zhu

Inkpen

Asudeh

2007

Machine Translation


Abstract. This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, for translation terminology building, for cross-language information retrieval and for corpus-based machine translation systems. Three years of officially published statistical news release texts at www.statcan.ca were collected to compose the StatCan Daily data bank. The English and French texts in this collection were roughly aligned using the Gale-Church statistical algorithm. After this, boundary markers of text segments and paragraphs were adjusted and the Gale-Church algorithm was run a second time for a more fine-grained text segment alignment. To detect misaligned areas of texts and to prevent mismatched translation pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used as anchoring features for comparison and misalignment detection. Results show that SDTES is very efficient in extracting translations from Daily texts, and very accurate in identifying mismatched translations. With parameters tuned, the text-mapping part can be used to align officially published bilingual government web-site materials; and the text-comparing component can be applied in pre-publication translation quality control and in evaluating the results of statistical machine translation systems.

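The abstract above describes aligning texts by segment length with the Gale-Church algorithm. The sketch below illustrates only the general idea of length-based alignment by dynamic programming. It replaces the probabilistic Gale-Church cost with a plain absolute length difference plus fixed penalties, so it is a simplified stand-in and not the SDTES or Gale-Church implementation. The function name `align_by_length`, the penalty values and the sample lengths are all assumptions made for illustration.

```python
def align_by_length(src_lens, tgt_lens, merge_penalty=1.0, skip_penalty=10.0):
    """Align two sequences of segment lengths with dynamic programming.

    Returns a list of (source_indices, target_indices) groups ("beads").
    Cost of a bead = |source length - target length| + a fixed penalty
    (a simplification; Gale-Church uses a probabilistic cost instead).
    """
    # Allowed bead shapes (source segments used, target segments used) and their penalties.
    beads = {(1, 1): 0.0, (2, 1): merge_penalty, (1, 2): merge_penalty,
             (1, 0): skip_penalty, (0, 1): skip_penalty}
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: cost[i][j] is the cheapest way to align the first i
    # source segments with the first j target segments.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                diff = abs(sum(src_lens[i:ni]) - sum(tgt_lens[j:nj]))
                candidate = cost[i][j] + diff + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


# Hypothetical segment lengths (in characters) for a source and a target text.
alignment = align_by_length([20, 35, 12], [22, 15, 18, 13])
print(alignment)
```

With these sample lengths the second source segment is paired with the second and third target segments together, because their combined length is the closest match.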
