Proceedings of the Fourth Workshop on Statistical Machine Translation - StatMT '09 2009
DOI: 10.3115/1626431.1626466

Mining a comparable text corpus for a Vietnamese-French statistical machine translation system

Abstract: This paper presents our first attempt at constructing a Vietnamese-French statistical machine translation system. Since Vietnamese is an under-resourced language, we concentrate on building a large Vietnamese-French parallel corpus. A document alignment method based on publication date, special words and sentence alignment result is proposed. The paper also presents an application of the obtained parallel corpus to the construction of a Vietnamese-French statistical machine translation s…

Cited by 13 publications (6 citation statements)
References 13 publications
“…Another direction, however, is to identify bitexts using only textual information, as the metadata associated with documents can often be sparse or unreliable (Uszkoreit et al, 2010). Some text-based approaches for identifying bitexts rely on methods such as n-gram scoring (Uszkoreit et al, 2010), named entity matching (Do et al, 2009), and cross-language information retrieval (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005).…”
Section: Related Work (mentioning)
confidence: 99%
“…Other approaches have identified parallel documents in unstructured web corpora by relying on metadata (Nie et al, 1999; Espla-Gomis and Forcada, 2010). Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents (Munteanu and Marcu, 2005, 2006; Udupa et al, 2009; Do et al, 2009; Abdul-Rauf and Schwenk, 2009). However, temporal features can be sparse, noisy, and unreliable.…”
Section: Related Work (mentioning)
confidence: 99%
“…Other approaches have identified parallel documents in unstructured web corpora by relying on metadata. Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents [1,10,28,29,40]. However, temporal features are often sparse, noisy, and unreliable.…”
Section: Related Work (mentioning)
confidence: 99%