2010 5th International Symposium on Telecommunications
DOI: 10.1109/istel.2010.5734083
A discriminative approach to filter out noisy sentence pairs from bilingual corpora

Cited by 8 publications (8 citation statements). References 17 publications.
“…The main difference between the work of [11] and ours in this component is the feature set employed in the classifier. The feature set proposed in our work comprises NullEN, NullFA, Full_NullEN, Full_NullFA, p(F|E) and p(E|F); the other features are described in [11].…”
Section: A Noisy Filtering Component
Confidence: 99%
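The feature names above suggest alignment-based statistics. A minimal sketch of extracting such features for one sentence pair follows, assuming NullEN/NullFA count null-aligned (unaligned) words on each side, Full_NullEN/Full_NullFA are the corresponding fractions of sentence length, and p(F|E)/p(E|F) are IBM-model translation probabilities; the exact definitions in the cited work may differ.

```python
import math

def pair_features(en_tokens, fa_tokens, alignment, p_f_given_e, p_e_given_f):
    """Compute alignment-based features for one English-Farsi sentence pair.

    alignment: set of (en_index, fa_index) links from a word aligner.
    p_f_given_e, p_e_given_f: translation probabilities from IBM-style models.
    """
    aligned_en = {i for i, _ in alignment}
    aligned_fa = {j for _, j in alignment}
    null_en = len(en_tokens) - len(aligned_en)   # unaligned English words
    null_fa = len(fa_tokens) - len(aligned_fa)   # unaligned Farsi words
    return {
        "NullEN": null_en,
        "NullFA": null_fa,
        "Full_NullEN": null_en / len(en_tokens),  # fraction of English side
        "Full_NullFA": null_fa / len(fa_tokens),  # fraction of Farsi side
        "log_p(F|E)": math.log(p_f_given_e),      # log to avoid underflow
        "log_p(E|F)": math.log(p_e_given_f),
    }

# Toy pair: only "book" is aligned, "the" has no counterpart.
feats = pair_features(["the", "book"], ["ketab"], {(1, 0)}, 1e-3, 2e-3)
```

The translation probabilities and alignment links would normally come from a word aligner such as GIZA++; here they are hard-coded for illustration.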
“…Filtering Component: The output of the sentence-aligning algorithm contains several erroneous sentence pairs, which should be detected and eliminated from the parallel corpus. Similar to the work of [11], a MaxEnt classifier has been employed to classify each sentence pair as correct or incorrect. The details of this component are illustrated in the next section.…”
Section: Introduction
Confidence: 99%
“…Their classifier is reported to achieve 0.88 precision and recall. Taghipour et al [24] also proposed a classification method for cleaning parallel data. Many features have been tested and used to build the models, such as translation probabilities based on the IBM translation models [3], the number of null-aligned words, length-based features, and features based on a language model.…”
Section: Related Work
Confidence: 99%
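A MaxEnt classifier with two classes is equivalent to logistic regression. The following sketch trains such a model to label sentence pairs as correct (keep) or incorrect (filter out); the toy feature vectors, labels, and learning rate are purely illustrative, not from the cited papers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression (two-class MaxEnt) model by SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t                                   # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy features per pair: (fraction of null-aligned words, length ratio |E|/|F|)
X = [(0.0, 1.0), (0.1, 0.9), (0.7, 3.0), (0.8, 0.2)]
y = [1, 1, 0, 0]                                        # 1 = keep, 0 = filter
w, b = train(X, y)

# A pair with few unaligned words and balanced lengths should be kept.
keep = sigmoid(sum(wi * xi for wi, xi in zip(w, (0.05, 1.1))) + b) > 0.5
```

In practice the feature vector would combine all of the features mentioned above (translation probabilities, null-alignment counts, length and language-model features), and training labels would come from hand-annotated or synthetically corrupted sentence pairs.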
“…Building a corpus that includes a vast number of parallel sentences is one of the most time-consuming and important tasks for a high-performance SMT system [10]. Training the translation model component in SMT requires large parallel corpora for the parameters to be estimated [24]. Therefore, higher translation accuracy can be achieved when machine translation systems are trained on increasing amounts of training data [12].…”
Section: Introduction
Confidence: 99%