2010 5th International Symposium on Telecommunications
DOI: 10.1109/istel.2010.5734083
A discriminative approach to filter out noisy sentence pairs from bilingual corpora

Cited by 8 publications (8 citation statements). References 17 publications.
“…The main difference between the work of [11] and ours in this component is the feature set employed in the classifier. The feature set proposed in our work comprises NullEN, NullFA, Full_NullEN, Full_NullFA, p(F|E) and p(E|F); the other features are described in [11].…”
Section: A Noisy Filtering Component
Confidence: 99%
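The feature names above suggest alignment-based statistics. A minimal sketch of extracting such features for one sentence pair follows, assuming NullEN/NullFA count null-aligned (unaligned) words on each side, Full_NullEN/Full_NullFA are the corresponding fractions of sentence length, and p(F|E)/p(E|F) are IBM-model translation probabilities; the exact definitions in the cited work may differ.

```python
import math

def pair_features(en_tokens, fa_tokens, alignment, p_f_given_e, p_e_given_f):
    """Compute alignment-based features for one English-Farsi sentence pair.

    alignment: set of (en_index, fa_index) links from a word aligner.
    p_f_given_e, p_e_given_f: translation probabilities from IBM-style models.
    """
    aligned_en = {i for i, _ in alignment}
    aligned_fa = {j for _, j in alignment}
    null_en = len(en_tokens) - len(aligned_en)   # unaligned English words
    null_fa = len(fa_tokens) - len(aligned_fa)   # unaligned Farsi words
    return {
        "NullEN": null_en,
        "NullFA": null_fa,
        "Full_NullEN": null_en / len(en_tokens),  # fraction of English side
        "Full_NullFA": null_fa / len(fa_tokens),  # fraction of Farsi side
        "log_p(F|E)": math.log(p_f_given_e),      # log to avoid underflow
        "log_p(E|F)": math.log(p_e_given_f),
    }

# Toy pair: only "book" is aligned, "the" has no counterpart.
feats = pair_features(["the", "book"], ["ketab"], {(1, 0)}, 1e-3, 2e-3)
```

The translation probabilities and alignment links would normally come from a word aligner such as GIZA++; here they are hard-coded for illustration.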
“…Filtering Component: The output of the sentence-aligning algorithm contains several erroneous sentence pairs, which should be detected and eliminated from the parallel corpus. Similar to the work of [11], a MaxEnt classifier has been employed to classify each sentence pair as correct or incorrect. The details of this component are illustrated in the next section.…”
Section: Introduction
Confidence: 99%
“…Their classifier is reported to achieve 0.88 precision and recall. Taghipour et al [24] also proposed a classification method for cleaning parallel data. Many features have been tested and used to build the models, such as translation probabilities based on the IBM translation models [3], the number of null-aligned words, length-based features, and features based on a language model.…”
Section: Related Work
Confidence: 99%
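A MaxEnt classifier with two classes is equivalent to logistic regression. The following sketch trains such a model to label sentence pairs as correct (keep) or incorrect (filter out); the toy feature vectors, labels, and learning rate are purely illustrative, not from the cited papers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression (two-class MaxEnt) model by SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t                                   # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy features per pair: (fraction of null-aligned words, length ratio |E|/|F|)
X = [(0.0, 1.0), (0.1, 0.9), (0.7, 3.0), (0.8, 0.2)]
y = [1, 1, 0, 0]                                        # 1 = keep, 0 = filter
w, b = train(X, y)

# A pair with few unaligned words and balanced lengths should be kept.
keep = sigmoid(sum(wi * xi for wi, xi in zip(w, (0.05, 1.1))) + b) > 0.5
```

In practice the feature vector would combine all of the features mentioned above (translation probabilities, null-alignment counts, length and language-model features), and training labels would come from hand-annotated or synthetically corrupted sentence pairs.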
“…Building a corpus that includes a vast number of parallel sentences is one of the most time-consuming and important tasks for a high-performance SMT system [10]. Training the translation model component in SMT requires large parallel corpora for the parameters to be estimated [24]. Therefore, higher translation accuracy can be achieved when machine translation systems are trained on increasing amounts of training data [12].…”
Section: Introduction
Confidence: 99%