Integrating unsupervised and supervised word segmentation: The role of goodness measures

Zhang, Hai; Kit, Chunyu

doi:10.1016/j.ins.2010.09.008

Cited by 50 publications

(27 citation statements)

References 28 publications

(56 reference statements)


“…Therefore, we used the sentences from the NTCIR-8 JE test set as the development set for JE task. The word segmentation was done by BaseSeg (Zhao et al, 2006; Zhao and Kit, 2008; Zhao and Kit, 2011; Zhao et al, 2013) for Chinese and Mecab for Japanese.…”

Section: Methods (mentioning)

confidence: 99%

Learning Word Reorderings for Hierarchical Phrase-based Statistical Machine Translation

Zhang, Utiyama, Sumita et al. 2015

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Confere

Self Cite


Statistical models for reordering source words have been used to enhance the hierarchical phrase-based statistical machine translation system. Existing word reordering models learn the reordering for any two source words in a sentence or only for two continuous words. This paper proposes a series of separate sub-models to learn reorderings for word pairs with different distances. Our experiments demonstrate that reordering sub-models for word pairs with distance less than a specific threshold are useful to improve translation quality. Compared with previous work, our method may more effectively and efficiently exploit helpful word reordering information.


“…Therefore, we used the sentences from the NTCIR-8 JE test set as the development set. Word segmentation was done by BaseSeg (Zhao et al, 2006; Zhao and Kit, 2008; Zhao and Kit, 2011) for Chinese and Mecab for Japanese. To learn the classifiers for each translation task, the training set and development set were put together to obtain symmetric word alignment using GIZA++ (Och and Ney, 2003) and the grow-diag-final-and heuristic (Koehn et al, 2003).…”

Section: Methods (mentioning)

confidence: 99%

Learning Hierarchical Translation Spans

Zhang, Utiyama, Sumita et al. 2014

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite


We propose a simple and effective approach to learn translation spans for the hierarchical phrase-based translation model. Our model evaluates if a source span should be covered by translation rules during decoding, which is integrated into the translation system as soft constraints. Compared to syntactic constraints, our model is directly acquired from an aligned parallel corpus and does not require parsers. Rich source side contextual features and advanced machine learning methods were utilized for this learning task. The proposed approach was evaluated on NTCIR-9 Chinese-English and Japanese-English translation tasks and showed significant improvement over the baseline system.


“…Two key techniques, word segmentation (Zhao et al, 2006a; Zhao and Kit, 2008b; Zhao et al, 2006b; Zhao and Kit, 2008a; Zhao and Kit, 2007; Zhao and Kit, 2011; Zhao et al, 2010) and language model (LM), are also popularly used for C-SC. Most of those approaches can fall into four categories.…”

Section: Related Work (mentioning)

confidence: 99%

An Improved Graph Model for Chinese Spell Checking

Xin, Zhao, Wang et al. 2014

Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing

Self Cite


In this paper, we propose an improved graph model for Chinese spell checking. The model is based on a graph model for generic errors and two independently trained models for specific errors. First, a graph model represents a Chinese sentence and a modified single source shortest path algorithm is performed on the graph to detect and correct generic spelling errors. Then, we utilize conditional random fields to solve two specific kinds of common errors: the confusion of "在" (at) (pinyin: zai), "再" (again, more, then) (pinyin: zai) and "的" (of) (pinyin: de), "地" (-ly, adverb-forming particle) (pinyin: de), "得" (so that, have to) (pinyin: de). Finally, a rule based system is exploited to solve the pronoun usage confusions: "她" (she) (pinyin: ta), "他" (he) (pinyin: ta) and some other fixed collocation errors. The proposed model is evaluated on the standard data set released by the SIGHAN Bake-off 2014 shared task, and gives competitive results. * This work was partially supported by the National Natural Science Foundation of China (No. 60903119, No. 61170114, and No. 61272248) (CSC fund 201304490199 and 201304490171), and the art and science interdiscipline funds of Shanghai Jiao Tong University (A study on mobilization mechanism and alerting threshold setting for online community, and media image and psychology evaluation: a computational intelligence approach).
