Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
DOI: 10.3115/v1/d14-1092

A Joint Model for Unsupervised Chinese Word Segmentation

Abstract: In this paper, we propose a joint model for unsupervised Chinese word segmentation (CWS). Inspired by the "products of experts" idea, our joint model first combines two generative models, a word-based hierarchical Dirichlet process (HDP) model and a character-based hidden Markov model (HMM), by simply multiplying their probabilities together. Gibbs sampling is used for model inference. To further incorporate the strength of a goodness-based model, we then integrate nVBE into our joint model by using it to init…
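
Concretely, the product-of-experts combination scores a candidate segmentation by multiplying the two models' probabilities, which is the same as adding their log-probabilities. Below is a minimal Python sketch of that scoring idea; the toy scoring functions, the tiny lexicon, and the brute-force enumeration (the paper itself uses Gibbs sampling for inference) are all illustrative assumptions, not the paper's implementation.

```python
import math
from itertools import combinations

def log_p_word(words):
    """Toy stand-in for the word-based HDP model: favors known words."""
    lexicon = {"北京": -2.0, "大学": -2.5, "北": -5.0, "京": -5.0,
               "大": -4.5, "学": -4.5}
    return sum(lexicon.get(w, -10.0) for w in words)

def log_p_char(words):
    """Toy stand-in for the character-based HMM: mildly penalizes each segment."""
    return -1.0 * len(words)

def joint_score(words):
    # Product of experts: multiplying probabilities == adding log-probabilities.
    return log_p_word(words) + log_p_char(words)

def segmentations(chars):
    """Enumerate every segmentation of a character sequence."""
    n = len(chars)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [chars[a:b] for a, b in zip(bounds, bounds[1:])]

best = max(segmentations("北京大学"), key=joint_score)
print(best)  # ['北京', '大学'] under these toy scores
```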

Cited by 17 publications (18 citation statements)
References 13 publications (23 reference statements)
“…The only available resource is a very small bilingual lexicon of the 1,000 most common Chinese words and their corresponding English translations. In this setting, we use an unsupervised Chinese word segmentation approach combining a Hierarchical Dirichlet Process (HDP) model with a Bayesian HMM model (Chen et al., 2014) to segment Chinese text instead of the preprocessing steps mentioned in Section 4.1.1. According to Figure 5, our approach still performs well in the low-resource setting, although its accuracy curve is lower than in the rich-resource setting, demonstrating that it can work in both rich- and low-resource settings.…”
Section: Results (mentioning)
confidence: 99%
“…We evaluate our models on the SIGHAN 2005 bakeoff (Emerson, 2005) datasets and replace all punctuation marks with punc, English characters with eng, and Arabic numbers with num (Chen et al., 2014; Wang et al., 2011; Mochihashi et al., 2009; Magistry and Sagot, 2012) for all text, and we only consider segmenting the text between punctuation marks. Following Chen et al. (2014), we use both the training data and the test data for training, and only the test data are used for evaluation. To make a fair comparison with previous work, we do not consider using other, larger raw corpora.…”
Section: Experimental Settings and Detail (mentioning)
confidence: 99%
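
For illustration, the preprocessing convention quoted above might look like the following Python sketch: map punctuation to punc, English letters to eng, and Arabic numerals to num, then treat each span between punctuation marks as an independent unit to segment. The exact character classes and helper names are assumptions, not the cited papers' code.

```python
import re

# Assumed character classes; real experiments may define these differently.
ENG = re.compile(r"[A-Za-z]+")
NUM = re.compile(r"[0-9０-９]+")
PUNC = re.compile(r"[，。！？、；：“”（）,.!?;:()]+")

def normalize(text: str) -> str:
    """Replace English runs, digit runs, and punctuation with special tokens."""
    text = ENG.sub(" eng ", text)
    text = NUM.sub(" num ", text)
    text = PUNC.sub(" punc ", text)
    return text

def spans_between_punctuation(text: str):
    """Yield the spans to be segmented, i.e. the text between punctuation."""
    for span in normalize(text).split(" punc "):
        span = span.strip()
        if span:
            yield span

print(list(spans_between_punctuation("今天是2014年，EMNLP在Doha举行。")))
# ['今天是 num 年', 'eng 在 eng 举行'] -> each span is segmented separately
```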
“…Table 4 (excerpt): Accuracies of unsupervised word segmentation; Precision (All) is 99.9 on Kyoto, 99.9 on BCCWJ, 99.6 on MSR, 99.9 on CITYU, and 99.0 on BEST. BE is the Branching Entropy method of Zhikov et al. (2010), and HMM² is a product of word and character HMMs of Chen et al. (2014). * is the accuracy decoded with L = 3; it becomes 81.7 with L = 4, as for MSR and PKU.…”
Section: Dataset (mentioning)
confidence: 99%
“…This means that we want the most "natural" segmentation w, one that has a high probability under a language model p(w|s). Recently, Chen et al. (2014) proposed an intermediate model between heuristic and statistical models as a product of character and word HMMs. However, these two component models share no information with each other, which is not the case with generative models.…”
Section: Introduction (mentioning)
confidence: 99%
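
As a worked example of the argmax_w p(w|s) view in the quote above, the sketch below finds the highest-probability segmentation under a toy unigram word model using Viterbi-style dynamic programming over segment boundaries; the vocabulary and probabilities are illustrative assumptions, not a model from any of the cited papers.

```python
import math

# Toy unigram word model: log-probabilities for a handful of segments.
LOGP = {"中国": math.log(0.04), "人民": math.log(0.03),
        "中": math.log(0.005), "国": math.log(0.005),
        "人": math.log(0.006), "民": math.log(0.004)}
OOV = math.log(1e-8)  # fallback for out-of-vocabulary segments
MAX_LEN = 4           # longest candidate word considered

def segment(s: str):
    """Return argmax_w p(w|s) under the unigram model via dynamic programming."""
    n = len(s)
    best = [(-math.inf, -1)] * (n + 1)  # (best log-prob ending here, backpointer)
    best[0] = (0.0, -1)
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_LEN), j):
            score = best[i][0] + LOGP.get(s[i:j], OOV)
            if score > best[j][0]:
                best[j] = (score, i)
    # Recover the segmentation from the backpointers.
    words, j = [], n
    while j > 0:
        i = best[j][1]
        words.append(s[i:j])
        j = i
    return words[::-1]

print(segment("中国人民"))  # ['中国', '人民'] under these toy scores
```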