Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL (ACL '06), 2006
DOI: 10.3115/1220175.1220260
Contextual dependencies in unsupervised word segmentation

Abstract: Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabil…

Cited by 127 publications (134 citation statements)
References 14 publications
“…We evaluate three models of this type: local minima in transitional probability (TP); minima in TP with smoothed counts; local minima in pointwise mutual information. We then evaluate three other models which focused on finding a lexicon to fit the input corpus: a clustering model by Swingley (2005) which also uses pointwise mutual information; PARSER (Perruchet & Vinter, 1998), a memory-decay model of segmentation; and a Bayesian model in the style of Brent (1999) by Goldwater, Griffiths, and Johnson (2006).…”
Section: Introduction (mentioning)
confidence: 99%
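The transitional-probability (TP) idea in this excerpt is concrete enough to sketch: estimate TP(y|x) = count(xy)/count(x) over the corpus, then posit a word boundary wherever TP dips to a local minimum. The following is a minimal illustrative sketch; the function name and the character-level toy input are our own assumptions, not any cited author's implementation:

```python
from collections import Counter

def segment_by_tp_minima(units):
    """Place a boundary wherever the forward transitional probability
    TP(y|x) = count(x, y) / count(x) hits a local minimum."""
    unigrams = Counter(units)
    bigrams = Counter(zip(units, units[1:]))
    # TP across each adjacent gap in the sequence
    tps = [bigrams[(x, y)] / unigrams[x] for x, y in zip(units, units[1:])]
    words, current = [], [units[0]]
    for i in range(1, len(units)):
        left = tps[i - 2] if i >= 2 else float("inf")
        right = tps[i] if i < len(tps) else float("inf")
        if tps[i - 1] < left and tps[i - 1] < right:  # local minimum at gap i-1
            words.append("".join(current))
            current = []
        current.append(units[i])
    words.append("".join(current))
    return words

# toy usage over characters (the cited models operate over syllables or phonemes)
print(segment_by_tp_minima(list("thedogseesthedog")))
```

The smoothed-count and pointwise-mutual-information variants mentioned in the excerpt differ only in how the per-gap score is computed.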
“…In (Goldwater et al., 2006) they report issues with mixing in the sampler that were overcome using annealing. In (Mochihashi et al., 2009) this issue was overcome by using a blocked sampler together with a dynamic programming approach.…”
Section: Bayesian Inference (mentioning)
confidence: 99%
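Annealing in this setting means raising the sampler's conditional probabilities to a power 1/T and lowering the temperature T toward 1 over the run, which flattens the distribution early on so the Gibbs sampler can escape poor local modes. A generic sketch; the schedule and the probabilities below are illustrative assumptions, not the cited paper's exact settings:

```python
import random

def annealed_boundary_sample(p_split, p_merge, temperature):
    """Resample one binary boundary variable from probabilities raised
    to 1/T; T >> 1 flattens the choice, T = 1 recovers the true conditional."""
    a = p_split ** (1.0 / temperature)
    b = p_merge ** (1.0 / temperature)
    return random.random() < a / (a + b)  # True = place the boundary

# cool the sampler from a high temperature down to 1
for T in (10.0, 5.0, 2.0, 1.0):
    draws = [annealed_boundary_sample(0.8, 0.2, T) for _ in range(10_000)]
    print(f"T={T}: boundary rate {sum(draws) / len(draws):.2f}")
```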
“…In [19] they report issues with mixing in the sampler that were overcome using annealing. In [18] this issue was overcome by using a blocked sampler together with a dynamic programming approach.…”
Section: Gibbs Sampling (mentioning)
confidence: 99%
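The blocked alternative resamples an entire utterance's segmentation at once: a forward pass sums the probability of every prefix under the current word model, and a backward pass samples boundaries in proportion to those sums. Below is a sketch of this forward-filtering/backward-sampling idea under a unigram word model; `word_prob` is an assumed callable standing in for the model's current predictive probability, not a library API:

```python
import random

def resample_segmentation(chars, word_prob, max_len=8):
    """Blocked resampling of one utterance's segmentation by dynamic
    programming. alpha[i] accumulates the total probability of
    generating chars[:i] as a sequence of words."""
    n = len(chars)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):                      # forward pass
        for j in range(max(0, i - max_len), i):
            alpha[i] += alpha[j] * word_prob("".join(chars[j:i]))
    words, i = [], n                               # backward sampling pass
    while i > 0:
        cands = [(j, alpha[j] * word_prob("".join(chars[j:i])))
                 for j in range(max(0, i - max_len), i)]
        r = random.uniform(0, sum(p for _, p in cands))
        for j, p in cands:
            r -= p
            if r <= 0:
                break
        words.append("".join(chars[j:i]))
        i = j
    return words[::-1]
```

In a full sampler this move alternates with updating the word model's counts; the dynamic program is what makes one blocked move a practical replacement for many strongly coupled single-boundary moves.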
“…The Dirichlet process model we use in our approach is a simple model that resembles the cache models used in language modeling [19]. Intuitively, the model has two basic components: a model for generating an outcome that has already been generated at least once before, and a second model that assigns a probability to an outcome that has not yet been produced.…”
Section: Unigram Dirichlet Process Model (mentioning)
confidence: 99%
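The two components described here correspond to the standard Dirichlet-process predictive distribution: a word generated n_w times before is reused with probability proportional to n_w (the "cache"), while a novel word is drawn from a base distribution P0 with weight given by a concentration parameter alpha0, so that P(w) = (n_w + alpha0 * P0(w)) / (n + alpha0). A minimal sketch; alpha0 and the toy base distribution are illustrative assumptions:

```python
from collections import Counter

class DPUnigram:
    """Dirichlet-process 'cache' model: reuse seen words in proportion to
    their counts, back off to a base distribution p0 for novel words."""
    def __init__(self, alpha0, p0):
        self.alpha0 = alpha0
        self.p0 = p0            # callable: base probability of a word
        self.counts = Counter()
        self.total = 0

    def prob(self, word):
        # P(w) = (n_w + alpha0 * P0(w)) / (n + alpha0)
        return (self.counts[word] + self.alpha0 * self.p0(word)) / \
               (self.total + self.alpha0)

    def observe(self, word):
        self.counts[word] += 1
        self.total += 1

# toy base distribution: uniform over 26 letters, geometric word length
def p0(word, p_stop=0.5):
    return (p_stop * ((1 - p_stop) / 26) ** (len(word) - 1)) / 26 if word else 0.0

model = DPUnigram(alpha0=20.0, p0=p0)
model.observe("dog"); model.observe("dog")
print(model.prob("dog"), model.prob("cat"))  # cached word vs. novel word
```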