Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models

Uchiumi, Kei; Tsukahara, Hiroshi; Mochihashi, Daichi

doi:10.3115/v1/p15-1171

Cited by 25 publications

(21 citation statements)

References 14 publications

“…The natural language processing tasks include (1) Part-of-Speech Tagging (POS-Tag): Part-of-Speech (POS) tagging is an important and highly competitive task in natural language processing. We use the standard benchmark dataset in prior work [5,40], which is derived from raw features in total. The evaluation metric is balanced F-score.…”

Section: Tasks (mentioning)

confidence: 99%

Towards easier and faster sequence labeling for natural language processing: A search-based probabilistic online learning framework (SAPO)

Sun

Zhang

et al. 2019

Information Sciences

There are two major approaches for sequence labeling. One is the probabilistic gradient-based methods such as conditional random fields (CRF) and neural networks (e.g., RNN), which have high accuracy but drawbacks: slow training, and no support of search-based optimization (which is important in many cases). The other is the search-based learning methods such as structured perceptron and margin infused relaxed algorithm (MIRA), which have fast training but also drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and "easy" solution, a search-based probabilistic online learning method, to address most of those issues. The method is "easy", because the optimization algorithm at the training stage is as simple as the decoding algorithm at the test stage. This method searches the output candidates, derives probabilities, and conducts efficient online learning. We show that this method, which is easy to implement, can support search-based optimization and obtain top accuracy with fast training and theoretical guarantee of convergence. Experiments on well-known tasks show that our method has better accuracy than CRF and BiLSTM.

“…The model proposed in this paper has a close connection to unsupervised word segmentation and part-of-speech (POS) induction [6]. A key difference is that, while they use characters as the unit for the input sequence, we utilize word sequences.…”

Section: Unsupervised Word Segmentation and Part-of-speech Induction (mentioning)

confidence: 99%

“…Uchiumi et al [6] can be seen as an extension to Mochihashi et al [35], who focused on unsupervised word segmentation. They proposed a nonparametric Bayesian n-gram language model based on Pitman-Yor processes.…”

Section: Unsupervised Word Segmentation and Part-of-speech Induction (mentioning)

confidence: 99%

“…Given an unsegmented corpus, the model infers word segmentation using Gibbs sampling. Uchiumi et al [6] worked on the joint task of unsupervised word segmentation and POS induction. We employ their model, PYHSMM, for our task.…”

Section: Unsupervised Word Segmentation and Part-of-speech Induction (mentioning)

confidence: 99%

“…Regarding the technical aspects, the proposed tagger is a hybrid of a generative model and a discriminative model. The generative model called the Pitman-Yor hidden semi-Markov model (PYHSMM) [6] recognizes high-frequency word sequences as NE chunks and identifies their classes. The discriminative model, semi-Markov CRF (semiCRF) [7], initializes the learning process using the seed terms and generalizes to other NEs of the same classes.…”

Section: Introduction (mentioning)

confidence: 99%

A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition

Tomori

Murawaki

Mori

2019

EasyChair Preprints

We propose PYHSCRF, a novel tagger for domain-specific named entity recognition that only requires a few seed terms, in addition to unannotated corpora, and thus permits the iterative and incremental design of named entity (NE) classes for new domains. The proposed model is a hybrid of a generative model named PYHSMM and a semi-Markov CRF-based discriminative model, which play complementary roles in generalizing seed terms and in distinguishing between NE chunks and non-NE words. It also allows a smooth transition to full-scale annotation because the discriminative model makes effective use of annotated data when available. Experiments involving two languages and three domains demonstrate that the proposed method outperforms baselines.

A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition

Tomori

Murawaki

Mori

2023

Lecture Notes in Computer Science
