Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model

Morita, Hajime; Kawahara, Daisuke; Kurohashi, Sadao

doi:10.18653/v1/d15-1276

Cited by 77 publications

(50 citation statements)

References 10 publications

(12 reference statements)

“…In contrast, we leverage both character embeddings and word embeddings for better accuracies. (Morita et al, 2015; Liu et al, 2016; Cai and Zhao, 2016), which are different from our work in the basic framework. For instance, Liu et al (2016) follow Andrew (2006) using a semi-CRF for structured inference.…”

Section: Error Analysis (contrasting)

confidence: 62%

Transition-Based Neural Word Segmentation

Zhang

2016

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Character-based and word-based methods are two main types of statistical models for Chinese word segmentation, the former exploiting sequence labeling models over characters and the latter typically exploiting a transition-based model, with the advantages that word-level features can be easily utilized. Neural models have been exploited for character-based Chinese word segmentation, giving high accuracies by making use of external character embeddings, yet requiring less feature engineering. In this paper, we study a neural model for word-based Chinese word segmentation, by replacing the manually designed discrete features with neural features in a word-based segmentation framework. Experimental results demonstrate that word features lead to comparable performances to the best systems in the literature, and a further combination of discrete and neural features gives top accuracies.

“…角(corner)" (Zheng et al, 2013), which is infeasible by using sparse one-hot character features. In addition to character embeddings, distributed representations of character bigrams (Pei et al, 2014) and words (Morita et al, 2015; Zhang et al, 2016b) have also been shown to improve segmentation accuracies.…”

Section: Introduction (mentioning)

confidence: 99%

“…With respect to non-linear modeling power, various network structures have been exploited to represent contexts for segmentation disambiguation, including multi-layer perceptrons on five-character windows (Zheng et al, 2013; Pei et al, 2014; Chen et al, 2015a), as well as LSTMs on characters (Chen et al, 2015b; Xu and Sun, 2016) and words (Morita et al, 2015; Cai and Zhao, 2016; Zhang et al, 2016b). For structured learning and inference, CRF has been used for character sequence labelling models (Pei et al, 2014; Chen et al, 2015b) and structural beam search has been used for word-based segmentors (Cai and Zhao, 2016; Zhang et al, 2016b).…”

Section: Introduction (mentioning)

confidence: 99%

Neural Word Segmentation with Rich Pretraining

Yang, Zhang, Dong

2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)

Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model, pretraining the most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive to the best methods on six benchmarks.

“…Unfortunately, such large-scale data is not available for many lesser-studied languages, including Ainu. For Japanese and Chinese, word segmentation is sometimes modelled jointly with part-of-speech tagging, as the output of the latter task can provide useful information to the segmenter [21, 28, 29, 30].…”

Section: Related Work (mentioning)

confidence: 99%

MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

2019

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
