Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017
DOI: 10.18653/v1/p17-2096

Fast and Accurate Neural Word Segmentation for Chinese

Abstract: Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both the training and working procedures of current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate the existing drawbacks. Our segmenter is truly end-to-end, capable of performing segmentation much faster and even more accurately than state-of-the-art neural models on Chinese benchmark datasets.
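As a point of reference, the following is a minimal, illustrative sketch (not the authors' released implementation) of the greedy decoding idea the abstract describes: segment left to right, scoring each candidate word with a representation that balances a word embedding against a composition of its character embeddings. The lookups `word_vecs`/`char_vecs`, the scorer `score_fn`, and the mean-pooling composition are all hypothetical placeholders.

```python
import numpy as np

def greedy_segment(chars, word_vecs, char_vecs, score_fn, max_word_len=4):
    """Left-to-right greedy segmentation of a character sequence (sketch)."""
    words, i = [], 0
    while i < len(chars):
        best_j, best_score = i + 1, -np.inf
        # Consider every candidate word starting at position i.
        for j in range(i + 1, min(i + max_word_len, len(chars)) + 1):
            cand = "".join(chars[i:j])
            # Character-composed representation: mean of character embeddings.
            char_repr = np.mean([char_vecs[c] for c in chars[i:j]], axis=0)
            # Balance with the word embedding when the word is in vocabulary.
            if cand in word_vecs:
                repr_ = 0.5 * word_vecs[cand] + 0.5 * char_repr
            else:
                repr_ = char_repr
            # Hypothetical scorer conditioned on the segmentation so far.
            s = score_fn(repr_, words)
            if s > best_score:
                best_j, best_score = j, s
        words.append("".join(chars[i:best_j]))
        i = best_j
    return words
```

Because each position commits greedily to the best-scoring candidate, decoding is linear in sentence length (times the maximum word length), which is the source of the speed advantage the abstract claims.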

Cited by 94 publications (93 citation statements) · References 27 publications

Citation statements (ordered by relevance):
“…For example, Liu et al. (2016) run a bi-directional LSTM over the characters of the word candidate and then concatenate the bi-directional LSTM outputs at both end points. Cai et al. (2017) adopt a gating mechanism to control the relative importance of each character in the word candidate. Besides modeling word representations directly, sequential labeling is another popular approach.…”
Section: Model (mentioning, confidence: 99%)
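For concreteness, here is a rough PyTorch sketch of the two candidate-word encoders this excerpt describes: a bi-directional LSTM whose outputs at the two end points are concatenated (in the spirit of Liu et al., 2016), and a per-character sigmoid gate that reweights characters before summing them (in the spirit of Cai et al., 2017). The class names, dimensions, and exact gate form are illustrative assumptions, not the cited papers' implementations.

```python
import torch
import torch.nn as nn

class EndpointBiLSTMEncoder(nn.Module):
    """Encode a word candidate by concatenating bi-LSTM outputs at both end points."""
    def __init__(self, char_dim=64, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(char_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, char_embs):            # (batch, cand_len, char_dim)
        out, _ = self.bilstm(char_embs)      # (batch, cand_len, 2*hidden_dim)
        # Concatenate the output states at the first and last characters.
        return torch.cat([out[:, 0, :], out[:, -1, :]], dim=-1)

class GatedCharEncoder(nn.Module):
    """Reweight each character by a learned sigmoid gate, then sum."""
    def __init__(self, char_dim=64):
        super().__init__()
        self.gate = nn.Linear(char_dim, 1)

    def forward(self, char_embs):            # (batch, cand_len, char_dim)
        g = torch.sigmoid(self.gate(char_embs))   # (batch, cand_len, 1)
        return (g * char_embs).sum(dim=1)         # (batch, char_dim)
```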
“…Neural networks have become ubiquitous in natural language processing. For the word segmentation task, there has been a growing body of work exploring novel neural network architectures for learning useful representations and thus making better segmentation predictions (Pei et al., 2014; Ma and Hinrichs, 2015; Zhang et al., 2016a; Liu et al., 2016; Cai et al., 2017; Wang and Xu, 2017).…”
Section: Introduction (mentioning, confidence: 99%)
“…CMRC-2017 leaderboard: http://www.hfl-tek.com/cmrc2017/leaderboard/. The word vocabulary sizes of SNLI and CMRC-2017 are 30k and 90k, respectively. Table VI shows that our Word + BPE-FRQ significantly outperforms the CAS Reader in all types of testing, with improvements of 7.0% on the PD and 8.8% on the CFT test sets, respectively.…”
Section: B. Reading Comprehension (mentioning, confidence: 99%)
“…To alleviate the noise introduced by the extra part on the source side, and inspired by the work of (Dhingra et al., 2016; Pang et al., 2016; Zhang et al., 2018c,a,b; Cai et al., 2017b), our model adopts a gated-attention (GA) mechanism that performs multiple hops over the pinyin with the extended context, as shown in Figure 1(d).…”
Section: Model (mentioning, confidence: 99%)
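Below is a rough PyTorch sketch of a single gated-attention hop in the spirit of Dhingra et al. (2016), with pinyin-side states attending over extended-context states. The shapes, the dot-product scoring, and the function name are illustrative assumptions, not the citing model's exact formulation.

```python
import torch

def gated_attention_hop(pinyin, context):
    # pinyin:  (batch, P, d)  token states on the pinyin side
    # context: (batch, C, d)  extended-context token states
    scores = torch.bmm(pinyin, context.transpose(1, 2))  # (batch, P, C)
    alpha = torch.softmax(scores, dim=-1)                # attention over context
    attended = torch.bmm(alpha, context)                 # (batch, P, d)
    # Multiplicative gate: each pinyin state is filtered element-wise
    # by its context-aware summary; "multiple hops" stack this step.
    return pinyin * attended
```

Stacking several such hops lets the pinyin representation be progressively filtered by the extended context, which is how the multi-hop GA mechanism described above suppresses noise from the extra source-side material.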