Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers (NAACL '06), 2006
DOI: 10.3115/1614049.1614098
Subword-based tagging by conditional random fields for Chinese word segmentation

Abstract: We proposed two approaches to improve Chinese word segmentation: a subword-based tagging and a confidence measure approach. We found the former achieved better performance than the existing character-based tagging, and the latter improved segmentation further by combining the former with a dictionary-based segmentation. In addition, the latter can be used to balance out-of-vocabulary rates and in-vocabulary rates. By these techniques we achieved higher F-scores in CITYU, PKU and MSR corpora than the best resul…

Cited by 33 publications (26 citation statements)
References 2 publications
“…5 https://code.google.com/p/word2vec/

Model                        PKU    MSRA
Best05                       95.0   96.0
Best05 (Tseng et al, 2005)   95.0   96.4
(Zhang et al, 2006)          95.1   97.1
(Zhang and Clark, 2007)      94.5   97.2
                             95.2   97.3
(Sun et al, 2012)            95.4   97.4
(Zhang et al, 2013)          96…

A very common feature in Chinese word segmentation is the character bigram feature. Formally, at the i-th character of a sentence c[1:n], the bigram features are c_k c_{k+1} (i − 3 < k < i + 2).…”
Section: Minimal Feature Engineering
confidence: 99%
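The bigram window in the quoted passage (features c_k c_{k+1} for i − 3 < k < i + 2, i.e. k ∈ {i−2, …, i+1}) can be sketched as follows; the function name and the `#` padding sentinel are illustrative assumptions, not taken from the cited papers.

```python
def char_bigram_features(chars, i, pad="#"):
    """Character bigram features c_k c_{k+1} for k in {i-2, ..., i+1}.

    Positions outside the sentence are padded with a sentinel character,
    a common convention in feature-template implementations.
    """
    n = len(chars)

    def at(k):
        # Return the character at position k, or the pad symbol if out of range.
        return chars[k] if 0 <= k < n else pad

    return [at(k) + at(k + 1) for k in range(i - 2, i + 2)]

# Example with a toy 5-character "sentence", features at position i = 2:
print(char_bigram_features(list("ABCDE"), 2))  # ['AB', 'BC', 'CD', 'DE']
```

At sentence boundaries the sentinel fills in, e.g. at i = 0 the features become `['##', '#A', 'AB', 'BC']`.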
“…There are two primary classes of models: character-based, where the foundational units for processing are individual Chinese characters (Xue, 2003; Tseng et al, 2005; Zhang et al, 2006; Wang et al, 2010), and word-based, where the units are full words based on some dictionary or training lexicon (Andrew, 2006; Zhang and Clark, 2007). Sun (2010) details their respective theoretical strengths: character-based approaches better model the internal compositional structure of words and are therefore more effective at inducing new OOV words; word-based approaches are better at reproducing the words of the training lexicon and can capture information from significantly larger contextual spans.…”
Section: Introduction
confidence: 99%
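Character-based models such as those cited above cast segmentation as per-character sequence labeling. A minimal sketch of the widely used BMES encoding follows; the label names reflect common convention and may differ from the exact tag sets used in the cited papers.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (a list of words) to per-character tags:
    S = single-character word; B/M/E = begin/middle/end of a multi-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# Example: a segmentation with two two-character words and one single-character word.
print(words_to_bmes(["北京", "大学", "生"]))  # ['B', 'E', 'B', 'E', 'S']
```

A tagger (e.g. a CRF) predicts one such label per character, and decoding the label sequence recovers the word boundaries.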
“…Another approach has been that of Cherry and Guo (2015) and Peng and Dredze (2015), who relied on training unsupervised lexical embeddings in place of these upstream systems and achieved state-of-the-art results for English and Chinese social media, respectively. The same approach was also found helpful for NER in the news domain (Collobert and Weston, 2008; Passos et al, 2014). In Asian languages like Chinese, Japanese and Korean, word segmentation is a critical first step for many tasks (Gao et al, 2005; Zhang et al, 2006; Mao et al, 2008). Peng and Dredze (2015) showed the value of word segmentation to Chinese NER in social media by using character positional embeddings, which encoded word segmentation information.…”
Section: Introduction
confidence: 99%