2015
DOI: 10.1007/978-3-319-25207-0_50
Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts

Abstract: In this paper, we give an overview of the shared task at the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese word segmentation and part-of-speech (POS) tagging for micro-blog texts. Unlike the widely used newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. The shared task has two sub-tasks: (1) individual Chinese word segmentation and (2) joint Chinese word segmentation and POS tagging. Each subtask has three…
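Shared tasks of this kind are conventionally scored with word-level precision, recall, and F1 against gold-standard segmentations. The snippet below is a minimal sketch of that standard metric, not the official NLPCC 2015 scorer; the function names and the span-matching formulation are illustrative assumptions.

```python
# Minimal sketch of word-level P/R/F1 for Chinese word segmentation output.
# Assumption: the official NLPCC 2015 evaluation script may differ in details;
# this is only the common span-matching formulation.

def to_spans(words):
    """Convert a word sequence into (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 between gold and predicted segmentations."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: gold "我们/喜欢/微博" vs. predicted "我们/喜/欢/微博"
print(prf(["我们", "喜欢", "微博"], ["我们", "喜", "欢", "微博"]))
```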

Cited by 14 publications (4 citation statements)
References 5 publications (3 reference statements)
“…We use the NLPCC 2015 dataset (Qiu et al., 2015) to evaluate our model on micro-blog texts. The NLPCC 2015 data are provided by the shared task at the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese Word Segmentation and POS Tagging for Micro-blog Texts.…”
Section: Dataset
confidence: 99%
“…It is important to study social sentiment analysis methods for Weibo, and the Weibo text corpus is an important data set for analyzing people's views on the latest events. Unlike long, standard texts, the Weibo corpus consists of relatively informal texts that favor colloquial expressions and short length [9]. Yao et al. [3] used the corpus to organize the Chinese Weibo sentiment analysis evaluation at the 2nd CCF Conference on Natural Language Processing & Chinese Computing (NLP&CC 2013), which strongly promoted research on Weibo sentiment analysis.…”
Section: Related Work
confidence: 99%
“…After that, the anti-word set is used to create the AP features for the CRF model by calculating the AP value of the current observed token according to Eq. (8). The AP value is also discretized before being fed to the CRF model, in accordance with the following scheme.…”
Section: Ce(token)·ce(chara)
confidence: 99%
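The excerpt above describes computing a real-valued AP score for the current token and then discretizing it into a feature for a CRF model. Its Eq. (8) and bin boundaries are not reproduced in the excerpt, so the sketch below only illustrates the general pattern of mapping a continuous score to a binned feature string; the bin edges and feature names are hypothetical.

```python
# Illustrative sketch of turning a real-valued score into a discretized
# feature string for a CRF toolkit (CRF++/CRFsuite-style observation features).
# Assumptions: the AP formula from the cited Eq. (8) is not shown in the
# excerpt, so `ap_value` is taken as given; the bin edges below are hypothetical.

def discretize_ap(ap_value, edges=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Map a continuous AP value to a coarse bin label such as 'AP=3'."""
    bin_id = sum(ap_value >= e for e in edges)
    return f"AP={bin_id}"

def token_features(token, ap_value):
    """Combine a surface feature with the discretized AP feature for one token."""
    return [f"W={token}", discretize_ap(ap_value)]

print(token_features("微博", 0.62))   # -> ['W=微博', 'AP=3']
```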
“…The training and test corpora are released by NLPCC 2015 for the shared task of microblog-oriented CWS [8], as shown in Table 2. In addition, we collect 300,000 unlabeled tweets (including 20 billion words) as the background corpus to extract features for the semi-supervised initial segmenter.…”
Section: Datasets
confidence: 99%
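The last excerpt mentions extracting features from a large unlabeled tweet corpus to support a semi-supervised initial segmenter, but it does not say which statistics are used. As one commonly used possibility (an assumption, not the cited authors' actual feature set), the sketch below computes character-bigram pointwise mutual information from unlabeled text; high PMI between adjacent characters is often taken as evidence that they belong to the same word.

```python
# Hedged sketch of deriving one kind of feature from an unlabeled background
# corpus. The excerpt does not specify the features used, so character-bigram
# pointwise mutual information (PMI) is an assumed, commonly used choice.
import math
from collections import Counter

def build_pmi(texts):
    """Return a pmi(a, b) function estimated from unlabeled character strings."""
    uni, bi = Counter(), Counter()
    for line in texts:
        uni.update(line)                    # character unigram counts
        bi.update(zip(line, line[1:]))      # adjacent character bigram counts
    n_uni, n_bi = sum(uni.values()), sum(bi.values())

    def pmi(a, b):
        if bi[(a, b)] == 0:
            return float("-inf")
        p_ab = bi[(a, b)] / n_bi
        p_a, p_b = uni[a] / n_uni, uni[b] / n_uni
        return math.log(p_ab / (p_a * p_b))

    return pmi

pmi = build_pmi(["我们喜欢微博", "微博很流行"])
print(pmi("微", "博"))  # high PMI suggests '微博' tends to cohere as a word
```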