Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 2015
DOI: 10.18653/v1/d15-1211

A Transition-based Model for Joint Segmentation, POS-tagging and Normalization

Abstract: We propose a transition-based model for joint word segmentation, POS tagging and text normalization. Different from previous methods, the model can be trained on standard text corpora, overcoming the lack of annotated microblog corpora. To evaluate our model, we develop an annotated corpus based on microblogs. Experimental results show that our joint model can help improve the performance of word segmentation on microblogs, giving an error reduction in segmentation accuracy of 12.02%, compared to the tradition…

Citations: cited by 25 publications (12 citation statements)
References: 21 publications

Citation statements
“…Zhang, Chen and Huang 2014 use a graph-based approach for Chinese social media text normalization. Qian et al 2015 use a transition-based model for joint segmentation, POS-tagging and normalization for the Chinese language. Duran et al 2015 propose a lexicon-based tool for user-generated content (UGC) normalization in Brazilian Portuguese.…”
Section: Related Work (mentioning)
confidence: 99%
“…Their rules were also implemented in a recent MA toolkit Juman++ (Tolmachev et al, 2020). For English and Chinese, various classification methods for normalization of informal words (Li and Yarowsky, 2008; Wang et al, 2013; Han and Baldwin, 2011; Jin, 2015; van der Goot, 2019) have been developed based on, for example, string, phonetic, semantic similarity, or co-occurrence frequency. Qian et al (2015) proposed a transition-based method with append(x), separate(x), and separate_and_substitute(x,y) operations for the joint word segmentation, POS tagging, and normalization of Chinese microblog text. Dekker and van der Goot (2020) automatically generated pseudo training data from English raw tweets using noise insertion operations to achieve comparable performance without manually annotated data to an existing LN system.…”
Section: Classification of Linguistic Phenomena in UGT (mentioning)
confidence: 99%
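
The three operations named in the statement above can be illustrated with a toy sketch. The action names come from the citation statement; their semantics here are assumptions, not the definitions given by Qian et al (2015): separate(x) is assumed to open a new word with character x, append(x) to extend the current word, and separate_and_substitute(x, y) to open a new word with x while recording y as its normalized form. The `State` class and `result` helper are hypothetical, POS tags and the scoring model are left out, and a Latin-script string stands in for Chinese microblog text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class State:
    """Hypothetical parser state: surface words plus an optional normalized form each."""
    words: List[str] = field(default_factory=list)
    norms: List[Optional[str]] = field(default_factory=list)


def separate(state: State, x: str) -> None:
    # Assumed semantics: start a new word with character x, no normalization.
    state.words.append(x)
    state.norms.append(None)


def append(state: State, x: str) -> None:
    # Assumed semantics: attach character x to the word currently being built.
    if not state.words:
        raise ValueError("append requires an open word")
    state.words[-1] += x


def separate_and_substitute(state: State, x: str, y: str) -> None:
    # Assumed semantics: start a new word with x and record y as its formal form.
    state.words.append(x)
    state.norms.append(y)


def result(state: State) -> List[Tuple[str, str]]:
    # Pair each surface word with its normalized form (itself if none was recorded).
    return [(w, n if n is not None else w) for w, n in zip(state.words, state.norms)]


# Toy action sequence over the characters of "thxall".
state = State()
separate_and_substitute(state, "t", "thanks")
append(state, "h")
append(state, "x")
separate(state, "a")
append(state, "l")
append(state, "l")
pairs = result(state)
print(pairs)
```

In this sketch one pass over the characters yields both the segmentation and the normalized forms, which is the sense in which the tasks are handled jointly; a real system would choose among the actions with a learned scoring function.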
“…However, the phonetic similarity used in these systems cannot be applied to Chinese words since Pinyin has its own specific characteristics, which do not easily map to English, for determining phonetic similarity. Another main application of phonetic similarity algorithms is text normalization (Xia et al, 2006; Li et al, 2003; Han et al, 2012; Sonmez and Ozgur, 2014; Qian et al, 2015), where phonetic similarity is measured by a combination of initial and final similarities. However, the encodings used in these approaches are too coarse-grained, yielding low F1 measures.…”
Section: Related Work (mentioning)
confidence: 99%