A Transition-based Model for Joint Segmentation, POS-tagging and Normalization

Qian, Tao; Zhang, Yue; Zhang, Meishan; Ren, Yafeng; Ji, Donghong

doi:10.18653/v1/d15-1211

Cited by 25 publications

(12 citation statements)

References 21 publications

“…Zhang, Chen and Huang 2014 use a graph-based approach for Chinese social media text normalization. Qian et al 2015 use a transition-based model for joint segmentation, POS-tagging and normalization for the Chinese language. Duran et al 2015 propose a lexicon-based tool for user-generated content (UGC) normalization in Brazilian Portuguese.…”

Section: Related Work (mentioning)

confidence: 99%

Social media text normalization for Turkish

Eryiğit

Torunoğlu-Selamet

2017

Nat. Lang. Eng.

Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. As is known from different areas of natural language processing, morphologically rich languages (MRLs) pose many different challenges when compared to English. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage pure statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides solutions for each of them by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case insensitive evaluation).

“…Their rules were also implemented in a recent MA toolkit, Juman++ (Tolmachev et al, 2020). For English and Chinese, various classification methods for normalization of informal words (Li and Yarowsky, 2008; Wang et al, 2013; Han and Baldwin, 2011; Jin, 2015; van der Goot, 2019) have been developed based on, for example, string, phonetic, semantic similarity, or co-occurrence frequency. Qian et al (2015) proposed a transition-based method with append(x), separate(x), and separate_and_substitute(x,y) operations for the joint word segmentation, POS tagging, and normalization of Chinese microblog text. Dekker and van der Goot (2020) automatically generated pseudo training data from English raw tweets using noise insertion operations to achieve comparable performance without manually annotated data to an existing LN system.…”

Section: Classification of Linguistic Phenomena in UGT (mentioning)

confidence: 99%

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

Higashiyama

Utiyama

Watanabe

et al. 2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.
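The citation statement above names three operations from Qian et al (2015): append(x), separate(x), and separate_and_substitute(x,y). The following is a minimal sketch of how such a transition system could consume a character sequence; it is not the authors' implementation. The statement does not define x and y, so the argument meanings here are assumptions: `separate` takes a POS tag, `separate_and_substitute` takes a POS tag and a formal replacement, and `append` takes no argument. The class names, the `run` helper, the example sentence, and the tags are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    surface: str                      # characters as they appear in the input
    tag: str                          # POS tag chosen when the word was opened
    normalized: Optional[str] = None  # formal form, if a substitution was made


@dataclass
class State:
    chars: str
    pos: int = 0                      # index of the next unread character
    words: List[Word] = field(default_factory=list)

    def _next_char(self) -> str:
        if self.pos >= len(self.chars):
            raise ValueError("no characters left to consume")
        ch = self.chars[self.pos]
        self.pos += 1
        return ch

    def separate(self, tag: str) -> None:
        # Assumed meaning: start a new word at the next character, tagged `tag`.
        self.words.append(Word(self._next_char(), tag))

    def append(self) -> None:
        # Assumed meaning: attach the next character to the word in progress.
        if not self.words:
            raise ValueError("append needs a word in progress")
        self.words[-1].surface += self._next_char()

    def separate_and_substitute(self, tag: str, normalized: str) -> None:
        # Assumed meaning: start a new word and record its formal replacement.
        self.words.append(Word(self._next_char(), tag, normalized))


def run(chars: str, actions):
    """Apply (name, *args) actions in order; return (surface, tag, normalized)."""
    state = State(chars)
    for name, *args in actions:
        getattr(state, name)(*args)
    if state.pos != len(chars):
        raise ValueError("action sequence did not consume the whole input")
    return [(w.surface, w.tag, w.normalized or w.surface) for w in state.words]


# Hypothetical example: the informal "神马" is normalized to "什么".
result = run(
    "你说神马",
    [
        ("separate", "PN"),
        ("separate", "VV"),
        ("separate_and_substitute", "PN", "什么"),
        ("append",),
    ],
)
```

In this sketch a single left-to-right pass yields segmentation, tags, and normalized forms together, which is the sense in which the three tasks are handled jointly. A real system would score competing action sequences with a trained model rather than follow a given list.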

“…However, the phonetic similarity used in these systems cannot be applied to Chinese words since Pinyin has its own specific characteristics, which do not easily map to English, for determining phonetic similarity. Another main application of phonetic similarity algorithms is text normalization (Xia et al, 2006;Li et al, 2003;Han et al, 2012;Sonmez and Ozgur, 2014;Qian et al, 2015), where phonetic similarity is measured by a combination of initial and final similarities. However, the encodings used in these approaches are too coarse-grained, yielding low F1 measures.…”

Section: Related Work (mentioning)

confidence: 99%

Untitled

Li

Danilevsky

Noeman

et al. 2018

Proceedings of the 22nd Conference on Computational Natural Language Learning

Phonetic similarity algorithms identify words and phrases with similar pronunciation which are used in many natural language processing tasks. However, existing approaches are designed mainly for Indo-European languages and fail to capture the unique properties of Chinese pronunciation. In this paper, we propose a high dimensional encoded phonetic similarity algorithm for Chinese, DIMSIM. The encodings are learned from annotated data to separately map initial and final phonemes into n-dimensional coordinates. Pinyin phonetic similarities are then calculated by aggregating the similarities of initial, final and tone. DIMSIM demonstrates a 7.5X improvement on mean reciprocal rank over the state-of-the-art phonetic similarity approaches.
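The abstract above describes mapping initials and finals to coordinates and then aggregating initial, final, and tone similarities. A rough sketch of that idea follows; it is not the DIMSIM algorithm itself. The 2-D coordinates are made up for illustration (the paper learns its encodings from annotated data), and the plain sum, the Euclidean distance, and the `tone_weight` parameter are assumptions.

```python
import math

# Made-up 2-D coordinates, for illustration only; not learned values.
INITIAL_XY = {"zh": (0.0, 0.0), "z": (0.5, 0.0), "b": (5.0, 3.0)}
FINAL_XY = {"ang": (0.0, 0.0), "an": (0.4, 0.0), "i": (4.0, 4.0)}


def pinyin_distance(a, b, tone_weight=0.1):
    """Distance between two syllables given as (initial, final, tone).

    Smaller means more similar. The three components are simply summed,
    which is an assumed aggregation rule.
    """
    (ini_a, fin_a, tone_a), (ini_b, fin_b, tone_b) = a, b
    d_initial = math.dist(INITIAL_XY[ini_a], INITIAL_XY[ini_b])
    d_final = math.dist(FINAL_XY[fin_a], FINAL_XY[fin_b])
    d_tone = tone_weight * abs(tone_a - tone_b)  # assumed tone penalty
    return d_initial + d_final + d_tone


near = pinyin_distance(("zh", "ang", 1), ("z", "an", 1))
far = pinyin_distance(("zh", "ang", 1), ("b", "i", 1))
```

With these toy coordinates, "zhang" sits closer to "zan" than to "bi" because both its initial and its final were placed near their counterparts. Placing confusable phonemes near each other is the behavior that learned encodings are meant to capture.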
