Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1531
Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

Abstract: Previous approaches to unsupervised Chinese word segmentation (CWS) can be roughly classified into discriminative and generative models. The former use carefully designed goodness measures to score candidate segmentations, while the latter focus on finding the segmentation with the highest generative probability. However, while discriminative models can be extended into neural versions in a trivial way by using neural language models, extending generative models is non-trivial. In this …
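The generative view in the abstract can be made concrete with a small sketch. The following is a minimal, illustrative dynamic program (not the authors' code) for recovering the segmentation with the highest generative probability under a segmental language model; the seg_logprob scoring function and the max_seg_len cap are assumptions introduced here for illustration.

    # Minimal sketch of generative segmentation search (illustrative only).
    # seg_logprob(prefix, segment) is a hypothetical stand-in for a trained
    # neural model returning log p(segment | prefix).
    from typing import Callable, List, Tuple

    def best_segmentation(x: str,
                          seg_logprob: Callable[[str, str], float],
                          max_seg_len: int = 4) -> Tuple[List[str], float]:
        """Viterbi-style DP over all segmentations whose concatenation is x."""
        n = len(x)
        best = [float("-inf")] * (n + 1)  # best[i]: top log-prob of x[:i]
        back = [0] * (n + 1)              # back[i]: start index of last segment
        best[0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(0, i - max_seg_len), i):
                score = best[j] + seg_logprob(x[:j], x[j:i])
                if score > best[i]:
                    best[i], back[i] = score, j
        segments, i = [], n
        while i > 0:                      # follow backpointers to recover segments
            segments.append(x[back[i]:i])
            i = back[i]
        return segments[::-1], best[n]

Under these assumptions, calling best_segmentation on a raw character string returns the highest-scoring segment sequence and its log-probability; the quadratic-in-length DP is what makes exact search tractable without enumerating all 2^(n-1) segmentations.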

Cited by 18 publications (42 citation statements)
References 13 publications
“…Another line of techniques has focused on models that are both strong language models and good at sequence segmentation. Many are in some way based on Connectionist Temporal Classification (Graves et al., 2006), and include Sleep-WAke Networks (Wang et al., 2017), Segmental RNNs (Kong et al., 2016), and Segmental Language Models (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021; Downey et al., 2021). In this work, we conduct experiments using the Masked Segmental Language Model of Downey et al. (2021), due to its good performance and scalability, the latter of which is usually regarded as an obligatory feature of crosslingual models (Conneau et al., 2020a; Xue et al., 2021, inter alia).…”
Section: Related Work
confidence: 99%
“…The sentences originally came in a train/validation/test split, but because gold-segmented sentences are so rare, we concatenate these sets and then split them in half into final validation and test sets. MSLMs: An MSLM is a variant of a Segmental Language Model (SLM) (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021), which takes as input a sequence of characters x and outputs a probability distribution for a sequence of segments y such that the concatenation of the segments of y is equivalent to x: π(y) = x. An MSLM is composed of a Segmental Transformer Encoder and an LSTM-based Segment Decoder (Downey et al., 2021).…”
Section: Data and Pre-processing
confidence: 99%
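The constraint π(y) = x in the statement above means every candidate segment sequence y must concatenate back to the input string. A hedged sketch of the corresponding training quantity, the log marginal likelihood summed over all such segmentations, computed by a forward DP (again with the hypothetical seg_logprob and an assumed maximum segment length):

    # Illustrative forward DP for a segmental language model's marginal
    # likelihood; seg_logprob and max_seg_len are assumptions, not the
    # published implementation.
    import math
    from typing import Callable

    def logsumexp(values):
        """Numerically stable log of a sum of exponentials."""
        m = max(values)
        return m + math.log(sum(math.exp(v - m) for v in values))

    def marginal_logprob(x: str,
                         seg_logprob: Callable[[str, str], float],
                         max_seg_len: int = 4) -> float:
        """log p(x) = log of the sum, over all y with concat(y) == x,
        of the product of per-segment probabilities."""
        n = len(x)
        alpha = [float("-inf")] * (n + 1)  # alpha[i]: log p of generating x[:i]
        alpha[0] = 0.0
        for i in range(1, n + 1):
            alpha[i] = logsumexp([alpha[j] + seg_logprob(x[:j], x[j:i])
                                  for j in range(max(0, i - max_seg_len), i)])
        return alpha[n]

Maximizing this marginal over a corpus trains the model without any gold segmentation; the best segmentation is then read off with a Viterbi pass like the earlier sketch.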
“…To solve this problem we propose a type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019), based on the powerful neural Transformer architecture (Vaswani et al., 2017).…”
Section: Introduction
confidence: 99%
“…Near-perfect supervised methods have been developed for use in resource-rich languages such as Chinese, but many of the world's languages are morphologically complex and have no large dataset of "gold" segmentations into meaningful units. To solve this problem, we propose a new type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021), for use in both unsupervised and lightly supervised segmentation tasks. We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture, harnessing the power of a bi-directional masked modeling context and attention.…”
confidence: 99%