2018
DOI: 10.1162/tacl_a_00033

Universal Word Segmentation: Implementation and Interpretation

Abstract: Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and nega…
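The abstract casts word segmentation as character-level sequence tagging. As a minimal illustrative sketch (not the authors' code), the snippet below shows how segmented text can be converted to and from a BIES-style character tag sequence; the tag scheme is a common convention assumed here, and the helper names are made up for illustration.

```python
# Illustrative sketch: word segmentation as character-level sequence tagging.
# Assumed tag scheme (a common convention): B = begin, I = inside,
# E = end of a multi-character word, S = single-character word.

def words_to_tags(words):
    """Convert a list of words into one BIES tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their predicted BIES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):        # word boundary after this character
            words.append(current)
            current = ""
    if current:                       # flush any trailing partial word
        words.append(current)
    return words

if __name__ == "__main__":
    words = ["中国", "人民", "站", "起来", "了"]
    chars = [c for w in words for c in w]
    tags = words_to_tags(words)
    print(list(zip(chars, tags)))
    assert tags_to_words(chars, tags) == words
```

Under this framing, a tagger only has to predict one of four labels per character, and the segmentation is recovered deterministically from the predicted tag sequence.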

Cited by 37 publications (34 citation statements)
References 16 publications
“…We use the default parameter settings introduced by Shao et al. (2018) and train a segmentation model for all treebanks with at least 50 sentences of training data. For treebanks with less or no training data (except Thai, discussed below), we substitute a model for another treebank/language:…”
Section: Sentence and Word Segmentation
confidence: 99%
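The model-selection policy quoted above can be summarized procedurally. The following is a minimal sketch assuming the 50-sentence threshold stated in the excerpt; the function name and the substitution map are hypothetical, and the actual per-treebank substitutions (elided in the quote) are not reproduced.

```python
# Illustrative sketch of the model-selection policy described above.
# The real per-treebank substitutions are not listed in the excerpt, so
# `substitution_map` is a placeholder the caller must supply.

MIN_TRAIN_SENTENCES = 50  # threshold stated in the quoted excerpt

def choose_segmentation_model(treebank, train_sentence_count, substitution_map):
    """Return the treebank whose segmentation model should be used.

    Treebanks with at least MIN_TRAIN_SENTENCES training sentences get their
    own model; smaller or empty treebanks fall back to a substitute treebank.
    """
    if train_sentence_count >= MIN_TRAIN_SENTENCES:
        return treebank
    return substitution_map.get(treebank, treebank)

if __name__ == "__main__":
    subs = {"xx_tiny": "xx_big"}                              # hypothetical mapping
    print(choose_segmentation_model("xx_big", 5000, subs))    # -> "xx_big"
    print(choose_segmentation_model("xx_tiny", 12, subs))     # -> "xx_big"
```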
“…The Uppsala system focuses exclusively on LAS and MLAS, and consists of a three-step pipeline. The first step is a model for joint sentence and word segmentation which uses the BiRNN-CRF framework of Shao et al. (2017, 2018) to predict sentence and word boundaries in the raw input and simultaneously marks multiword tokens that need non-segmental analysis. The second component is a part-of-speech (POS) tagger based on Bohnet et al. (2018), which employs a sentence-based character model and also predicts morphological features.…”
Section: Introduction
confidence: 99%
“…During the review period of this paper, a paper by Shao et al. (2018) appeared which nearly matches the performance of yap on Hebrew segmentation using an RNN approach. Achieving an F-score of 91.01 compared to yap's score of 91.05, but on a dataset with slightly different splits, this system gives a good baseline for a tuned RNN-based system.…”
confidence: 77%
“…While the early works belonging to this category relied on "traditional" classification techniques, such as maximum entropy models [40] and Conditional Random Fields [41], in recent studies neural architectures are being actively explored [23,27,28,30,42]. In 2018, Shao et al. [43] released a language-independent character sequence tagging model based on recurrent neural networks with a Conditional Random Fields interface, designed for performing word segmentation in the Universal Dependencies framework. It obtained state-of-the-art accuracies on a wide range of languages.…”
Section: Related Work
confidence: 99%
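The BiRNN-CRF tagger referenced in these citation statements pairs a bidirectional recurrent encoder over characters with a CRF output layer. As a simplified sketch only (assuming PyTorch, and omitting the CRF layer that the cited model uses for structured decoding), a character-level BiLSTM tagger producing per-character BIES scores might look like this; the class and parameter names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Simplified character-level BiLSTM tagger for BIES word segmentation.

    Note: the cited model adds a CRF layer on top of the emission scores;
    this sketch stops at per-character tag scores for brevity.
    """

    def __init__(self, vocab_size, num_tags=4, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character indices
        x = self.embed(char_ids)
        h, _ = self.bilstm(x)
        return self.proj(h)           # (batch, seq_len, num_tags) tag scores

if __name__ == "__main__":
    model = CharBiLSTMTagger(vocab_size=100)
    dummy = torch.randint(0, 100, (2, 10))    # two sequences of 10 characters
    print(model(dummy).shape)                  # torch.Size([2, 10, 4])
```

In the full model, the CRF layer scores whole tag sequences rather than individual characters, which enforces consistent transitions such as B followed by I or E.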