Universal dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
In pursuing machine understanding of human language, highly accurate syntactic analysis is a crucial step. In this work, we focus on dependency grammar, which models syntax by encoding transparent predicate-argument structures. Recent advances in dependency parsing have shown that employing higherorder subtree structures in graph-based parsers can substantially improve the parsing accuracy. However, the inefficiency of this approach increases with the order of the subtrees. This work explores a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection methods. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system achieves the best performance among known supervised systems evaluated on these datasets, improving the baseline accuracy from 91.88% to 93.42% for English, and from 87.39% to 89.25% for Chinese.
The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters. However, existing methods have not yet fully utilized the potentials of Chinese characters. In this paper, we investigate the usefulness of character-level part-of-speech in the task of Chinese morphological analysis. We propose the first tagset designed for the task of character-level POS tagging. We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging. Through experiments, we demonstrate that by introducing character-level POS information, the performance of a baseline morphological analyzer can be significantly improved.
Chinese word segmentation is an initial and important step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-ofvocabulary words is still a major problem in this field of study. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substrings, called maximized substrings, which provide good estimations of unknown word boundaries. In the task of Chinese word segmentation, we use these substrings which are extracted from large scale unlabeled data to improve the segmentation accuracy. The effectiveness of this approach is demonstrated through experiments using various data sets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing the results with a widely applied Chinese word recognition method in a previous study.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.