An efficient implementation of a part-of-speech tagger for Swedish is described. The stochastic tagger uses a well-established Markov model of the language. The tagger tags 92 per cent of unknown words correctly and up to 97 per cent of all words. Several implementation and optimization considerations are discussed. The main contribution of this paper is the thorough description of the tagging algorithm and the addition of a number of improvements. The paper contains enough detail for the reader to construct a tagger for their own language.

… grammar checking. These applications require the tagger to be both efficient (tagging quickly, which is especially important in information retrieval) and accurate (tagging correctly, which is especially important in translation). In some applications it is not even enough to have the text syntactically disambiguated: word sense disambiguation is needed, and that is an even harder problem [1].

Part-of-speech taggers can be constructed in various ways, and different types of taggers have different advantages. Taggers can be based on stochastic models [2-7], on rules [8,9], or on neural networks [10]. In a recent paper, Samuelsson and Voutilainen claim that rule-based taggers can give higher tagging accuracy than plain stochastic taggers on correct texts [11]. However, hybrids between rule-based taggers and stochastic taggers might be even better [12]. Several stochastic models for tagging unknown words exist [2,4]. A good survey of automatic stochastic part-of-speech tagging is Charniak [13].

In this paper, we describe an implementation of a part-of-speech tagger for Swedish. We wanted the tagger to be easy to implement, fast, language independent, and tag set independent, and to give high tagging accuracy. We also wanted the tagger to be able to cope with unknown words and grammatically erroneous sentences.
This ability is needed in various applications, such as grammar and spell checking. Given these requirements, we chose to construct a stochastic tagger based on a Markov model. Our goal was to achieve 95 per cent tagging accuracy for known words and 70 per cent accuracy for unknown words, and we both reached and surpassed this goal. We use the tagger in a grammar checking program for Swedish, named GRANSKA, but we designed it to be as language independent as possible, and we think that it can be used for most inflectional languages, for any tag set, and in any application needing part-of-speech tagging. As it turned out, when incorporated into GRANSKA, our tagger actually became a hybrid between a stochastic tagger and a rule-based tagger: for certain complicated cases where the stochastic tagger could be wrong, we use rules to find the correct tagging.

THE TAGGING MODEL

Markov model

In this section, we briefly describe the Markov model that is used as a stochastic model of the language. A complete and excellent description of the equations used in the standard Markov model for part-of-speech tagging can be found in Charniak et al. [2].
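The Markov model described above can be made concrete with a small sketch. The following Viterbi decoder illustrates how the most probable tag sequence is chosen under a bigram (first-order) Markov model; the toy tag set and all probabilities are invented for illustration and do not come from the paper.

```python
# Minimal sketch of Viterbi decoding for a bigram HMM tagger.
# Tag set and probabilities are toy values, not taken from the paper.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # Each layer maps tag -> (probability of best path ending in tag, that path).
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            # Extend the best previous path with tag t.
            prob, path = max(
                (V[-1][pt][0] * trans_p[pt][t] * emit_p[t].get(w, 1e-6),
                 V[-1][pt][1] + [t])
                for pt in tags)
            layer[t] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

tags = ["dt", "nn", "vb"]  # determiner, noun, verb
start_p = {"dt": 0.6, "nn": 0.3, "vb": 0.1}
trans_p = {"dt": {"dt": 0.05, "nn": 0.85, "vb": 0.10},
           "nn": {"dt": 0.10, "nn": 0.20, "vb": 0.70},
           "vb": {"dt": 0.50, "nn": 0.40, "vb": 0.10}}
emit_p = {"dt": {"en": 0.9},
          "nn": {"hund": 0.5, "katt": 0.5},
          "vb": {"springer": 0.8}}

# "en hund springer" = "a dog runs"
print(viterbi(["en", "hund", "springer"], tags, start_p, trans_p, emit_p))
# → ['dt', 'nn', 'vb']
```

A real tagger of this kind would estimate the transition and emission probabilities from a tagged corpus and typically use a second-order (trigram) model, but the decoding principle is the same.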
A number of word prediction systems commonly used by individuals requiring augmentative and alternative communication (AAC) combine syntactic information with heuristic methods, such as recency promotion and word learning, in an attempt both to reduce keystrokes and to make predictions more useful to the user. It has been suggested that word predictors with syntactic and heuristic knowledge could make more accurate and appropriate predictions; that these predictions might, as a result, present a lower cognitive load for the user; and that more appropriate predictions might also steer users with dyslexia away from confusing ones (Tyvand & Demasco, 1993). The prospect of such benefits has often inspired researchers in this area despite disappointingly small increases in keystroke savings. This article describes a system in which probabilistic models of word sequences and word class sequences are integrated with complementary heuristic methods to improve both the quality of the words predicted and the keystroke savings. A review of research on syntactic and heuristic methods in word prediction, and a discussion of possible language dependence, precedes the description of the study.

PREVIOUS RESEARCH

Syntax

Syntactic information that has been used to improve word prediction includes statistics for word class sequences and rules for grammatically correct sentence structure in a number of languages. Algorithms include various types of parsers and probabilistic methods, such as Markov models and artificial neural networks (ANNs). At present, a few commercial systems, including Aurora and Co:Writer®, use some type of grammatical information (Boekestein, 1996).
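The idea of integrating word-sequence and word-class-sequence statistics can be sketched as a simple linear interpolation: each candidate word is scored by a weighted mix of a word-bigram probability and a class-bigram probability. The weight, the tiny bigram tables, and the `word_class` mapping below are all invented for illustration; the article's actual model is more elaborate.

```python
# Hypothetical sketch: ranking prediction candidates by interpolating a
# word-bigram model with a word-class-bigram model. All probabilities,
# the interpolation weight, and the lexicon are invented for illustration.

LAMBDA = 0.7  # interpolation weight between word and class models (assumed)

word_bigram = {("the", "dog"): 0.02, ("the", "dot"): 0.001}  # P(w2 | w1)
class_bigram = {("det", "noun"): 0.6, ("det", "verb"): 0.1}  # P(c2 | c1)
word_class = {"dog": "noun", "dot": "noun", "runs": "verb", "the": "det"}
word_given_class = {"dog": 0.3, "dot": 0.01, "runs": 0.2}    # P(w | class)

def score(prev_word, candidate):
    """Interpolated probability of `candidate` following `prev_word`."""
    p_word = word_bigram.get((prev_word, candidate), 0.0)
    p_class = (class_bigram.get((word_class[prev_word], word_class[candidate]), 0.0)
               * word_given_class.get(candidate, 0.0))
    return LAMBDA * p_word + (1 - LAMBDA) * p_class

candidates = ["dog", "dot", "runs"]
ranked = sorted(candidates, key=lambda w: score("the", w), reverse=True)
print(ranked)
# → ['dog', 'runs', 'dot']
```

The class component rescues sensible candidates that the sparse word-bigram table misses ("runs" outranks "dot" here), which is the intuition behind adding syntactic statistics to a word-based predictor.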
Multiword prediction (usually word pairs) is also provided in a number of systems (e.g., Aurora 3.0 for Windows, Co:Writer®, EZ Keys, Finish Line, KeyCache).

To include syntactic information in an early version of the Swedish word prediction system, a 10,000-word lexicon was marked with word class information. These word classes are those typically used when studying the grammar of a language: for Swedish, they include noun, verb, adjective, and function word classes, as well as subclasses such as singular/plural, gender, and definite/indefinite for nouns and adjectives, and tense for verbs. A study was first done to determine the maximal possible savings in keystrokes if the word class were known. In a 1,331-word sample from a somewhat telegraphic personal communicator text, it was found that …

AAC Augmentative and Alternative Communication, Volume 17, December 2001 (0743-4618/01/1704-0255 $3.00/0)

The goal of this project was to design and implement a new word predictor for Swedish that would suggest words that are more grammatically appropriate, thus presenting a lower cognitive load for users and saving significantly more keystrokes than the previous predictor. The new predictor that was designed and developed uses a probabilistic language model based on the well-established idea...
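Keystroke savings, the metric the study measures, compares the keystrokes needed with prediction against typing every character. A minimal sketch, assuming one keystroke to select a predicted word and one space per word (both assumptions, not details from the article):

```python
# Sketch of the keystroke-savings metric. Assumes selecting a prediction
# costs one keystroke and each word normally costs its letters plus a space.

def keystroke_savings(words, typed_before_hit):
    """typed_before_hit[i] = letters typed before word i appears in the list."""
    normal = sum(len(w) + 1 for w in words)           # letters + space per word
    with_pred = sum(t + 1 for t in typed_before_hit)  # typed letters + selection
    return 1 - with_pred / normal

words = ["jag", "vill", "springa"]  # "I want to run" (Swedish)
typed = [1, 2, 3]                   # letters typed before each correct prediction
print(round(keystroke_savings(words, typed), 2))
# → 0.47
```

With these toy numbers, 17 keystrokes shrink to 9, a savings of about 47 per cent; the study's figures are of course measured on real text and a real predictor.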