This paper addresses the problem of correcting spelling errors that result in valid, though unintended words (such as peace and piece, or quiet and quite) and also the problem of correcting particular word usage errors (such as amount and number, or among and between). Such corrections require contextual information and are not handled by conventional spelling programs such as Unix spell. First, we introduce a method called Trigrams that uses part-of-speech trigrams to encode the context. This method uses a small number of parameters compared to previous methods based on word trigrams. However, it is e ectively unable to distinguish among words that have the same part of speech. For this case, an alternative feature-based method called Bayes performs better; but Bayes is less e ective than Trigrams when the distinction among words depends on syntactic constraints. A hybrid method called Tribayes is then introduced that combines the best of the previous two methods. The improvement in performance of Tribayes over its components is veri ed experimentally. Tribayes is also compared with the grammar checker in Microsoft Word, and is found to have substantially higher performance.
Abstract.A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts depend on only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. In the work reported here, we present an algorithm combining variants of Winnow and weightedmajority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting to for too, casual for causal, and so on. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) While several aspects of WinSpell's architecture contribute to its superiority over BaySpell, the primary factor is that it is able to learn a better linear separator than BaySpell learns; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.