Abstract: Concerning different approaches to automatic PoS tagging: EngCG-2, a constraint-based morphological tagger, is compared in a double-blind test with a state-of-the-art statistical tagger on a common disambiguation task using a common tag set. The experiments show that, for the same amount of remaining ambiguity, the error rate of the statistical tagger is one order of magnitude greater than that of the rule-based one. The two related issues of priming effects compromising the results and disagreement between huma…
“…To the best of our knowledge, this is the first tagging study that reaches a 98% accuracy level for a data-driven tagger (which must be distinguished from linguistically backuped taggers which come with 'heavy' parsing machinery (Samuelsson and Voutilainen, 1997)). Still, we deal with a specialized sublanguage simpler in structure compared with newspaper language, although we kept it diverse through the various text genres.…”
We ran both Brill's rule-based tagger and TNT, a statistical tagger, with a default German newspaper-language model on a medical text corpus. Supplied with limited lexicon resources, TNT outperforms the Brill tagger with state-of-the-art performance figures (close to 97% accuracy). We then trained TNT on a large annotated medical text corpus, with a slightly extended tagset that captures certain particularities of medical language, and achieved 98% tagging accuracy. Hence, statistical off-the-shelf POS taggers can not only be immediately reused for medical NLP, but, when trained on medical corpora, they also achieve a higher performance level than for the newspaper genre.
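TNT itself is a trigram HMM tagger with suffix-based handling of unknown words; the sketch below is a deliberately simplified bigram analogue (the function names and the count-based scoring are mine, not TnT's actual algorithm), meant only to illustrate what "data-driven" tagging means in these abstracts: the model is nothing but counts collected from an annotated corpus.

```python
from collections import Counter, defaultdict

def train(tagged_sents):
    """Collect word-emission counts and tag-bigram transition counts."""
    emit = defaultdict(Counter)   # word -> Counter of tags seen for it
    trans = defaultdict(Counter)  # previous tag -> Counter of following tags
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag_ in sent:
            emit[word][tag_] += 1
            trans[prev][tag_] += 1
            prev = tag_
    return emit, trans

def tag(words, emit, trans, default="NN"):
    """Greedy left-to-right decoding: pick the tag maximising
    emission count weighted by the transition count from the previous tag."""
    out, prev = [], "<s>"
    for w in words:
        cands = emit.get(w)
        if not cands:
            best = default  # crude unknown-word fallback (TnT uses suffix analysis)
        else:
            best = max(cands, key=lambda t: cands[t] * (1 + trans[prev][t]))
        out.append(best)
        prev = best
    return out
```

Retraining on a domain corpus, as the study above does, amounts to replacing the newspaper-derived counts with counts from annotated medical text; nothing in the decoding changes.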
“…In computational linguistics, the main work that has been done on improving the taxonomy of tags to allow clearer automatic tagging and improving the conventions by which tags are assigned has been done within the English Constraint Grammar tradition [18,19]. Contrary to the results above, this work has achieved quite outstanding interannotator agreement (up to 99.3% prior to adjudication), in part by the exhaustiveness of the conventions for tagging but also in part by simplifying decisions for tagging (e.g., all -ing participles that premodify a noun are tagged as adjectives, regardless).…”
Abstract. I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained, that is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
Isn't Part-of-Speech Tagging a Solved Task? At first glance, current part-of-speech taggers work rapidly and reliably, with per-token accuracies of slightly over 97% [1][2][3][4]. Looked at more carefully, the story is not quite so rosy. This evaluation measure is easy both because it is measured per-token and because you get points for every punctuation mark and other tokens that are not ambiguous. It is perhaps more realistic to look at the rate of getting whole sentences right, since a single bad mistake in a sentence can greatly throw off the usefulness of a tagger to downstream tasks such as dependency parsing. Current good taggers have sentence accuracies around 55-57%, which is a much more modest score. Accuracies also drop markedly when there are differences in topic, epoch, or writing style between the training and operational data. Still, the perception has been that same-epoch-and-domain part-of-speech tagging is a solved problem and its accuracy cannot really be pushed higher.
Abstract. We have applied the inductive learning of statistical decision trees and relaxation labeling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part-of-Speech Tagging). The learning process is supervised and obtains a language model oriented to resolving POS ambiguities, consisting of a set of statistical decision trees expressing the distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with competitive accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation-labeling-based tagger. To this end, we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5-million-word Spanish corpus from scratch.
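Relaxation labeling, as used by the tagger above, iteratively re-weights each token's candidate tags according to their compatibility with the neighbouring tokens' current tag weights, until the distributions settle. A minimal sketch, with hypothetical hand-written compatibility scores standing in for the paper's learned constraints:

```python
def relax(sentence, candidates, compat, iters=10):
    """Iterative relaxation labeling for tag disambiguation.

    candidates: word -> list of possible tags
    compat: (left_tag, right_tag) -> compatibility score (>= 0 here)
    """
    # Start from a uniform distribution over each token's candidate tags.
    p = [{t: 1.0 / len(candidates[w]) for t in candidates[w]} for w in sentence]
    for _ in range(iters):
        new_p = []
        for i, w in enumerate(sentence):
            scores = {}
            for t in candidates[w]:
                # Support = compatibility with both neighbours' current weights.
                s = 0.0
                for j in (i - 1, i + 1):
                    if 0 <= j < len(sentence):
                        for u, q in p[j].items():
                            pair = (u, t) if j < i else (t, u)
                            s += compat.get(pair, 0.0) * q
                scores[t] = p[i][t] * (1.0 + s)
            z = sum(scores.values())
            new_p.append({t: v / z for t, v in scores.items()})
        p = new_p
    # Return the highest-weighted tag at each position.
    return [max(d, key=d.get) for d in p]
```

The appeal noted in the abstract is that any knowledge source, from n-gram statistics to hand-written linguistic constraints or rules extracted from decision trees, can be dropped into the same `compat` table.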