Active learning for part-of-speech tagging

Ringger, Eric K.; Morales, Peter; Haertel, Robbie; Busby, George; Carmen, Marc; Carroll, James L.; Seppi, Kevin; Lonsdale, Deryle

doi:10.3115/1642059.1642075

Cited by 37 publications

(29 citation statements)

References 15 publications

“…The direction that Ringger et al (2007) pursue is perhaps the most similar to ours. They attempt to reduce supervision required for high POS tagging performance based on active learning.…”

Section: Related Work (mentioning)

confidence: 64%

Simple Semi-Supervised POS Tagging

Stratos

Collins

2015

Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

We tackle the question: how much supervision is needed to achieve state-of-the-art performance in part-of-speech (POS) tagging, if we leverage lexical representations given by the model of Brown et al. (1992)? It has become a standard practice to use automatically induced "Brown clusters" in place of POS tags. We claim that the underlying sequence model for these clusters is particularly well-suited for capturing POS tags. We empirically demonstrate this claim by drastically reducing supervision in POS tagging with these representations. Using either the bit-string form given by the algorithm of Brown et al. (1992) or the (less well-known) embedding form given by the canonical correlation analysis algorithm of Stratos et al. (2014), we can obtain 93% tagging accuracy with just 400 labeled words and achieve state-of-the-art accuracy (> 97%) with less than 1 percent of the original training data.

“…Active learning (further elaborated on in Section 5) has previously been successfully applied to a number of language technology tasks, including information extraction [26,6], named entity recognition [27,30], text categorization [14,11], and part-of-speech tagging [7,25]. When applicable, the active learning paradigm has the desirable effect of creating high performing classifiers using less data than required by competitive classifiers trained on a random selection of data.…”

Section: The Bootmark Methods (mentioning)

confidence: 99%

On privacy preservation in text and document-based active learning for named entity recognition

Olsson

2009

Proceedings of the ACM First International Workshop on Privacy and Anonymity for Very Large Databases

The preservation of the privacy of persons mentioned in text requires the ability to automatically recognize and identify names. Named entity recognition is a mature field and most current approaches are based on supervised machine learning techniques. Such learning requires the presence of labeled examples on which to train; training examples are usually provided to the learner in the form of annotated corpora. Creating and annotating corpora is a tedious, meticulous and error-prone process; obtaining good training examples is a hard task in itself. This paper describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. Experimental results show that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The investigation further indicates that the primary gain obtained by BootMark compared to passive learning is in terms of higher recall. Thus, it is argued, the recognizers are suitable for use in privacy preservation applications.

“…One of the important tasks for the future is the compilation of a part-of-speech annotated corpus, which will allow us to build more robust disambiguation models. The tagger presented in this paper, while imperfect, can be useful in the process of creating such corpus, e.g., by applying it in an active learning scenario [61].…”

Section: Tokenization (mentioning)

confidence: 99%

Improving Basic Natural Language Processing Tools for the Ainu Language

et al. 2019

Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.

The aim of this research is to develop technologies for automatic processing of Ainu, a language isolate that is native to northern parts of Japan, which is currently recognized as nearly extinct (e.g., by Lewis et al. [13]). In particular, we aimed at improving the part-of-speech tagger for the Ainu language (POST-AL), a tool for computer-supported linguistic analysis of the Ainu language, initially developed by Ptaszynski and Momouchi [14].

The task of developing NLP tools for Ainu poses several challenges. Firstly, large-scale digital language resources required for many NLP tasks (such as annotated corpora) are not available for the Ainu language. In this paper we describe our attempt to solve this problem by merging two different digitized dictionaries into one data set. Secondly, there exists no single standard for transcription and word segmentation of the Ainu language, especially in texts collected in earlier years. To address that problem, POST-AL has been equipped with the functions of transcription normalization and word segmentation. In this paper we describe in detail the proposed methodology including recent improvements. Another functionality of POST-AL is part-of-speech (POS) tagging. To improve this accuracy we developed a hybrid method of POS disambiguation, combining lexical n-grams and term frequency. The results of evaluation experiments presented in this paper show that there are differences in part-of-speech classification of certain forms between authors of different dictionaries and text annotations, which creates yet another challenge, to be tackled in the future.

The remainder of this paper is organized as follows. In Section 2 we briefly describe the characteristics and the current status of the Ainu language. In Section 3 we provide an overview of some of the previous studies on the Ainu language, including the few existing research projects in the field of natural language processing. Section 4 presents our algorithms for normalization, word segmentation and part-of-speech tagging. In Sections 5 and 6 we introduce the training data (dictionaries) and test data used in this research. Section 7 summarizes the evaluation methods we applied. In Section 8 we present the results of the evaluation experiments. Finally, Section 9 contains conclusions and some ideas for future improvements.
