Bootstrapping a multilingual part-of-speech tagger in one person-day

Cucerzan, Silviu; Yarowsky, David

doi:10.3115/1118853.1118859

Cited by 22 publications

(28 citation statements)

References 9 publications

“…There has been some previous work on bootstrapping POS taggers (e.g., Zavrel and Daelemans (2000) and Cucerzan and Yarowsky (2002)), but to our knowledge no previous work on co-training POS taggers.…”

Section: Introduction (mentioning)

confidence: 99%

Bootstrapping POS taggers using unlabelled data

Clark

Curran

Osborne

2003

Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

This paper investigates bootstrapping part-of-speech taggers using co-training, in which two taggers are iteratively re-trained on each other's output. Since the output of the taggers is noisy, there is a question of which newly labelled examples to add to the training set. We investigate selecting examples by directly maximising tagger agreement on unlabelled data, a method which has been theoretically and empirically motivated in the co-training literature. Our results show that agreement-based co-training can significantly improve tagging performance for small seed datasets. Further results show that this form of co-training considerably outperforms self-training. However, we find that simply re-training on all the newly labelled data can, in some cases, yield comparable results to agreement-based co-training, with only a fraction of the computational cost.
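The co-training loop described in this abstract can be illustrated with a minimal sketch. It shows only the cheaper variant the abstract mentions (re-training each tagger on all of the other's newly labelled data), not agreement-based example selection. The names `FeatureTagger` and `co_train`, the two toy "views" (word identity and two-letter suffix), and the tiny data set are assumptions made for illustration; they are not the taggers or data used by Clark, Curran and Osborne.

```python
from collections import Counter, defaultdict


class FeatureTagger:
    """Toy tagger (hypothetical): predicts the tag most often seen with a word feature."""

    def __init__(self, feature):
        self.feature = feature   # function mapping a word to its feature
        self.table = {}          # feature -> most frequent tag
        self.default = None      # fallback tag for unseen features

    def train(self, tagged_sentences):
        counts = defaultdict(Counter)
        overall = Counter()
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[self.feature(word)][tag] += 1
                overall[tag] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
        self.default = overall.most_common(1)[0][0]

    def tag(self, words):
        return [(w, self.table.get(self.feature(w), self.default)) for w in words]


def co_train(tagger_a, tagger_b, seed, unlabelled, rounds=3):
    """Re-train each tagger on the seed data plus ALL of the other tagger's output."""
    tagger_a.train(seed)
    tagger_b.train(seed)
    for _ in range(rounds):
        labelled_by_a = [tagger_a.tag(s) for s in unlabelled]
        labelled_by_b = [tagger_b.tag(s) for s in unlabelled]
        # Each tagger learns from the labels produced by the other view.
        tagger_a.train(seed + labelled_by_b)
        tagger_b.train(seed + labelled_by_a)
    return tagger_a, tagger_b


# Illustrative data only: a small manually tagged seed and some raw sentences.
seed = [
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("jumped", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("chirped", "VERB")],
]
unlabelled = [["the", "cat", "walked"], ["a", "dog", "talked"]]

word_tagger = FeatureTagger(str.lower)                   # view 1: word identity
suffix_tagger = FeatureTagger(lambda w: w[-2:].lower())  # view 2: last two letters
co_train(word_tagger, suffix_tagger, seed, unlabelled, rounds=2)
print(word_tagger.tag(["the", "bird", "walked"]))
```

In this toy setting the word-identity tagger never sees "walked" in the seed data and picks up a tag for it only from the suffix tagger's output, which is the kind of transfer between views that co-training relies on. An agreement-based variant would instead search for a subset of the newly labelled sentences that maximises agreement between the two taggers before re-training.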

“…Bootstrapping is used to create labelled training data from large amounts of unlabelled data (Cucerzan and Yarowsky, 2002).…”

Section: The Bootstrapping Method (mentioning)

confidence: 99%

Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach

Albogamy

Ramsay

Ahmed

2017

Proceedings of the Third Arabic Natural Language Processing Workshop

In this paper, we propose using a "bootstrapping" method for constructing a dependency treebank of Arabic tweets. This method uses a rule-based parser to create a small treebank of one thousand Arabic tweets and a data-driven parser to create a larger treebank by using the small treebank as a seed training set. We are able to create a dependency treebank from unlabelled tweets without any manual intervention. Experimental results show that this method can improve the speed of training the parser and the accuracy of the resulting parsers.

“…2008; Oflazer et al. 2001), and manual encoding of basic linguistic facts (e.g., Cucerzan and Yarowsky 2002; Feldman and Hana 2010; Tepper and Xia 2010). Learning from a different language (e.g., Bosch et al.…”

Section: Introduction (mentioning)

confidence: 99%

“…Learning from a different language (e.g., Bosch et al. 2008; Cucerzan and Yarowsky 2002; Feldman and Hana 2010), another resource‐light strategy, will be discussed in our forthcoming survey (Feldman and Hana forthcoming).…”

Section: Introduction (mentioning)

confidence: 99%

Resource‐Light Approaches to Computational Morphology Part 1: Monolingual Approaches

Hana

Feldman

2012

Language and Linguistics Compass

This article surveys resource-light monolingual approaches to morphological analysis and tagging. While supervised analyzers and taggers are very accurate, they are extremely expensive to create. Therefore, most of the world's languages and dialects have no realistic prospect for morphological tools created in this way. The weakly-supervised approaches aim to minimize the time, expertise and/or financial cost needed for their development. We discuss the algorithms and their performance considering issues such as accuracy, portability, development time and granularity of the output.
