Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics (EACL '03), 2003
DOI: 10.3115/1067807.1067851

Bootstrapping statistical parsers from small datasets

Abstract: We present a practical co-training method for bootstrapping statistical parsers using a small amount of manually parsed training material and a much larger pool of raw sentences. Experimental results show that unlabelled sentences can be used to improve the performance of statistical parsers. In addition, we consider the problem of bootstrapping parsers when the manually parsed training material is in a different domain to either the raw sentences or the testing material. We show that bootstrapping continues t…
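The co-training loop summarised in the abstract can be sketched in a few lines. The sketch below is a minimal illustration assuming two hypothetical parser objects with train/parse/score methods; the names, the cache size, and the top-half selection heuristic are illustrative placeholders, not the authors' actual parsers or selection method.

```python
# Minimal co-training sketch, following the abstract's setup: a small
# labelled seed set plus a large pool of raw sentences. parser_a/parser_b
# and their train/parse/score methods are hypothetical stand-ins.

def co_train(parser_a, parser_b, labelled, unlabelled, rounds=10, cache_size=30):
    train_a, train_b = list(labelled), list(labelled)
    for _ in range(rounds):
        parser_a.train(train_a)
        parser_b.train(train_b)
        # Draw a small cache of raw sentences from the unlabelled pool.
        cache = [unlabelled.pop() for _ in range(min(cache_size, len(unlabelled)))]
        # Each parser's most confident parses become new training
        # material for the *other* parser.
        by_a = sorted(cache, key=parser_a.score, reverse=True)
        by_b = sorted(cache, key=parser_b.score, reverse=True)
        train_b.extend(parser_a.parse(s) for s in by_a[:cache_size // 2])
        train_a.extend(parser_b.parse(s) for s in by_b[:cache_size // 2])
    return parser_a, parser_b
```

The cross-labelling step is the essential design choice: each parser only ever trains on parses produced by the other, so the two models act as (approximately) independent views of the same sentences.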

Cited by 93 publications (77 citation statements)
References 9 publications
“…However, although co-training has been used in many domains such as statistical parsing and noun phrase identification [22], [29], [33], [38], in most scenarios the requirement of sufficient and redundant views, or even the requirement of sufficient redundancy, could not be met. Therefore, researchers attempt to develop variants of the co-training algorithm for relaxing such a requirement.…”
Section: Semi-supervised Learning
confidence: 99%
“…This algorithm employs two regressors each of which labels the unlabeled data for the other during the learning process. In order to choose appropriate unlabeled examples to label, COREG estimates the labeling confidence by consulting the influence of the labeling of unlabeled examples on the labeled examples. The final prediction is made by combining the regression estimates generated by both regressors.…”
Section: Introduction
confidence: 99%
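The confidence estimate described in the COREG quotation above can be sketched with scikit-learn's kNN regressor: a pseudo-label is trusted to the extent that adding it reduces error on the labelled neighbours. The function name, parameters, and the full-refit shortcut below are illustrative simplifications, not Zhou and Li's original implementation.

```python
# Sketch of a COREG-style labelling confidence: how much does adding the
# pseudo-labelled point (x_new, y_new) reduce squared error on the labelled
# neighbours of x_new? `model` is assumed to be a fitted KNeighborsRegressor.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def labeling_confidence(model, X_lab, y_lab, x_new, y_new, k=3):
    # Labelled neighbours of the candidate unlabelled point.
    idx = model.kneighbors([x_new], n_neighbors=k, return_distance=False)[0]
    before = (y_lab[idx] - model.predict(X_lab[idx])) ** 2
    # Retrain with the pseudo-labelled point included and re-score.
    refit = KNeighborsRegressor(n_neighbors=k).fit(
        np.vstack([X_lab, x_new]), np.append(y_lab, y_new))
    after = (y_lab[idx] - refit.predict(X_lab[idx])) ** 2
    # Positive value: the pseudo-label helps on the labelled data.
    return float(np.sum(before - after))
```

In COREG proper, the two regressors differ in their distance metrics and label the most confidently scored examples for each other, mirroring the two-view structure of co-training described in the other citation statements here.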
“…One actively researched approach to this problem is to develop weakly supervised algorithms that require less training data, such as active learning (Hermjakob and Mooney 1997; Tang et al. 2002; Baldridge and Osborne 2003; Hwa 2004) and co-training (Sarkar 2001; Steedman et al. 2003). In this article, we explore an alternative: using parallel text as a means for transferring syntactic knowledge from a resource-rich language to a language with fewer resources.…”
Section: Introduction
confidence: 99%
“…Co-training (Blum and Mitchell, 1998), and several variants of co-training, have been applied to a number of NLP problems, including word sense disambiguation (Yarowsky, 1995), named entity recognition (Collins and Singer, 1999), noun phrase bracketing (Pierce and Cardie, 2001) and statistical parsing (Sarkar, 2001; Steedman et al., 2003). In each case, co-training was used successfully to bootstrap a model from only a small amount of labelled data and a much larger pool of unlabelled data.…”
Section: Introduction
confidence: 99%