Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009
DOI: 10.3115/1687878.1687948
Distributional representations for handling sparsity in supervised sequence-labeling

Abstract: Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks. Consequently, they have difficulty estimating parameters for types which appear in the test set, but seldom (or never) appear in the training set. We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words. We incorporate aspects of these representations into the…
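To ground the abstract's idea, the following is a minimal sketch in plain Python (the corpus, vocabulary, and context-window size are illustrative assumptions, not details from the paper) of building a distributional representation of each word type from unannotated text by counting the words that co-occur with it in a small window. Such vectors, or clusters and low-dimensional codes derived from them, can then be attached to every token as extra features in a supervised sequence labeler, so rare words inherit information from distributionally similar frequent words.

```python
from collections import Counter, defaultdict

# Tiny unannotated corpus standing in for a large body of raw text.
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a lazy cat sleeps under the warm sun".split(),
]
WINDOW = 2  # assumed context-window size

# Distributional representation: each word type -> counts of nearby word types.
cooc = defaultdict(Counter)
for sent in corpus:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sent[j]] += 1

# A token-level feature function could expose (a compressed form of) this
# vector, so rare or unseen words that share contexts with frequent words
# end up with similar features.
print(cooc["lazy"].most_common(3))
```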

Cited by 47 publications (54 citation statements)
References 22 publications
“…In addition, the thresholding of these combinatorial features by simple counts effectively suppresses the combinatorial increase of the parameters. At the same time, although global information has also been used in several reports (Nakagawa and Matsumoto, 2006; Huang and Yates, 2009; Turian et al., 2010; Schnabel and Schütze, 2014), the non-linear interactions of these features have not been well investigated, since these features are often dense continuous features and explicit non-linear expansions are counterintuitive and drastically increase the number of model parameters. In our work, we investigate neural networks to represent the non-linearity of global information for POS tagging in a compact way.…”
Section: Introduction (mentioning)
confidence: 99%
“…All of them are continuous dense features, and we use a feed-forward neural network to exploit the non-linearity of these features. Although all of them except (3) have been used for POS tagging in previous work (Nakamura et al., 1990; Schmid, 1994; Schnabel and Schütze, 2014; Huang and Yates, 2009), we propose a neural network approach to capture the non-linear interactions of these features. By feeding these features into neural networks as an input vector, we can expect our tagger to handle not only the non-linearity of N-grams of the same kind of features but also the non-linear interactions among different kinds of features.…”
Section: Introduction (mentioning)
confidence: 99%
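To make the idea in the excerpt above concrete, here is a minimal sketch in PyTorch (the layer sizes, feature dimensions, and tag count are illustrative assumptions, not taken from the cited work) of a feed-forward network that takes a concatenation of dense, continuous per-token features and lets a hidden layer model their non-linear interactions before scoring POS tags.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a token is described by dense, continuous features,
# e.g. a word embedding plus distributional/global-context features.
WORD_DIM, GLOBAL_DIM, HIDDEN_DIM, NUM_TAGS = 100, 50, 64, 45

class FeedForwardTagger(nn.Module):
    """Scores POS tags for one token from a concatenated dense feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WORD_DIM + GLOBAL_DIM, HIDDEN_DIM),
            nn.Tanh(),                       # non-linearity lets features interact
            nn.Linear(HIDDEN_DIM, NUM_TAGS)  # tag scores (softmax applied in the loss)
        )

    def forward(self, word_feats, global_feats):
        # Concatenating the two dense feature blocks is the key step: the hidden
        # layer can then capture interactions *across* the different feature kinds.
        return self.net(torch.cat([word_feats, global_feats], dim=-1))

# Tiny usage example with random vectors standing in for real features.
tagger = FeedForwardTagger()
scores = tagger(torch.randn(8, WORD_DIM), torch.randn(8, GLOBAL_DIM))
print(scores.shape)  # torch.Size([8, 45])
```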
“…Some of them focused on how to use a small amount of labeled data from a target domain in conjunction with a large amount of labeled data from a source domain [8]–[12]. Other works on domain adaptation (DA) focused on adapting their models from the perspective of learning, based on the labeled data sets of the source and target domains [13], [14].…”
Section: Related Research (mentioning)
confidence: 99%
“…At tagging time, a sentence is tagged by the model that is most similar to that sentence. Huang and Yates (2009) train a Conditional Random Field (CRF) tagger with features retrieved from a smoothing model trained using both source and target domain unlabeled data. Adding latent states to the smoothing model further improves the POS tagging accuracy (Huang and Yates, 2012).…”
Section: Related Work (mentioning)
confidence: 99%
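As a rough illustration of the kind of pipeline the excerpt above describes, the sketch below uses the sklearn-crfsuite package to train a CRF tagger whose per-token features include a distributional cluster ID learned from unannotated text; the feature names, cluster lookup, and toy data are assumptions for illustration, not the authors' actual feature set or smoothing model.

```python
import sklearn_crfsuite

# Hypothetical distributional representations learned from unannotated text,
# e.g. a cluster ID per word type produced by a separate smoothing model.
dist_cluster = {"the": "C12", "dog": "C7", "barks": "C3"}

def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        # Distributional feature: shared by rare and frequent words alike,
        # which is what helps when a word type is unseen in training.
        "dist": dist_cluster.get(word.lower(), "C_UNK"),
    }

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

# Toy training data standing in for real annotated sentences.
X_train = [sent_features(["the", "dog", "barks"])]
y_train = [["DT", "NN", "VBZ"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sent_features(["the", "dog", "barks"])]))
```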
“…In such work, a word is represented by the distribution of other words that co-occur with it. Distributional representations of words have been successfully used in many language processing tasks such as entity set expansion (Pantel et al., 2009), part-of-speech (POS) tagging and chunking (Huang and Yates, 2009), ontology learning (Curran, 2005), computing semantic textual similarity (Besançon et al., 1999), and lexical inference (Kotlerman et al., 2012).…”
Section: Introduction (mentioning)
confidence: 99%