Proceedings of the 11th International Conference on Parsing Technologies - IWPT '09 2009
DOI: 10.3115/1697236.1697263

Improving generative statistical parsing with semi-supervised word clustering

Marie Candito, Benoît Crabbé. Association for Computational Linguistics.

Abstract: We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments of word clustering prior to parsing. We use a combination of lexicon-aided morphological clustering that preserves tagging ambiguity, and unsupervised word clustering, …

Cited by 21 publications (22 citation statements)
References 8 publications
“…There are several categories of related approaches, including those that learn a single embedding for unseen words (Søgaard and Johannsen, 2012; Chen and Manning, 2014; Collobert et al, 2011), those that use character-level information (Luong et al, 2013; Botha and Blunsom, 2014), those using morphological and n-gram information (Candito and Crabbé, 2009; Habash, 2009; Marton et al, 2010; Seddah et al, 2010; Attia et al, 2010; Bansal and Klein, 2011; Keller and Lapata, 2003), and hybrid approaches (Jean et al, 2015; Luong et al, 2015; Chitnis and DeNero, 2015). The representation for the unknown token is either learned specifically or computed from a selection of rare words, for example by averaging their embedding vectors.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Ghayoomi (2012) and Ghayoomi et al (2014) created clusters using word and POS information to resolve homograph issues in Persian and Bulgarian respectively, significantly improving results for lexicalized word-based parsing. Candito and Crabbé (2009) clustered on desinflected words, removing unnecessary inflection markers using an external lexicon, subsequently combining this form with additional features. This improved results for unlexicalized PCFG-LA parsing for both medium and higher frequency words, but was comparable to clustering the lemma with its predicted POS tag.…”
Section: Word Clustering
Citation type: mentioning
confidence: 99%
“…The French Social Media Bank developed by Seddah et al (2012) is a treebank of 1,700 French sentences from various types of social media including Facebook, Twitter and discussion forums (video game and medical). An extended version of the FTB-UC annotation guidelines (Candito and Crabbé, 2009) is employed during annotation and subcorpora containing particularly noisy utterances are identified.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…The English annotators were guided by the Penn Treebank bracketing guidelines and a Foreebank-adapted version of the English Web Treebank bracketing guidelines. The French annotators used the French treebank (FTB) (Abeillé et al, 2003) guidelines, following the SPMRL strategy for multiword expressions (Seddah et al, 2013; Candito and Crabbé, 2009). The two primary annotators, one for French and one for English, annotated all the data for their language.…”
Section: Building the Foreebank
Citation type: mentioning
confidence: 99%