Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields

Constant, Matthieu; Roux, Joseph Le; Sigogne, Anthony

doi:10.1145/2483969.2483970

Cited by 13 publications

(7 citation statements)

References 18 publications


“…In sequence tagging MWEI methods, such resources can be used as sources of lexical features (Schneider et al, 2014). In parsing-based approaches they may serve as a basis for word-lattice representation of an input sentence, in which the compositional vs. MWE interpretation of a word sequence is represented jointly (Constant et al, 2013). The impact of lexical resources on MWEI is explicitly addressed by Riedl and Biemann (2016).…”

Section: MWE Lexicons in MWE Identification (mentioning)

confidence: 99%
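
The word-lattice idea mentioned in the statement above can be illustrated with a minimal sketch. It is not the implementation of Constant et al. (2013). The toy lexicon, the example sentence, and the helper names `build_lattice` and `paths` are assumptions made purely for illustration. Each token contributes a single-word edge, and each lexicon match contributes an additional edge spanning the whole compound, so both readings coexist in one structure.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]  # (start position, end position, label)

def build_lattice(tokens: List[str], lexicon: Set[Tuple[str, ...]]) -> List[Edge]:
    """Build lattice edges: one per token, plus one per lexicon match."""
    # Compositional reading: every token is its own edge.
    edges: List[Edge] = [(i, i + 1, tok) for i, tok in enumerate(tokens)]
    n = len(tokens)
    # MWE reading: add a single edge covering each multiword lexicon entry.
    for i in range(n):
        for j in range(i + 2, n + 1):
            if tuple(tokens[i:j]) in lexicon:
                edges.append((i, j, "_".join(tokens[i:j])))
    return edges

def paths(edges: List[Edge], n: int, start: int = 0) -> List[List[str]]:
    """Enumerate every segmentation (path from position `start` to n)."""
    if start == n:
        return [[]]
    result: List[List[str]] = []
    for s, e, label in edges:
        if s == start:
            for rest in paths(edges, n, e):
                result.append([label] + rest)
    return result

# Hypothetical example: a French compound with a toy one-entry lexicon.
tokens = ["pomme", "de", "terre", "cuite"]
lexicon = {("pomme", "de", "terre")}
lattice = build_lattice(tokens, lexicon)
segmentations = paths(lattice, len(tokens))
```

Here `segmentations` holds two paths, one treating the words separately and one treating `pomme_de_terre` as a single unit. A downstream parser would then choose among such paths.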

Without lexicons, multiword expression identification will never fly: A position statement

Savary, Cordeiro, Ramisch

2019

Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)


Because most multiword expressions (MWEs), especially verbal ones, are semantically non-compositional, their automatic identification in running text is a prerequisite for semantically-oriented downstream applications. However, recent developments, driven notably by the PARSEME shared task on automatic identification of verbal MWEs, show that this task is harder than related tasks, despite recent contributions both in multilingual corpus annotation and in computational models. In this paper, we analyse possible reasons for this state of affairs. They lie in the nature of the MWE phenomenon, as well as in its distributional properties. We also offer a comparative analysis of the state-of-the-art systems, which exhibit particularly strong sensitivity to unseen data. On this basis, we claim that, in order to make strong headway in MWE identification, the community should bend its mind into coupling identification of MWEs with their discovery, via syntactic MWE lexicons. Such lexicons need not necessarily achieve a linguistically complete modelling of MWEs' behavior, but they should provide minimal morphosyntactic information to cover some potential uses, so as to complement existing MWE-annotated corpora. We define requirements for such a minimal NLP-oriented lexicon, and we propose a roadmap for the MWE community driven by these requirements.


“…For a more complete survey on phraseology discovery, the different proposed methods and their performances, we refer to Evert (2004); Pecina (2008); Manning and Schütze (1999); McKeown and Radev (1999); Baldwin and Kim (2010); Seretan (2011); Ramisch (2015). In addition to monolingual discovery, other tasks have also been investigated in computational linguistics, such as bilingual phraseology discovery (Ha et al, 2008; Morin and Daille, 2010; Weller and Heid, 2012; Rivera et al, 2013), automatic interpretation and disambiguation of multiword expressions (Fazly et al, 2009) and their integration into applications such as parsing (Constant et al, 2013) and machine translation (Carpuat and Diab, 2010). For further reading, we recommend the proceedings of the annual workshop on multiword expressions (Markantonatou et al, 2017), as well as journal special issues on the topic (Villavicencio et al, 2005; Rayson et al, 2010; Bond et al, 2013; Ramisch et al, 2013).…”

Section: Computational Phraseology Discovery (mentioning)

confidence: 99%

Computational phraseology discovery in corpora with the MWETOOLKIT

Ramisch

2020

IVITRA Research in Linguistics and Literature


Computer tools can help discover new phraseological units in corpora, thanks to their ability to quickly draw statistics from large amounts of textual data. While the research community has focused on developing and evaluating original algorithms for the automatic discovery of phraseological units, little has been done to transform these sophisticated methods into usable software. In this chapter, we present a brief survey of the main approaches to computational phraseology available. Furthermore, we provide worked-out examples of how to apply these methods using the mwetoolkit, free software for the discovery and identification of multiword expressions. The usefulness of the automatically extracted units depends on various factors such as language, corpus size, target units, and available taggers and parsers. Nonetheless, the mwetoolkit allows fine-grained tuning so that this variability is taken into account, adapting the tool to the specificities of each lexicographic environment.


“…Rule-based matching, supervised classification, sequence tagging, and parsing are among the most popular models for MWE identification (Constant et al, 2017). Parsing-based methods take the (recursive) structure of language into account, trying to identify MWEs as a by-product of parsing (Constant et al, 2013), or jointly (Constant and Nivre, 2016). Sequence tagging models, on the other hand, consider only linear context, using models such as CRFs (Vincze et al, 2011; Shigeto et al, 2013; Riedl and Biemann, 2016) and averaged perceptron (Schneider et al, 2014) combined with some variant of begin-inside-outside (BIO) encoding (Ramshaw and Marcus, 1995).…”

Section: Related Work (mentioning)

confidence: 99%
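
The begin-inside-outside (BIO) encoding named in the statement above can be sketched as follows. This is a minimal illustration of the plain BIO scheme for contiguous spans, not the encoding variant used by any of the cited systems. The function name `bio_encode`, the span format, and the example sentence are assumptions.

```python
from typing import List, Tuple

def bio_encode(tokens: List[str], mwe_spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token B (begins an MWE), I (inside one), or O (outside).

    mwe_spans holds (start, end) token indices with end exclusive; spans
    are assumed to be contiguous and non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end in mwe_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

# Hypothetical example: "kicked the bucket" annotated as one MWE.
tokens = ["He", "kicked", "the", "bucket", "yesterday"]
tags = bio_encode(tokens, [(1, 4)])
```

A sequence tagger such as a CRF would be trained to predict these per-token labels from linear context. Discontinuous MWEs need an extended tag set, which this sketch does not cover.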

“…For many years, MWE identification was considered unrealistic, with most MWE research focusing on out-of-context MWE discovery (Ramisch et al, 2013). Indeed, the availability of MWE-annotated corpora was limited to some treebanks with partial annotations, often a by-product of syntax trees (Constant et al, 2013). This prevented the widespread development and evaluation of MWE identification systems, as compared to other tasks such as POS tagging and named entity recognition.…”

Section: Introduction (mentioning)

confidence: 99%

The Impact of Word Representations on Sequential Neural MWE Identification

Zampieri, Ramisch, Damnati

2019

Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)


Recent initiatives such as the PARSEME shared task have allowed the rapid development of MWE identification systems. Many of those are based on recent NLP advances, using neural sequence models that take continuous word representations as input. We study two related questions in neural verbal MWE identification: (a) the use of lemmas and/or surface forms as input features, and (b) the use of word-based or character-based embeddings to represent them. Our experiments on Basque, French, and Polish show that character-based representations yield systematically better results than word-based ones. In some cases, character-based representations of surface forms can be used as a proxy for lemmas, depending on the morphological complexity of the language.

