Many natural language researchers are currently turning their attention to treebank development, trying to achieve both accuracy and coverage of corpus data in their representation formats. This paper presents a data-driven annotation schema developed for an Italian treebank, designed to ensure data coverage and consistency in the annotation of linguistic phenomena. The schema is a dependency-based format centered on the notion of predicate-argument structure, augmented with traces to represent discontinuous constituents. The treebank is developed through an annotation process performed by a human annotator assisted by an interactive parsing tool that incrementally builds a syntactic representation of the sentence. To increase the syntactic knowledge of this parser, a specific data-driven strategy has been applied. We describe the cyclical development of the annotation schema, highlighting the richness and flexibility of the format, and we present some representational issues.
Vector semantics has gradually become a key tool for Natural Language Processing, especially for text analysis. This kind of vector representation is usually encoded through embeddings, which can capture semantic information at different levels of granularity. Over the years, embedding models have been developed not only for words but also for sentences and documents. In this work we address sentence embeddings, in particular the non-parametric ones, which offer a good trade-off between performance and inference speed. We present the Static Fuzzy Bag-of-Words (SFBoW) model, a refinement of the Fuzzy Bag-of-Words approach that yields fixed-dimension sentence embeddings. We targeted fixed-size embeddings to promote caching and re-usability, speeding up inference in systems that rely on our model. In this paper we explore various approaches for constructing a static universe matrix, which is fundamental to making the sentence embeddings a fixed size. To show the validity of our approach, we benchmarked our model on a semantic similarity task, obtaining competitive performance.
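To illustrate why a static universe matrix yields fixed-size sentence embeddings, here is a minimal Python sketch. It is not the authors' implementation. It assumes the common fuzzy bag-of-words formulation, in which each sentence dimension is the maximum (clipped) cosine similarity between any word vector in the sentence and one row of the universe matrix, and it assumes sentences are compared with a fuzzy Jaccard index. The names `UNIVERSE`, `sentence_embedding`, and `fuzzy_jaccard`, and the toy two-dimensional vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_embedding(word_vectors, universe):
    """Max-pool clipped similarities of every word against every universe row.

    The output length equals the number of universe rows, so it does not
    depend on how many words the sentence contains.
    """
    return [
        max((max(0.0, cosine(w, u)) for w in word_vectors), default=0.0)
        for u in universe
    ]


def fuzzy_jaccard(a, b):
    """Fuzzy Jaccard index: sum of element-wise minima over sum of maxima."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0


# Hypothetical static universe with three rows (assumption: fixed in advance,
# shared by all sentences, so embeddings can be cached and reused).
UNIVERSE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

s1 = sentence_embedding([[1.0, 0.0], [0.5, 0.5]], UNIVERSE)  # two-word sentence
s2 = sentence_embedding([[0.0, 1.0]], UNIVERSE)              # one-word sentence
similarity = fuzzy_jaccard(s1, s2)
```

Both `s1` and `s2` have three components even though the sentences differ in length, which is what makes precomputing and caching the embeddings practical under these assumptions.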