Magali Sanches Duran scite author profile

Abstract. Levin-style classes which capture the shared syntax and semantics of verbs have proven useful for many Natural Language Processing (NLP) tasks and applications. However, lexical resources which provide information about such classes are only available for a handful of the world's languages. Because manual development of such resources is extremely time consuming and cannot reliably capture domain variation in classification, methods for automatic induction of verb classes from texts have gained popularity. However, to date such methods have been applied to English and a handful of other, mainly resource-rich languages. In this paper, we apply the methods to Brazilian Portuguese, a language for which no VerbNet or automatic class induction work exists yet. Since Levin-style classification is said to have a strong cross-linguistic component, we use unsupervised clustering techniques similar to those developed for English without language-specific feature engineering. This yields interesting results which line up well with those obtained for other languages, demonstrating the cross-linguistic nature of this type of classification. However, we also discover and discuss issues which require specific consideration when aiming to optimise the performance of verb clustering for Brazilian Portuguese and other less-resourced languages.

A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

Santos

Hartmann

et al. 2017

Using Cross-Linguistic Knowledge to Build VerbNet-Style Lexicons: Results for a (Brazilian) Portuguese VerbNet

Scarton

Aluísio

2014

Automatic Generation of a Lexical Resource to support Semantic Role Labeling in Portuguese

Duran

Aluísio

2015

This paper reports an approach to automatically generate a lexical resource to support incremental semantic role labeling annotation in Portuguese. The data come from the corpus Propbank-Br (Propbank of Brazilian Portuguese) and from the lexical resource of English Propbank, as both share the same structure. In order to enable the strategy, we added extra annotation to Propbank-Br. This approach is part of a previous decision to invert the process of implementing a Propbank project, by first annotating a core corpus and only then generating a lexical resource to enable further annotation tasks. The reasoning behind such inversion is to explore the task empirically before distributing the annotation task and to provide simultaneously: 1) a first training corpus for SRL in Brazilian Portuguese and 2) annotated examples to compose a lexical resource to support SRL. The main contribution of this paper is to point out to what extent linguistic effort may be reduced, thereby speeding up the construction of a lexical resource to support SRL for less resourced languages. The corpus Propbank-Br, with the extra annotation described herein, is publicly available.

A Normalizer for UGC in Brazilian Portuguese

Duran

Nunes

Avanço

2015

User-generated content (UGC) represents an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing tools and techniques are developed from and for texts in standard language, and UGC is a type of text especially full of creativity and idiosyncrasies, which represent noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It encompasses a tokenizer, a sentence segmentation tool, a phonetic-based speller and some lexicons, which originated from a deep analysis of a corpus of product reviews in Brazilian Portuguese. The normalizer was evaluated on two different data sets and carried out from 31% to 89% of the appropriate corrections, depending on the type of text noise. The use of UGCNormal was also validated in a POS tagging task, where accuracy improved from 91.35% to 93.15%, and in an opinion classification task, where the average of the F1-score measures (F1-score positive and F1-score negative) improved from 0.736 to 0.758.

Lexicografia pedagógica: atores e interfaces

Xatara

2007

DELTA

Dictionaries for foreign-language learners have appeared over the last thirty years and fill a need for pedagogical works that lexicographers long ignored. This is a promising market, as it offers several possibilities for innovation. To better understand pedagogical lexicography, this article examines it from two perspectives: its interfaces with other areas of linguistics, and its actors, the people whose work influences the production of such dictionaries. The exposition aims to arouse Brazilian researchers' interest in pedagogical lexicography and, indirectly, to promote national production of pedagogical dictionaries.

Keywords: pedagogical lexicography; bilingual lexicography; pedagogical bilingual dictionary.

Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese

Pardo

Lopes

et al. 2021

This paper presents the project of a large multi-genre treebank for Brazilian Portuguese, called Porttinari. We address relevant research questions in its construction and annotation, reporting the work already done. The treebank follows the widely adopted international "Universal Dependencies" model and should serve as the basis for developing state-of-the-art tagging and parsing systems for Portuguese, as well as for conducting linguistic studies on the morphosyntax and syntax of this language.

Some Issues on the Normalization of a Corpus of Products Reviews in Portuguese

Avanço

Aluísio

et al. 2014