2020
DOI: 10.3390/info11100484

Benchmarking Natural Language Inference and Semantic Textual Similarity for Portuguese

Abstract: Two sentences can be related in many different ways, and distinct tasks in natural language processing aim to identify different semantic relations between them. We developed several models for natural language inference and semantic textual similarity for the Portuguese language. We took advantage of pre-trained models (BERT) and additionally studied the role of lexical features. We tested our models on several datasets (ASSIN, SICK-BR and ASSIN2), and the best results were usually achieved with ptBERT-Large, …
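The abstract describes scoring Portuguese sentence pairs for semantic textual similarity with a fine-tuned BERT model. The sketch below shows that setup in outline, assuming the Hugging Face transformers API and the publicly available BERTimbau checkpoint as a stand-in for the ptBERT-Large model the paper mentions; the regression head here is untrained, so meaningful scores would only appear after fine-tuning on a dataset such as ASSIN or ASSIN2.

```python
# Minimal sketch: scoring a Portuguese sentence pair for STS with a BERT cross-encoder.
# Assumption: neuralmind/bert-base-portuguese-cased (BERTimbau) is only a placeholder
# for the ptBERT-Large model named in the abstract.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "neuralmind/bert-base-portuguese-cased"  # assumed checkpoint, not the paper's
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 gives a single-output regression head, the usual setup for STS scores
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

sentence_a = "Um homem está tocando violão."
sentence_b = "Uma pessoa toca um instrumento musical."

# BERT-style cross-encoders read the pair jointly, separated by a [SEP] token
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

# The head is randomly initialised here; real scores require fine-tuning on STS data
print(f"Predicted similarity (illustrative only): {score:.3f}")
```

The same pattern covers natural language inference by setting num_labels to the number of entailment classes and training with a classification loss instead of regression.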

Cited by 6 publications (13 citation statements)
References 23 publications
“…For this purpose, TextBenDS proposes a tweet-based data model and two types of workloads, namely Top-K keywords and Top-K documents operations. Other purely textual benchmarks focus on language analysis tasks, e.g., Chinese [25] and Portuguese [5] text recognition, respectively.…”
Section: Textual Benchmarks (mentioning)
confidence: 99%
“…Thus, we provide instead a script that extracts a user-defined number of documents. This script and a usage guide are available online for reuse⁵. Amongst all available documents in HAL, we restrict ourselves to scientific articles whose length is homogeneous, which amounts to 50,000 documents.…”
Section: Data Extraction (mentioning)
confidence: 99%
“…The models developed during ASSIN 2 used more recent NLP approaches, including contextual embeddings such as BERT [7]. Works addressing these datasets have since been proposed continuously, with the state of the art being a BERT model pre-trained in Portuguese and fine-tuned for STS [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…The goal of this work is to evaluate contextual embeddings generated by SBERT models for STS in Portuguese, which we investigate in two stages. First, we compare the performance of pre-trained SBERT models with the state-of-the-art BERT models for the ASSIN datasets [8]. In addition, we include other baseline models, such as the best-performing works assessed in the workshops and other multilingual contextual embeddings.…”
Section: Introduction (mentioning)
confidence: 99%
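The SBERT approach this statement refers to is a bi-encoder: each sentence is embedded independently and similarity is the cosine of the two vectors, in contrast to the cross-encoder setup sketched above. A minimal sketch follows, assuming the sentence-transformers library; the multilingual checkpoint named below is an illustrative placeholder, not necessarily one evaluated in the citing work.

```python
# Hedged sketch of SBERT-style STS: embed each sentence independently, compare by cosine.
# Assumption: paraphrase-multilingual-MiniLM-L12-v2 is only a placeholder checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

pairs = [
    ("O gato está dormindo no sofá.", "Um gato dorme em cima do sofá."),
    ("Ela comprou um carro novo.", "O tempo está chuvoso hoje."),
]

for sent_a, sent_b in pairs:
    # encode() returns one embedding per sentence; convert_to_tensor keeps them as tensors
    emb_a, emb_b = model.encode([sent_a, sent_b], convert_to_tensor=True)
    cosine = util.cos_sim(emb_a, emb_b).item()
    print(f"{cosine:.3f}  {sent_a!r} <-> {sent_b!r}")
```

Because the two sentences are encoded separately, embeddings can be precomputed and cached, which is what makes the bi-encoder setup attractive for large-scale comparisons against cross-encoder baselines.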
“…Results from intrinsic evaluations of word embeddings indicate that the vectors can be related in different ways, sometimes by topic ("mouse" and "teclado", i.e., keyboard), sometimes by proximity of use ("leite" and "condensado", as in condensed milk), or show no apparent semantic relation, which leaves open the question of whether the models that generate these representations reflect any kind of semantic relation in a systematic way. Results from intrinsic and extrinsic evaluations also indicate that there are significant differences between the results obtained from different datasets, as well as from different vector-generation models (ANTONIAK; MIMNO, 2018; SINOARA; ROSSI; REZENDE, 2016; SCHNABEL et al., 2015; FIALHO; COHEUR; QUARESMA, 2020).…”
Section: Contexto e Motivação (unclassified)