Proceedings of the 20th Workshop on Biomedical Language Processing 2021
DOI: 10.18653/v1/2021.bionlp-1.13
Stress Test Evaluation of Biomedical Word Embeddings

Abstract: The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe "stress" scenarios. In this work, we systematically evaluate three language models with adversarial examples: automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the b…

Cited by 4 publications (2 citation statements)
References 27 publications
“…Goodfellow et al. (2015) found that when small but intentionally worst-case perturbations are applied to the input to generate “adversarial examples,” the model can output an incorrect answer with high confidence. Such scenarios are also prevalent in medical AI models in both medical imaging and NLP tasks (Araujo et al., 2020; Ozbulak et al., 2019). In such cases, it is difficult to gauge AI's frontier of generalizability and robustness.…”
Section: Key Knowledge Gaps: Enabling Productive Teaming Between AI a...
confidence: 99%
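The worst-case perturbation described above can be illustrated with the fast gradient sign method from Goodfellow et al. (2015): take a step of size epsilon in the direction of the sign of the loss gradient with respect to the input. The sketch below uses a toy logistic-regression model with hand-picked weights (all values here are illustrative assumptions, not from the paper), so the gradient has a simple closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y, epsilon=0.1):
    """FGSM-style perturbation: x + epsilon * sign(d loss / d x).

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - y) * w, where p is the model's
    predicted probability for the positive class.
    """
    p = sigmoid(np.dot(w, x) + b)        # model prediction in (0, 1)
    grad_x = (p - y) * w                 # closed-form input gradient
    return x + epsilon * np.sign(grad_x) # worst-case step of size epsilon

# Toy example (weights and input are assumptions for illustration)
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.5, 0.5, 0.5])
y = 1.0  # true label

x_adv = fgsm_perturb(x, w, b, y)
print(x_adv)  # each feature nudged by +/- epsilon against the true label
```

Each coordinate moves by exactly epsilon in whichever direction increases the loss, which is why the perturbation is small per feature yet deliberately adversarial in aggregate.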
“…Following this approach, SentEval (Conneau and Kiela, 2018) and DiscoEval (Chen et al., 2019) include tasks at the sentence and discourse level. The third focuses on stress tests (Naik et al., 2018; Aspillaga et al., 2020; Araujo et al., 2021a) that seek to assess the ability of language models to adapt to cases designed to confuse them. The fourth focuses on evaluation from a linguistic perspective (Warstadt et al., 2019; Ettinger, 2020; Puccetti et al., 2021) to elucidate the models' actual linguistic capacities or knowledge.…”
Section: Language Model Evaluations
confidence: 99%