Stress Test Evaluation of Biomedical Word Embeddings

Araujo, Vladimir; Carvallo, Andrés; Aspillaga, Carlos; Thorne, Camilo; Parra, Denis

doi:10.18653/v1/2021.bionlp-1.13

Cited by 4 publications

(2 citation statements)

References 27 publications

“…Goodfellow et al (2015) found that when small but intentionally worst‐case perturbations are applied to the input to generate “adversarial examples,” the model can output an incorrect answer with high confidence. Such scenarios are also prevalent in medical AI models in both medical imaging tasks and NLP tasks (Araujo et al, 2020; Ozbulak et al, 2019). In this case, it is difficult to gauge AI's frontier for generalizability and robustness.…”

Section: Key Knowledge Gaps: Enabling Productive Teaming Between AI A... (mentioning)

confidence: 99%

Augmenting physicians with artificial intelligence to transform healthcare: Challenges and opportunities

Agarwal, Dugas, Gao (2023). Economics Manag Strategy.

We reflect on the progress and prospects of artificial intelligence (AI)‐powered transformation in healthcare from the perspective of front‐line clinical professionals responsible for care delivery. While there is considerable optimism about the potential of AI, critical gaps in understanding remain that represent fruitful opportunities for economics and management scholars. We outline the ways in which the strengths of AI can compensate for key limitations of physicians. We then focus on productive use of AI by physicians, highlighting the need for a deeper understanding of human‐AI teaming. We argue that productive teaming requires research on two critical issues: trust in AI and the redesign of clinical workflow to optimally accommodate artificial and human intelligence synergistically.

“…Following this approach, SentEval (Conneau and Kiela, 2018) and DiscoEval (Chen et al, 2019) include tasks at the sentence and discourse level. The third focuses on stress tests (Naik et al, 2018; Aspillaga et al, 2020; Araujo et al, 2021a) that seek to assess the ability of language models to adapt to cases designed to confuse them. The fourth objective is an evaluation from a linguistic perspective (Warstadt et al, 2019; Ettinger, 2020; Puccetti et al, 2021) to elucidate the models' actual linguistic capacities or knowledge.…”

Section: Language Model Evaluations (mentioning)

confidence: 99%

Evaluation Benchmarks for Spanish Sentence Representations

Araujo, Carvallo, Kundu et al. (2022). Preprint.

Self Cite

Due to the success of pre-trained language models, versions of languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include considerable pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for the case of discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only with documents in Spanish. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.