Beyond Human Understanding: Benchmarking Language Models for Polish Cariology Expertise

Wojcik, Simona; Rulkiewicz, Anna; Pruszczyk, Piotr; Lisik, Wojciech; Poboży, Marcin; Pilchowska, Iwona; Domienik-Karlowicz, Justyna

doi:10.20944/preprints202309.1100.v1

Cited by 1 publication

(2 citation statements)

References 0 publications

“…Questions, along with their multiple-choice answers, were presented to the model followed by the instruction, 'Give the number of the best answer. Start your response with "The answer is:"' The goal of this approach was to have the LLM respond with just the multiple-choice answer (1)(2)(3)(4)(5) and not provide a lengthy (costly) explanation.…”

Section: AI Prompting Methodology (mentioning)

confidence: 99%
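
The prompting approach quoted above can be sketched roughly as follows. This is a minimal illustration, not the citing authors' code: only the instruction string is taken from the quoted passage, while the function names `build_prompt` and `parse_answer`, the "1. choice" numbering layout, and the parsing rule are assumptions made for the example.

```python
import re

# Instruction text as quoted in the citation statement.
INSTRUCTION = 'Give the number of the best answer. Start your response with "The answer is:"'


def build_prompt(question, choices):
    """Join a question, its numbered choices, and the instruction into one prompt.

    The "1. choice" layout is an assumption; the source does not show the format.
    """
    lines = [question]
    for number, choice in enumerate(choices, start=1):
        lines.append(f"{number}. {choice}")
    lines.append(INSTRUCTION)
    return "\n".join(lines)


def parse_answer(response, n_choices):
    """Return the chosen answer number, or None if the reply is not in the requested format.

    Hypothetical parsing rule: the reply must begin with "The answer is:"
    followed by a number between 1 and n_choices (optionally in parentheses).
    """
    match = re.match(r"\s*The answer is:\s*\(?(\d+)\)?", response)
    if match is None:
        return None
    number = int(match.group(1))
    return number if 1 <= number <= n_choices else None
```

Returning `None` for a malformed reply keeps formatting failures separate from wrong answers, which matters when answer formatting is reported alongside accuracy.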

“…One prominent illustration of this is the Generative Pre-Trained Transformer (GPT), released by Open AI in 2018 [1]. GPT 4.0 has proven remarkable ability in assessing knowledge in specialised domains such as medicine, law, and business [2][3][4]-areas that have historically been the exclusive purview of professionals. Particularly noteworthy is its exceptional performance on assessments like the Korean general surgery board exam, the United States Medical Licensing Exam, and the Wharton MBA final exam, each achieved without the finetuning of the pretrained model [5][6][7].…”

Section: Introduction (mentioning)

confidence: 99%

Artificial intelligence model GPT4 narrowly fails simulated radiological protection exam

Roemer, Li, Mahmood et al. 2024

J. Radiol. Prot.

This study assesses the efficacy of Generative Pre-Trained Transformers (GPT) published by OpenAI in the specialized domains of radiological protection and health physics. Utilizing a set of 1064 surrogate questions designed to mimic a health physics certification exam, we evaluated the models' ability to accurately respond to questions across five knowledge domains. Our results indicated that neither model met the 67% passing threshold, with GPT-3.5 achieving a 45.3% weighted average and GPT-4 attaining 61.7%. Despite GPT-4's significant parameter increase and multimodal capabilities, it demonstrated superior performance in all categories yet still fell short of a passing score. The study's methodology involved a simple, standardized prompting strategy without employing prompt engineering or in-context learning, which are known to potentially enhance performance. The analysis revealed that GPT-3.5 formatted answers more correctly, despite GPT-4's higher overall accuracy. The findings suggest that while GPT-3.5 and GPT-4 show promise in handling domain-specific content, their application in the field of radiological protection should be approached with caution, emphasizing the need for human oversight and verification.
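
The scoring described in the abstract (a weighted average over knowledge domains compared against a 67% passing threshold) might be computed along these lines. This is only a sketch: the abstract does not give the domain names, the weights, or the per-domain scores, so the function names, the dictionary layout, and the "greater than or equal" passing rule are all assumptions.

```python
# Passing threshold as stated in the abstract (67%).
PASSING_THRESHOLD = 0.67


def weighted_average(domain_scores, domain_weights):
    """Weighted mean of per-domain accuracies.

    Both arguments are dicts keyed by domain name (hypothetical layout);
    the weights do not need to sum to 1 because the result is normalized.
    """
    total_weight = sum(domain_weights[d] for d in domain_scores)
    weighted_sum = sum(domain_scores[d] * domain_weights[d] for d in domain_scores)
    return weighted_sum / total_weight


def passes(score, threshold=PASSING_THRESHOLD):
    """Assumed rule: a score at or above the threshold counts as a pass."""
    return score >= threshold
```

Under this assumed rule, the reported weighted averages of 45.3% (GPT-3.5) and 61.7% (GPT-4) both fall below the threshold, consistent with the abstract's conclusion that neither model passed.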
