Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments

Brendin R Beaulieu-Jones; Sahaj Shah; Margaret T Berrigan; Jayson S Marwaha; SHUO-LUN LAI; Gabriel A Brat

doi:10.1101/2023.07.16.23292743

Date Issued

2023-07-19

Author(s)

Brendin R Beaulieu-Jones

Sahaj Shah

Margaret T Berrigan

Jayson S Marwaha

SHUO-LUN LAI

Gabriel A Brat

DOI

10.1101/2023.07.16.23292743

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/723209

Abstract

Background: Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions.

Methods: We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and assessed the stability of performance on repeat queries.

Results: A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly; the response accuracy changed for 6/16 questions.

Conclusion: Consistent with prior findings, we demonstrate robust, near- or above-human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate substantial inconsistency in ChatGPT responses on repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Subjects

artificial intelligence

ChatGPT

language models

surgery

surgical education

SDGs

SDG4

Publisher

Cold Spring Harbor Laboratory

Type

other
