Evaluation of performance of generative large language models for stroke care
Journal
NPJ Digital Medicine
Journal Volume
8
Journal Issue
1
Start Page
481
ISSN
2398-6352
Date Issued
2025-07-29
Author(s)
Li, Vincent Cheng-Sheng
Wu, Jia-Jyun
Chen, Hsiao-Hui
Su, Sophia Sin-Yu
Chang, Brian Pin-Hsuan
Lai, Richard Lee
Liu, Chi-Hung
Chen, Chung-Ting
Tanapima, Valis
Shen, Toby Kai-Bo
Atun, Rifat
Abstract
Stroke is a leading cause of global morbidity and mortality, disproportionately impacting lower socioeconomic groups. In this study, we evaluated three generative LLMs (GPT, Claude, and Gemini) across four stages of stroke care: prevention, diagnosis, treatment, and rehabilitation. We applied each of three prompt engineering techniques, Zero-Shot Learning (ZSL), Chain of Thought (COT), and Talking Out Your Thoughts (TOT), to realistic stroke scenarios. Clinical experts assessed the outputs across five domains: (1) accuracy; (2) hallucinations; (3) specificity; (4) empathy; and (5) actionability, based on clinical competency benchmarks. Overall, the LLMs demonstrated suboptimal performance, with inconsistent scores across domains. Each prompt engineering method showed strengths in specific areas: TOT performed well in empathy and actionability, COT was strong in structured reasoning during diagnosis, and ZSL provided concise, accurate responses with fewer hallucinations, especially in the treatment stage. However, none consistently met high clinical standards across all stroke care stages.
Publisher
Springer Science and Business Media LLC
Type
journal article
