Title: Evaluation of performance of generative large language models for stroke care
Authors: Lee, John Tayu; Li, Vincent Cheng-Sheng; Wu, Jia-Jyun; Chen, Hsiao-Hui; Su, Sophia Sin-Yu; Chang, Brian Pin-Hsuan; Lai, Richard Lee; Liu, Chi-Hung; Chen, Chung-Ting; Tanapima, Valis; Shen, Toby Kai-Bo; Atun, Rifat
Dates: 2025-12-30; 2025-07-29
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/734830
Language: en
Type: journal article
DOI: 10.1038/s41746-025-01830-9
PMID: 40730644

Abstract: Stroke is a leading cause of global morbidity and mortality, disproportionately affecting lower socioeconomic groups. In this study, we evaluated three generative large language models (LLMs), GPT, Claude, and Gemini, across four stages of stroke care: prevention, diagnosis, treatment, and rehabilitation. Using three prompt engineering techniques, Zero-Shot Learning (ZSL), Chain of Thought (COT), and Talking Out Your Thoughts (TOT), we applied each to realistic stroke scenarios. Clinical experts assessed the outputs across five domains based on clinical competency benchmarks: (1) accuracy, (2) hallucinations, (3) specificity, (4) empathy, and (5) actionability. Overall, the LLMs demonstrated suboptimal performance, with inconsistent scores across domains. Each prompt engineering method showed strengths in specific areas: TOT performed well in empathy and actionability, COT was strong in structured reasoning during diagnosis, and ZSL provided concise, accurate responses with fewer hallucinations, especially in the treatment stage. However, none consistently met high clinical standards across all stages of stroke care.
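The three prompting strategies named in the abstract differ only in how the question is framed for the model. A minimal sketch of how such prompt templates might be constructed is below; the template wording and the sample scenario are hypothetical illustrations, not the prompts actually used in the study.

```python
# Hypothetical prompt templates for the three strategies compared in the
# paper: Zero-Shot Learning (ZSL), Chain of Thought (COT), and Talking Out
# Your Thoughts (TOT). The exact wording here is an assumption.

def zero_shot(scenario: str) -> str:
    """ZSL: pose the clinical question directly, with no reasoning scaffold."""
    return f"{scenario}\nAnswer:"

def chain_of_thought(scenario: str) -> str:
    """COT: instruct the model to reason step by step before answering."""
    return f"{scenario}\nLet's think step by step, then state a final answer."

def talking_out_your_thoughts(scenario: str) -> str:
    """TOT: instruct the model to verbalize its thinking conversationally."""
    return (f"{scenario}\nTalk through your thoughts out loud as you would "
            "with a patient, then summarize your recommendation.")

# Example (fictional) stroke-care scenario:
scenario = ("A 68-year-old presents with sudden left-sided weakness "
            "and slurred speech. What is the recommended next step?")
for build in (zero_shot, chain_of_thought, talking_out_your_thoughts):
    print(build(scenario))
    print("---")
```

Each function yields a different framing of the same scenario, which is what allows expert raters to score the resulting model outputs on the same five domains.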