Authors: Chen, Cheng-Che; Chen, Justin A.; Liang, Chih-Sung; Lin, Yu-Hsuan
Date available: 2025-06-09
Publication date: 2025-03
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/729937

Abstract: This study examines the capacity of six large language models (LLMs), namely GPT-4o, GPT-o1, DeepSeek-R1, Claude 3.5 Sonnet, Sonar Large (LLaMA-3.1), and Gemma-2-2b, to detect risks of domestic violence, suicide, and filicide-suicide in the Taiwanese flash fiction "Barbecue". The story, narrated by a six-year-old girl, depicts family tension and subtle cues of potential filicide-suicide through charcoal burning, a culturally recognized suicide method in Taiwan. Each model was tasked with interpreting the story's risks, with assigned roles simulating different levels of mental health expertise. Results showed that all models detected domestic violence; however, only GPT-o1, Claude 3.5 Sonnet, and Sonar Large identified the risk of suicide based on cultural cues. GPT-4o, DeepSeek-R1, and Gemma-2-2b missed the suicide risk, interpreting the mother's isolation as merely a psychological response. Notably, none of the models comprehended the cultural context behind the mother sparing her daughter, reflecting a gap in LLMs' understanding of non-Western sociocultural nuances. These findings highlight the limitations of LLMs in addressing culturally embedded risks, an ability essential for effective mental health assessments.

Language: en
Keywords: Charcoal-burning suicide; Cultural psychiatry; Domestic violence; East Asian culture; Filicide-suicide; Large language models
SDGs: SDG3; SDG5; SDG16
Summary: Large language models may struggle to detect culturally embedded filicide-suicide risks.
Type: journal article
DOI: 10.1016/j.ajp.2025.104395
PMID: 39955914