OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global Learners

Authors: Hsieh, Wen-Wei; Chi, Hao-Wei; Wang, Kuan-Chen; Yeh, Ping-Cheng; Chiang, Chen-Yu; Liu, Te-Hsin
Dates (as listed in the record): 2025-11-24; 2025-11-24; 2025-08-17
ISSN: 2308-457X
DOI: 10.21437/interspeech.2025-983
Scopus ID: 2-s2.0-105020067660
Scopus: https://www.scopus.com/record/display.uri?eid=2-s2.0-105020067660&origin=resultslist
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/734058
Type: conference paper
Keywords: computer-aided pronunciation training (CAPT); corpus; deep learning; Mandarin; second language (L2)
SDGs: SDG4

Abstract: This paper introduces OMPAL, a new open-source Mandarin corpus specifically designed for non-native pronunciation assessment. This corpus comprises 1,768 Mandarin utterances from French L1 speakers learning Mandarin, each meticulously annotated by four experts with professional Mandarin teaching experience at both the word and sentence levels. We also provide a manual scoring system to assist researchers in constructing related corpora. Furthermore, a baseline model for pronunciation assessment, which is publicly accessible, is provided alongside our corpus. The OMPAL corpus, available for commercial and non-commercial use, is designed to support and enhance speech research across various applications. We believe that OMPAL will be a valuable resource for the speech research community.