Mandarin Singing Voice Synthesis with a Phonology-based Duration Model

Authors: Fu Rong Yang; Yin Ping Cho; Yi-Hsuan Yang; Da Yi Wu; Shan Hung Wu; Yi Wen Liu
Dates: 2023-10-06; 2021-01-01
ISBN: 9789881476890
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/635990
Type: conference paper
Scopus ID: 2-s2.0-85126714751
Scopus URL: https://api.elsevier.com/content/abstract/scopus_id/85126714751

Abstract: Singing voice synthesis (SVS) systems are built to generate human-like voice signals from lyrics and the corresponding musical scores. In most SVS systems, an auxiliary neural network-based duration model is employed to control the duration of phonemes. In this paper, a rule-based algorithm inspired by Mandarin phonology is proposed for duration modeling in Mandarin SVS. Specifically, the algorithm infers the duration of an 'initial' consonant by looking up syllables in an existing training set that begin with the same consonant and have similar note lengths, and then computing the average consonant duration. We build the rest of the SVS system around this duration model, using a combination of Tacotron2 and Parallel WaveGAN as the backbone for their favorable data efficiency on small datasets. Experimental results show that the singing voice synthesized with the proposed duration model is more expressive than that synthesized with a learning-based model. Moreover, since Mandarin is a tonal language, taking tonality into account further enhances the naturalness of the generated voices.
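The lookup-and-average rule described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the record layout (`SyllableRecord`), the function name, the relative tolerance `rel_tol` used to decide which note lengths count as "similar", and the fallback to all syllables with the same initial are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class SyllableRecord:
    """One training-set syllable (hypothetical layout, not from the paper)."""
    initial: str             # initial consonant label, e.g. "sh"
    note_length: float       # length of the note carrying the syllable, in seconds
    initial_duration: float  # measured duration of the initial consonant, in seconds


def estimate_initial_duration(
    records: Sequence[SyllableRecord],
    initial: str,
    note_length: float,
    rel_tol: float = 0.2,    # assumed definition of "similar note length"
) -> Optional[float]:
    """Average the consonant duration over matching training syllables."""
    # Keep only syllables that begin with the same initial consonant.
    same_initial = [r for r in records if r.initial == initial]
    if not same_initial:
        return None  # no evidence for this consonant in the training set

    # Among those, keep syllables whose note length is close to the query.
    similar = [
        r for r in same_initial
        if abs(r.note_length - note_length) <= rel_tol * note_length
    ]

    # Assumed fallback: with no similar note length, use every syllable
    # that shares the initial consonant.
    pool = similar or same_initial
    return mean([r.initial_duration for r in pool])
```

Called with the training records, a consonant label, and the target note length, the function returns an estimated consonant duration in seconds, or `None` when the consonant never occurs in the training set. A table lookup of this kind needs no training, which is in keeping with the abstract's emphasis on small datasets.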