Title: Speech-to-Singing Conversion Based on Boundary Equilibrium GAN
Authors: Wu, Da-Yi; Yang, Yi-Hsuan
Type: conference paper
Publication date: 2020-01-01
Date deposited: 2023-10-11
ISSN: 2308-457X
DOI: 10.21437/Interspeech.2020-1984
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/636024
Scopus ID: 2-s2.0-85098225294 (https://api.elsevier.com/content/abstract/scopus_id/85098225294)
Keywords: Adversarial training | Encoder/decoder | Singing voice synthesis | Speech-to-singing conversion | Style transfer

Abstract: This paper investigates the use of generative adversarial network (GAN)-based models for converting a speech signal into a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and the F0 contour of the target singing output, the proposed model generates the spectrogram of a singing signal with a progressive-growing encoder/decoder architecture. Moreover, the model uses a boundary equilibrium GAN loss term so that it can learn from both paired and unpaired data. The spectrogram is finally converted into a waveform with a separate GAN-based vocoder. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
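The boundary equilibrium GAN (BEGAN) loss mentioned in the abstract balances the discriminator (an autoencoder) against the generator with an adaptive coefficient. Below is a minimal sketch of that balancing step in the standard BEGAN formulation (Berthelot et al.), not the paper's actual implementation; the function name and hyperparameter defaults (`gamma`, `lambda_k`) are illustrative assumptions.

```python
import numpy as np

def began_step(recon_real, recon_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN balancing step on precomputed reconstruction errors.

    recon_real: autoencoder reconstruction error on real samples, L(x)
    recon_fake: reconstruction error on generated samples, L(G(z))
    k: current balancing coefficient k_t
    gamma: diversity ratio enforcing E[L(G(z))] = gamma * E[L(x)]
    (names and defaults are illustrative, not from the paper)
    """
    loss_d = recon_real - k * recon_fake   # discriminator objective
    loss_g = recon_fake                    # generator objective
    # move k toward the equilibrium, clipped to [0, 1]
    k_next = float(np.clip(k + lambda_k * (gamma * recon_real - recon_fake),
                           0.0, 1.0))
    # global convergence measure used to monitor BEGAN training
    m_global = recon_real + abs(gamma * recon_real - recon_fake)
    return loss_d, loss_g, k_next, m_global
```

In training, `loss_d` and `loss_g` would drive the two optimizers while `k` is carried across iterations; the abstract suggests this equilibrium term is what lets the model mix paired and unpaired data.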