Speech-to-singing conversion based on boundary equilibrium GAN
Journal
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Journal Volume
2020-October
Date Issued
2020-01-01
Author(s)
Wu, Da Yi
Abstract
This paper investigates the use of generative adversarial network (GAN)-based models for converting a speech signal into a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and the F0 contour of the target singing output, the proposed model generates the spectrogram of a singing signal with a progressive-growing encoder/decoder architecture. Moreover, the model uses a boundary equilibrium GAN loss term so that it can learn from both paired and unpaired data. The spectrogram is finally converted into a waveform with a separate GAN-based vocoder. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
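For readers unfamiliar with the boundary equilibrium GAN (BEGAN) loss mentioned in the abstract, the following is a minimal sketch of the generic BEGAN update as commonly formulated, not the exact loss used in this paper. It assumes the discriminator is an autoencoder whose reconstruction errors on real and generated spectrograms have already been computed as scalars. The function name `began_step` and the default values of `gamma` and `lambda_k` are illustrative assumptions, and how the paper combines this term with paired and unpaired data is not shown.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One generic BEGAN bookkeeping step (hypothetical helper).

    loss_real: autoencoder reconstruction error on real samples.
    loss_fake: autoencoder reconstruction error on generated samples.
    k:         current balance coefficient, kept in [0, 1].
    gamma, lambda_k: assumed hyperparameters, not taken from the paper.
    """
    # Discriminator reconstructs real data well and generated data poorly.
    d_loss = loss_real - k * loss_fake
    # Generator tries to make its samples easy to reconstruct.
    g_loss = loss_fake
    # Equilibrium term: zero when loss_fake equals gamma * loss_real.
    balance = gamma * loss_real - loss_fake
    # Proportional control update of k, clipped to [0, 1].
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)
    # Global convergence measure often used to monitor training.
    convergence = loss_real + abs(balance)
    return d_loss, g_loss, k_next, convergence


if __name__ == "__main__":
    print(began_step(loss_real=1.0, loss_fake=0.4, k=0.0))
```

In a training loop, `d_loss` and `g_loss` would be minimized by the discriminator and generator respectively, with `k_next` carried over to the following step.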
Subjects
Adversarial training | Encoder/decoder | Singing voice synthesis | Speech-to-singing conversion | Style transfer
Type
conference paper