Zero-Shot Singing Voice Synthesis from Musical Score
Journal
2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
ISBN
9798350306897
Date Issued
2023-01-01
Author(s)
Abstract
Zero-shot singing voice synthesis (SVS), the task of synthesizing the singing voice of an arbitrary target singer, has gained increasing attention in the past few years. Several recently proposed systems have demonstrated promising results on this task. However, these systems require detailed frame-level musical features as the musical content. To address this limitation, we propose a model that performs zero-shot SVS with only the musical score as the musical content condition. To help model training, we build an acoustic encoder that extracts linguistic features from audio, and train it with a lyrics transcription objective. The output of the acoustic encoder serves as an alternative to the musical score, allowing the SVS model to learn from weakly labeled data. Results suggest that the proposed method outperforms the baseline semi-supervised method in both subjective and objective tests.
Subjects
semi-weakly-supervised learning | singing voice synthesis | zero-shot
Type
conference paper
