PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network

Wang, Bryan; YI-HSUAN YANG

PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network

Journal

33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019

ISBN

9781577358091

Date Issued

2019-01-01

Author(s)

Wang, Bryan

YI-HSUAN YANG

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/636066

URL

https://api.elsevier.com/content/abstract/scopus_id/85074948947

Abstract

Music creation is typically composed of two parts: composing the musical score, and then performing the score with instruments to make sounds. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music called the pianorolls and an audio representation of music called the spectrograms. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between pianorolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the result of a user study that shows our model achieves higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two off-the-shelf synthesizers.

Type

conference paper

PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network

關於 (About)

聯絡資訊 (Contact Us)

相關網站 (Useful Links)

關於開放取用 (Open Access, OA)

出版社期刊論文授權政策 (Copyright)

使用說明 (Instructions)

登入說明 (Sign-in)

匯入著作 (Submission)