Lyrics-to-audio Alignment of Chinese Pop Songs and Rap Songs
Date Issued
2016
Author(s)
Lin, Lien-Chiao
Abstract
The system developed in this thesis takes two inputs: lyric text files and song WAV files. The goal is to automatically mark the lyrics with time codes so that each line can be displayed as it is sung during playback. The core of the system is forced alignment based on Hidden Markov Models (HMMs). The HMMs are first trained on speech data and then adapted with songs, so that the adapted models are better suited to singing voice over background music; we adopt the Maximum a Posteriori (MAP) adaptation strategy. Forced alignment requires some preprocessing of the lyrics and the audio songs, as well as an initial set of HMMs. For the lyrics, we first perform word segmentation and then look up the phone sequence of each word in a lexicon to obtain the phone sequence of the whole song. For the audio songs, we use HTK to extract Mel-scale Frequency Cepstral Coefficients (MFCCs) from the WAV files. As the initial set of models, we train 151 HMMs on the anchor reporter speech in the Mandarin Chinese Broadcast News Corpus (MATBN), collected from November 2001 to December 2002. The 151 HMMs comprise 112 initial models, 38 final models, and one silence model. The 112 initial and 38 final HMMs together are called the speech model, and the combination of the speech and silence models is called the spoken voice model (SpoModel). With the phone sequence of the lyrics, the MFCCs of the audio signal, and the initial set of HMMs, we can perform forced alignment with HTK. However, to make the models more robust against the background music, we perform MAP adaptation of the initial models on a set of training songs. We collected two types of training songs, namely Chinese pop songs and Chinese rap songs.
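The lyrics preprocessing described above (word segmentation followed by lexicon lookup) can be sketched as follows; the mini-lexicon and its initial/final phone decompositions here are illustrative assumptions, not the thesis's actual 151-phone inventory:

```python
# Sketch of the lyrics preprocessing step: each segmented word is looked up
# in a pronunciation lexicon that maps it to Mandarin initial/final phones,
# producing the phone sequence that drives forced alignment.
# NOTE: this toy lexicon and its phone decompositions are hypothetical.
LEXICON = {
    "我": ["uo"],                  # zero-initial syllable: final phone only
    "爱": ["ai"],
    "唱歌": ["ch", "ang", "g", "e"],
}

def lyrics_to_phones(segmented_words):
    """Concatenate the lexicon entries of the segmented words and wrap the
    sequence in silence symbols for the silence model."""
    phones = ["sil"]
    for word in segmented_words:
        phones.extend(LEXICON[word])
    phones.append("sil")
    return phones

print(lyrics_to_phones(["我", "爱", "唱歌"]))
```

In the real system the lexicon covers the full vocabulary and word segmentation is performed automatically; the sketch only shows how segmented words expand into the initial/final phone sequence consumed by HTK.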
This yields two sets of adapted models, called the pop song model (PopModel) and the rap song model (RapModel), respectively. We run forced alignment experiments with both sets of adapted models on test songs of both genres. The experimental results show that genre has a strong impact on the accuracy of automatic lyrics-to-audio alignment.
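The MAP adaptation step can be illustrated with the standard mean update for a single Gaussian, which interpolates between the speech-trained prior mean and the statistics of the adaptation frames; the relevance factor `tau` and the soft occupancy weights `gammas` below are generic placeholders, not the exact settings used in the thesis:

```python
# Sketch of the MAP mean update for one Gaussian: interpolate between the
# speech-trained prior mean and the occupancy-weighted average of the
# adaptation frames assigned to this state. tau (prior weight) and gammas
# (soft frame occupancies) are generic placeholders.
def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    occ = sum(gammas)                      # total soft count of frames
    dim = len(prior_mean)
    weighted = [
        sum(g * x[d] for g, x in zip(gammas, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted[d]) / (tau + occ) for d in range(dim)
    ]

print(map_adapt_mean([0.0, 0.0], [[2.0, 2.0], [4.0, 4.0]], [1.0, 1.0], tau=2.0))
```

With little adaptation data (small total occupancy) the update stays close to the speech-trained prior; with many song frames it approaches the maximum-likelihood estimate from the songs, which is why MAP adaptation behaves robustly when the pop and rap training sets differ in size.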
Subjects
forced alignment
Hidden Markov Model
Maximum a Posteriori
lyrics-to-audio alignment
Type
thesis