Lyrics-to-audio Alignment of Chinese Pop Songs and Rap Songs
Date Issued
2016
Author(s)
Lin, Lien-Chiao
Abstract
The system developed in this thesis takes two inputs: lyric text files and song WAV files. The goal is to automatically mark the lyrics with time codes so that each line can be displayed as it is sung during playback. The core of the system is forced alignment based on Hidden Markov Models (HMMs). The HMMs are first trained on speech data and then adapted with songs, so that the adapted models are better suited to singing voice over background music; we adopt the Maximum a Posteriori (MAP) adaptation strategy. Forced alignment requires some preprocessing of the lyrics and the audio songs, as well as an initial set of HMMs. For the lyrics, we first perform word segmentation and then look up the phone sequence of each word in a lexicon to obtain the phone sequence of the whole song. For the audio songs, we use HTK to extract Mel-scale Frequency Cepstral Coefficients (MFCCs) from the WAV files. As the initial set of models, we train 151 HMMs on the anchor reporter speech in the Mandarin Chinese Broadcast News Corpus (MATBN), collected from November 2001 to December 2002. The 151 HMMs comprise 112 initial models, 38 final models, and one silence model. The 112 initial and 38 final HMMs together are called the speech model, and the combination of the speech and silence models is called the spoken voice model (SpoModel). With the phone sequence of the lyrics, the MFCCs of the audio signal, and the initial set of HMMs, we can perform forced alignment with HTK. However, to make the models more robust against the background music, we perform MAP adaptation of the initial models on a set of training songs. We collected two types of training songs, namely Chinese pop songs and Chinese rap songs.
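The lyrics preprocessing described above (word segmentation followed by lexicon lookup) can be sketched as follows; the mini-lexicon and its initial/final phone decompositions here are illustrative assumptions, not the thesis's actual 151-phone inventory:

```python
# Sketch of the lyrics preprocessing step: each segmented word is looked up
# in a pronunciation lexicon that maps it to Mandarin initial/final phones,
# producing the phone sequence that drives forced alignment.
# NOTE: this toy lexicon and its phone decompositions are hypothetical.
LEXICON = {
    "我": ["uo"],                  # zero-initial syllable: final phone only
    "爱": ["ai"],
    "唱歌": ["ch", "ang", "g", "e"],
}

def lyrics_to_phones(segmented_words):
    """Concatenate the lexicon entries of the segmented words and wrap the
    sequence in silence symbols for the silence model."""
    phones = ["sil"]
    for word in segmented_words:
        phones.extend(LEXICON[word])
    phones.append("sil")
    return phones

print(lyrics_to_phones(["我", "爱", "唱歌"]))
```

In the real system the lexicon covers the full vocabulary and word segmentation is performed automatically; the sketch only shows how segmented words expand into the initial/final phone sequence consumed by HTK.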
This yields two sets of adapted models, called the pop song model (PopModel) and the rap song model (RapModel), respectively. We run forced alignment experiments with both sets of adapted models on test songs of both genres. The experimental results show that genre has a strong impact on the accuracy of automatic lyrics-to-audio alignment.
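The MAP adaptation step can be illustrated with the standard mean update for a single Gaussian, which interpolates between the speech-trained prior mean and the statistics of the adaptation frames; the relevance factor `tau` and the soft occupancy weights `gammas` below are generic placeholders, not the exact settings used in the thesis:

```python
# Sketch of the MAP mean update for one Gaussian: interpolate between the
# speech-trained prior mean and the occupancy-weighted average of the
# adaptation frames assigned to this state. tau (prior weight) and gammas
# (soft frame occupancies) are generic placeholders.
def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    occ = sum(gammas)                      # total soft count of frames
    dim = len(prior_mean)
    weighted = [
        sum(g * x[d] for g, x in zip(gammas, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted[d]) / (tau + occ) for d in range(dim)
    ]

print(map_adapt_mean([0.0, 0.0], [[2.0, 2.0], [4.0, 4.0]], [1.0, 1.0], tau=2.0))
```

With little adaptation data (small total occupancy) the update stays close to the speech-trained prior; with many song frames it approaches the maximum-likelihood estimate from the songs, which is why MAP adaptation behaves robustly when the pop and rap training sets differ in size.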
Subjects
forced alignment
Hidden Markov Model
Maximum a Posteriori
lyrics-to-audio alignment
Type
thesis