Audio Information Based Wedding Video Indexing (基於聲音資訊的婚禮影片索引)

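Advisor: 吳家麟 (Wu, Ja-Ling)
Institution: National Taiwan University, Graduate Institute of Computer Science and Information Engineering
Author: 方劭彥 (Fang, Shao-Yen)
Record dates: 2010-06-02, 2018-07-05
Year: 2008
Identifier: U0001-2801200814470500
URI: http://ntur.lib.ntu.edu.tw//handle/246246/184861

Abstract:

More and more people use digital camcorders to record the special events in their lives. A wedding, for example, is an important ceremony at which distant relatives and long-unseen friends take the chance to gather, so it is commonly filmed as a keepsake; the result is the so-called wedding video. Such videos, however, are often left on storage media and never watched again. Raw, unorganized footage is not easy to watch, so video summarization is needed.

Traditional video summarization often relies on visual information such as dominant color, camera motion, and scene change, but these techniques do not work well on wedding videos. By contrast, the audio signal of a wedding video carries rich information. Environmental noise is unavoidable when a wedding is filmed, yet most audio signal processing techniques are developed for clean, noise-free conditions and are not very accurate under noise. We therefore propose methods that can resist the influence of noise.

First, features are extracted from the audio signal and a suitable feature set is selected for speech/music discrimination. For the music portions, a recommended feature set is then selected for vocal/non-vocal discrimination. The results of these two discriminations are smoothed with a moving average, and the input audio signal is cut into segments. In the speech portions, relative silence detection locates the sentence boundaries, and speaker change detection based on the differences between sentences is then performed. Clap detection is run on the input audio signal as well. Combining these results, we match the segments obtained from the two discriminations to wedding events one by one.

The experiments yield a table of feature fields for each segment. The fields of a segment describe whether it is speech or music, vocal or non-vocal, whether a speaker change occurs, and whether applause occurs. Each segment is matched to a wedding event according to its feature fields and the nature of the event. For example, a music segment without vocals is matched to an instrumental performance, and a speech segment with no speaker change and no detected applause is matched to the pastor's prayer.

Because the nature of wedding events is not yet fully understood, and because the segmentation and the other acoustic event detectors may introduce errors, the matching result still differs from the ground truth to some degree. To improve the accuracy of wedding event matching, we propose a simple error correction mechanism. Like any event sequence with a story structure, which has an introduction, a development, a turn, and a conclusion, wedding events follow such a structure. We therefore cluster the sequence of matched wedding events and treat the segments that cannot be clustered as errors. Our goal is to correct these errors: the erroneous segments are re-matched with appropriate matching rules. Figure 5 shows the result for one wedding video after refinement, in which one erroneous event is updated to the correct event and the accuracy improves.

In this thesis, wedding videos are divided into segments by the proposed noise-resistant speech/music discrimination and vocal/non-vocal discrimination. These segments are then labeled with the corresponding wedding events with the help of the proposed speaker change detection and clap detection. Finally, a refinement mechanism corrects the errors that do not fit the wedding structure.

Abstract (English version):

People tend to use digital video recorders to capture their lives. A wedding, for example, is one of the important ceremonies in life, and people usually film a video to commemorate it. Such videos, however, are usually put into storage and never watched again, because raw video is hard to turn into a compelling video story. Video summarization is therefore needed. Visual information such as dominant color, motion, and scene change is commonly used in traditional video summarization, but it does not apply well to wedding videos.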
The audio information, on the other hand, is meaningful. Noise is hard to avoid in wedding videos; however, most audio processing techniques in the literature, such as speech/music discrimination, are studied in clean environments, and their performance is not good enough in the presence of noise. We therefore develop noise-resistant speech/music discrimination and vocal/non-vocal discrimination. In addition, in contrast to other papers that apply low-level acoustic features, we combine the results of speaker change detection and clap detection with our wedding event matching procedure. Unlike other papers, which focus on signal processing, we apply a refinement algorithm that corrects mismatched events to improve the performance of the proposed work.

In this thesis, the given wedding videos are divided into segments by the proposed speech/music discrimination and vocal/non-vocal discrimination, both of which can resist a noisy environment. The resulting segments are then labeled with the associated wedding events, assisted by the proposed speaker change detection and clap detection.
Finally, the labeled events are revised by our refinement algorithm, which tries to re-match the mismatched events that do not fit the wedding structure.

Table of contents:

Chapter 1 Introduction
  Section 1.1 Framework
Chapter 2 Related Work
  Section 2.1 Speech/music discrimination
  Section 2.2 Speaker change detection
  Section 2.3 Clap detection
Chapter 3 Discrimination and Acoustic Event Detection
  Section 3.1 Noise reduction
  Section 3.2 Speech/music discrimination
    Section 3.2.1 Decision method
      Section 3.2.1.1 Rule based approach of decision
      Section 3.2.1.2 Voting schemes
      Section 3.2.1.3 Machine learning based classification
        Section 3.2.1.3.1 Moving average
    Section 3.2.2 The discovery of effective features
      Section 3.2.2.1 Line number
      Section 3.2.2.2 Likelihood ratio crossing rate
      Section 3.2.2.3 Low energy crossing rate
    Section 3.2.3 More features
  Section 3.3 Vocal/non-vocal discrimination
  Section 3.4 Silence detection
  Section 3.5 Speaker change detection
  Section 3.6 Clap detection
    Section 3.6.1 Clap property
Chapter 4 Wedding event matching
  Section 4.1 Matching rules
  Section 4.2 Refinement
Chapter 5 Conclusions and future work
Reference

Format: application/pdf
Size: 1312185 bytes
Language: en-US
Keywords: wedding event matching, speech/music discrimination, vocal/non-vocal discrimination, moving average, silence detection, speaker change detection, clap detection
Title: Audio Information Based Wedding Video Indexing (基於聲音資訊的婚禮影片索引)
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/184861/1/ntu-97-J94922025-1.pdf
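
Illustrative sketch: the abstract describes smoothing the frame-level discrimination results with a moving average, cutting the audio into segments, and then matching each segment to a wedding event by rule. The Python below is a minimal sketch of that flow, not the thesis's implementation. The function names, the window size of 5, the 0.5 threshold, and the 0/1 frame input are all assumptions made for illustration, and only the two matching rules quoted in the abstract are encoded.

```python
from itertools import groupby


def moving_average(values, window):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def segment(frame_is_music, window=5, threshold=0.5):
    """Smooth per-frame 0/1 decisions, then merge runs of equal labels.

    Returns (label, start_frame, end_frame) tuples, end exclusive.
    The window and threshold are assumed values, not taken from the thesis.
    """
    smoothed = moving_average(frame_is_music, window)
    labels = ["music" if s > threshold else "speech" for s in smoothed]
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def match_event(kind, vocal, speaker_change, clap):
    """Apply the two example rules stated in the abstract.

    Any other combination is left unmatched here, because the abstract
    does not state the remaining rules.
    """
    if kind == "music" and not vocal:
        return "instrumental performance"
    if kind == "speech" and not speaker_change and not clap:
        return "pastor's prayer"
    return "unmatched"


# Hypothetical frame decisions (1 = music, 0 = speech) with isolated errors.
frames = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
print(segment(frames))
print(match_event("music", vocal=False, speaker_change=False, clap=False))
```

In this sketch the smoothing removes the isolated misclassified frames, so the input collapses into one speech segment followed by one music segment. A refinement pass such as the one the abstract describes would then operate on the sequence of matched events.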