Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling

Advisor: 曹承礎
Affiliation: National Taiwan University, Graduate Institute of Information Management
Author: 陳冠宇 (Chen, Kuan-Yu)
Record dates: 2007-11-26; 2018-06-29
Year: 2005
URI: http://ntur.lib.ntu.edu.tw//handle/246246/54307

Abstract:
Topic detection is one research area within Topic Detection and Tracking (TDT), which seeks to develop technologies that search, organize, and structure news-oriented textual materials from broadcast news media. This work addresses the detection of "hot" topics: topics that are frequently discussed and reported within a given period of time. Prior work on hot topic extraction introduced a term-weighting scheme called TF*PDF, which extracts the "hot terms" that describe hot topics. That approach has two problems: (1) because TF*PDF determines weights solely from term frequency and proportional document frequency, the extracted hot terms can be unreliable; (2) a single sentence vector is not sufficient to represent the meaning of a sentence.

We propose an improved hot topic extraction system that addresses both problems. First, we extract hot terms by recording how each term's usage varies over time; in other words, tracking the life cycle of a term helps us decide whether it is a term that genuinely describes a hot topic. Second, we use multidimensional sentence vectors to represent the information in a sentence. Finally, we cluster the sentences of all news reports, and each cluster represents a news topic. Experimental results show that these two improvements not only raise the quality of each cluster but also extract most of the actual hot topics over a period of time.

Table of contents:
Chapter 1 Introduction
    1.1 Motivation
    1.2 Objective
    1.3 Organization
Chapter 2 Literature Review
    2.1 The Tasks of TDT Program
        2.1.1 The Definition of "Topic"
        2.1.2 Topic Tracking
        2.1.3 Topic Detection
        2.1.4 New Event Detection
        2.1.5 Story Segmentation
        2.1.6 Story Link Detection
    2.2 Term-Weighting Schemes
    2.2 Topic Extraction with TF*PDF
    2.3 Event Detection with Temporal Information
Chapter 3 System Design
    3.1 System Architecture
    3.2 Text Preprocessing
    3.3 Hot Term Generator
    3.4 Sentence Modeling
Chapter 4 Experiment Analysis
    4.1 Data Source
    4.2 System Parameter
    4.3 Term Weighting Analysis
    4.3 Sentence Clustering Analysis
Chapter 5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work
Bibliography

Language: en-US
Keywords: Topic Detection; Hot Topic Extraction; TF*PDF; Hot Term; Multidimensional Sentence Vector
Title (Chinese): 以時間分析與多維度語句呈現為基礎之熱門話題萃取
Title (English): Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling
Type: other
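Note on the TF*PDF baseline: the abstract names TF*PDF as the prior term-weighting scheme but this record does not reproduce its formula. The sketch below uses the form in which TF*PDF is commonly stated: for each news channel, a term's frequency is normalized by the Euclidean norm of all term frequencies in that channel, multiplied by the exponential of the proportion of that channel's documents containing the term, and the products are summed over channels. This is an assumption to be checked against the thesis itself. The function name `tf_pdf` and the input layout (a dict mapping a channel name to a list of tokenized documents) are hypothetical and chosen only for illustration. The sketch covers the baseline only, not the timeline-based hot term extraction the thesis proposes.

```python
import math
from collections import Counter, defaultdict


def tf_pdf(channels):
    """Return a TF*PDF-style weight for every term.

    channels: dict mapping a channel name to a list of documents,
              where each document is a list of tokens (assumed layout).
    """
    weights = defaultdict(float)
    for docs in channels.values():
        if not docs:
            continue  # an empty channel contributes nothing
        term_freq = Counter()  # occurrences of each term in this channel
        doc_freq = Counter()   # number of documents containing each term
        for doc in docs:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        # Euclidean norm of the channel's term-frequency vector
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue  # channel has documents but no tokens
        for term, freq in term_freq.items():
            proportion = doc_freq[term] / len(docs)
            weights[term] += (freq / norm) * math.exp(proportion)
    return dict(weights)


# Toy usage with two hypothetical channels.
example = {
    "channel_a": [["quake", "relief"], ["quake"]],
    "channel_b": [["quake", "election"]],
}
scores = tf_pdf(example)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, "quake" receives the highest weight because it occurs in every document of both channels. The weight depends only on frequency and document proportion, with no notion of when a term was used, which is the limitation the abstract attributes to TF*PDF.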