Hot Topic Extraction with Timeline Analysis and Multidimensional Sentence Modeling
Date Issued
2005
Date
2005
Author(s)
Chen, Kuan-Yu
DOI
en-US
Abstract
Topic detection is part of the Topic Detection and Tracking field, which seeks to develop technologies that search, organize, and structure news-oriented textual materials from various broadcast news media. We are interested in detecting “hot” topics, that is, topics that are frequently discussed in a given period of time. Prior work on hot topic extraction introduced a term-weighting scheme called TF*PDF, which extracts “hot” terms that can describe hot topics. One problem with extracting hot topics using TF*PDF is that the results are unreliable when a term's weight is determined solely by term frequency and document frequency. Another problem is that representing a sentence with a single vector misrepresents its meaning.
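For orientation, the TF*PDF weighting mentioned above can be sketched as follows. This is a minimal illustration based on the formulation commonly given for TF*PDF, in which the weight of a term is the sum over news channels of its normalized frequency in the channel multiplied by the exponential of the fraction of that channel's documents containing it. The function name `tf_pdf`, the input layout, and the pre-tokenized documents are assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_pdf(channels):
    """Compute TF*PDF-style weights (illustrative sketch).

    channels: dict mapping a channel name to a list of documents,
              where each document is a list of tokens (assumed layout).
    Returns a dict mapping each term to its summed weight.
    """
    weights = Counter()
    for docs in channels.values():
        if not docs:
            continue  # an empty channel contributes nothing
        term_freq = Counter()  # occurrences of each term in the channel
        doc_freq = Counter()   # documents in the channel containing the term
        for doc in docs:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        # Normalize frequencies by the Euclidean norm within the channel.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue
        for term, freq in term_freq.items():
            proportion = doc_freq[term] / len(docs)
            weights[term] += (freq / norm) * math.exp(proportion)
    return dict(weights)


sample = {
    "channel_a": [["quake", "quake", "rescue"], ["quake", "sports"]],
    "channel_b": [["quake", "election"], ["election", "vote"]],
}
scores = tf_pdf(sample)
top_term = max(scores, key=scores.get)
```

Because the weight depends only on frequency counts, a term that is merely common across the whole period can score as highly as one tied to a specific event, which is the reliability problem the abstract points to.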
We propose a hot topic extraction system that aims to solve the two problems mentioned above. First, we extract hot terms by capturing how their distribution varies over a timeline. In other words, tracking the life cycle of each term helps us determine which terms are real hot terms that describe a hot topic. Second, we use multi-dimensional sentence vectors to represent the information in a sentence. Finally, we group the sentences of news reports into clusters, each of which represents a hot topic. Clustering the sentences by their multi-dimensional sentence vectors not only improves the quality of each cluster, but also extracts most of the actual hot topics over a period of time.
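The idea of tracking a term's life cycle over a timeline can be illustrated with a small sketch. The abstract does not specify how the time distribution is measured, so everything here is an assumption for illustration: documents are given as (day index, tokens) pairs, a term's timeline is the number of documents per day that contain it, and the hypothetical `burst_score` simply divides the peak daily count by the mean daily count. It is not the scoring used in the thesis.

```python
from collections import Counter, defaultdict


def term_timelines(dated_docs):
    """Build a per-term timeline (illustrative sketch).

    dated_docs: iterable of (day_index, tokens) pairs (assumed layout).
    Returns a dict mapping each term to a Counter of day -> document count.
    """
    timelines = defaultdict(Counter)
    for day, tokens in dated_docs:
        for term in set(tokens):  # count each document once per term
            timelines[term][day] += 1
    return timelines


def burst_score(timeline, num_days):
    """Hypothetical burstiness measure: peak daily count / mean daily count."""
    total = sum(timeline.values())
    if total == 0 or num_days <= 0:
        return 0.0
    mean = total / num_days
    return max(timeline.values()) / mean


docs = [
    (0, ["quake", "said"]),
    (1, ["quake", "said"]),
    (1, ["quake"]),
    (1, ["quake"]),
    (2, ["said"]),
    (3, ["said"]),
]
timelines = term_timelines(docs)
quake_score = burst_score(timelines["quake"], num_days=4)
said_score = burst_score(timelines["said"], num_days=4)
```

In this toy data both terms appear in four documents, but "quake" is concentrated on one day while "said" is spread evenly, so only the shape of the timeline separates them.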
Subjects
Topic Detection
Hot Topic Extraction
TF*PDF
Hot Term
Multidimensional Sentence Vector
Type
other
