Contributor: 陳信希
Affiliation: 臺灣大學:資訊工程學研究所 (National Taiwan University, Graduate Institute of Computer Science and Information Engineering)
Author: 嚴聖筌 (Yen, Sheng-Chuan)
Record dates: 2007-11-26, 2018-07-05
Year: 2006
URI: http://ntur.lib.ntu.edu.tw//handle/246246/53942

Abstract:
Recently, the simple and convenient interface of weblogs (blogs) has led more and more people to share their daily lives, opinions, and feelings online. According to a report from Technorati, a well-known blog search engine, the number of blogs has reached 70 million, and on average about 1.4 new blogs are created every second. Blogs have become a very popular research topic, and blog search in particular has attracted many researchers' attention.

"What do people want to search for in the blogosphere?" This question motivates our work. In blog search, users are interested not only in professional pages but also in ordinary articles that contain personal opinions and perspectives. In investigating the blogosphere, we found that half of blog posts are reposted content rather than text written by the bloggers themselves. Most of this content is copied from news websites or official websites. We call this behavior implicit trackback: reposting that does not follow the blog trackback protocol, that is, a trackback with no explicit link to the referenced entry and no acknowledgement sent to the original site.

In this thesis, we analyze and discuss the characteristics and behavior of implicit trackbacks in the blogosphere and propose several efficient approaches to distinguish reposted from original blog posts, that is, to identify implicit trackbacks. We then use the features of implicit trackbacks to implement a blog recommendation system that supports blog search.
Finally, we summarize the experimental results on identifying implicit trackbacks and conclude with some unsolved but interesting problems for future study.

Table of contents:
CHAPTER 1 INTRODUCTION
  1.1 MOTIVATION
  1.2 BACKGROUND
    1.2.1 Blog
    1.2.2 Trackback
      1.2.2.1 Explicit Trackback
      1.2.2.2 Implicit Trackback
  1.3 RESEARCH PROBLEM
  1.4 ORGANIZATION OF THE THESIS
CHAPTER 2 RELATED WORKS
  2.1 NEAR-DUPLICATE DETECTION
    2.1.1 Full-text-Based Methods
    2.1.2 Clustering-Based Methods
    2.1.3 Fingerprint/Signature-Based Methods
  2.2 SENTIMENT ANALYSIS IN BLOGS
  2.3 BLOG APPLICATIONS
CHAPTER 3 BLOG RECOMMENDATION SYSTEM
  3.1 OVERVIEW
  3.2 EXTRACTION LAYER
  3.3 PRE-PROCESSING LAYER
  3.4 CLASSIFICATION LAYER
    3.4.1 Method 1: Snippet Matching
    3.4.2 Method 2: Employing Domain Names
    3.4.3 Method 3: Using Neighbor Segment
    3.4.4 Method 4: Using URL Similarity
    3.4.5 Method 5: Learning-Based Approach
      3.4.5.1 Features Selection
        3.4.5.1.1 Local Features
        3.4.5.1.2 Global Features
      3.4.5.2 Classification Methods
  3.5 SUMMARIZATION LAYER
    3.5.1 Clustering
    3.5.2 Opinion Ranking
  3.6 PRESENTATION LAYER
CHAPTER 4 EXPERIMENTS AND DISCUSSION
  4.1 SOURCE
  4.2 ANSWERS ANNOTATION
  4.3 EXPERIMENTAL CORPUS ANALYSIS
    4.3.1 Document Level
    4.3.2 Segment Level
  4.4 EVALUATION
    4.4.1 Evaluation Metric
    4.4.2 Experiment in Method 1: Snippet Matching
      4.4.2.1 Experiment Setup
      4.4.2.2 Experiment Results and Discussions
    4.4.3 Experiment in Method 2: Employing Domain Names
      4.4.3.1 Experiment Setup
      4.4.3.2 Experiment Results and Discussions
    4.4.4 Experiment in Method 3: Using Neighbor Segment
      4.4.4.1 Experiment Setup
      4.4.4.2 Experiment Results and Discussions
    4.4.5 Experiment in Method 4: Using URL Similarity
    4.4.6 Experiment in Method 5: Learning-Based Approach
      4.4.6.1 Experiment Setup
      4.4.6.2 Experiment Results and Discussions
  4.5 PERFORMANCE OF ALL METHODS
CHAPTER 5 CONCLUSION AND FUTURE WORKS
  5.1 CONCLUSION
  5.2 FUTURE WORKS
REFERENCES
Language: en-US
Keywords: 部落格 (weblog); 隱含式轉錄 (implicit trackback); 部落格搜尋 (weblog search)
Title: 網路日誌轉錄辨識與搜尋應用之研究 (Identifying Implicit Trackback in Weblogs and Its Application on Weblog Search)
Type: thesis