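李秀惠
National Taiwan University: Graduate Institute of Computer Science and Information Engineering
江文成 (Chiang, Wen-Cheng)
2010-06-09
2018-07-05
2009
U0001-0501200903341900
http://ntur.lib.ntu.edu.tw//handle/246246/185383

Chinese abstract (translated):
With the spread of the Internet, web documents have gradually accumulated into a vast virtual database. Finding the desired documents correctly and efficiently within this volume has become an important problem. Document classification is a technique commonly used in information retrieval, and the quality of the classification often affects the retrieval results. A good document classification system therefore helps administrators organize documents more conveniently and also lets users retrieve the documents they actually want more efficiently.

As trends change, web documents from different periods differ in style. Within a given field, for example, the topics studied or discussed in each period vary in popularity, which affects how often certain popular terms appear in documents. An article may also be assigned to different classes because it is understood differently in different eras. Traditional document classification systems do not take temporal characteristics into account. Intuitively, if the document collection spans several periods, the results of such a system may be unsatisfactory.

In this thesis, we first design several temporal analysis experiments to verify that such analysis can indeed improve classification results. We then propose an automatic document classification method based on temporal analysis, which uses the temporal characteristics of documents to train classifiers for different periods. Analyzing the temporal factors in advance reduces the training data the classifier needs and also yields better classification results.

English abstract:
The widespread use of the Internet has increased the amount of information that is stored on and accessible through the web. Retrieving information efficiently from this large volume is therefore becoming more and more important. Automatic Document Classification (ADC) is a common strategy for associating information with semantically meaningful classes, and it can improve retrieval efficiency. However, traditional ADC does not consider the temporal factor when constructing a classifier. New information may appear, or specific terms may disappear, over time. These characteristics can cause some documents to be classified differently at different times. We first discuss several temporal issues and design experiments to evaluate the influence of the temporal factor on classification. Finally, we propose a temporal analysis strategy that explores the optimum training set for constructing a temporal classifier.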
With the temporal analysis process, we reduce the amount of data needed to train the classifier and improve classification performance.

Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Motivation
1.2 Research Objectives
1.3 Organization of This Thesis
Chapter 2 Background
2.1 Automatic Document Classification
2.2 Support Vector Machine (SVM)
2.2.1 SVM Concepts
2.2.2 Non-Linear Classification
Chapter 3 System Architecture
3.1 System Overview
3.2 Data Extraction
3.3 Data Pre-Processing
3.4 Temporal Effects Analysis
3.5 Extraction of Optimum Training Set
Chapter 4 Characterization of Sampling and Temporal Effects
4.1 Characterizing the Sampling Effects
4.1.1 Sampling Effects of Year
4.1.2 Sampling Effect of the Whole Corpus
4.2 Characterizing and Quantifying the Temporal Effects
4.2.1 Selection of Training Data
4.2.2 Evaluation of Class Distribution
4.2.3 Evaluation of Class Similarity
Chapter 5 Experiments and Results
5.1 Pre-procedure for Exploring Optimum Training Set
5.2 Peak Accuracy Distribution
5.3 Exploring the Optimum Training Set
Chapter 6 Conclusion and Future Works
6.1 Conclusion
6.2 Discussion
6.3 Future Works
References

1943799 bytes
application/pdf
en-US
temporal analysis; optimum training set; document classification; classifier
基於時間分析之文件自動分類系統 (Automatic Document Classification Based on Temporal Analysis)
thesis
http://ntur.lib.ntu.edu.tw/bitstream/246246/185383/1/ntu-98-R95922038-1.pdf
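
The abstract's central idea, that a classifier trained only on documents from a suitable time span can outperform one trained on the whole corpus, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy corpus, the function names, the choice of a symmetric window of years around a target year, and the use of validation accuracy to pick the window. A simple term-count centroid classifier with cosine similarity stands in for the SVM listed in the table of contents, so that the example needs only the Python standard library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (year, tokens, class label).
# "neural" is a hardware term around 1990 and an AI term around 2008.
TRAIN = [
    (1990, ["neural"], "hardware"),
    (1990, ["neural", "chip"], "hardware"),
    (1991, ["logic", "proof"], "theory"),
    (2008, ["neural", "learning"], "ai"),
    (2008, ["neural", "network"], "ai"),
    (2009, ["logic", "proof"], "theory"),
]
# Hypothetical labelled documents from the target year 2009: (tokens, label).
VALIDATION = [(["neural"], "ai"), (["proof"], "theory")]

def select_by_window(docs, target_year, window):
    """Keep documents dated within `window` years of `target_year`."""
    return [d for d in docs if abs(d[0] - target_year) <= window]

def train_centroids(docs):
    """Sum term counts per class (a stand-in for training an SVM)."""
    centroids = defaultdict(Counter)
    for _year, tokens, label in docs:
        centroids[label].update(tokens)
    return centroids

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(centroids, tokens):
    """Return the class whose centroid is most similar to the document."""
    query = Counter(tokens)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(centroids), key=lambda c: cosine(query, centroids[c]))

def accuracy(centroids, labelled):
    """Fraction of labelled documents classified correctly."""
    hits = sum(classify(centroids, toks) == lab for toks, lab in labelled)
    return hits / len(labelled)

def best_window(train, validation, target_year, max_window):
    """Try each window size; return (window, accuracy) of the best one.

    Ties are resolved in favour of the smaller window, i.e. less training data.
    """
    best = (None, -1.0)
    for window in range(max_window + 1):
        subset = select_by_window(train, target_year, window)
        if not subset:
            continue
        score = accuracy(train_centroids(subset), validation)
        if score > best[1]:
            best = (window, score)
    return best

print(accuracy(train_centroids(TRAIN), VALIDATION))  # whole corpus: 0.5
print(best_window(TRAIN, VALIDATION, 2009, 19))      # (1, 1.0)
```

On this toy data the full corpus misclassifies the 2009 document containing "neural", because the older hardware documents dominate that term, while a one-year window uses half the training documents and classifies both validation documents correctly. The numbers only illustrate the mechanism and say nothing about the results reported in the thesis.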