項潔臺灣大學:資訊工程學研究所吳宗和Wu, Tsung-HoTsung-HoWu2007-11-262018-07-052007-11-262018-07-052006http://ntur.lib.ntu.edu.tw//handle/246246/54121過去的研究資料分群上,很少研究著重在類別性資料上的分群需求,其最大主因是因為類別性資料並沒有如數值性資料有直觀的相似度測量標準。在本論文中,我們提出了以群的觀點下,應用資訊熵合併後的成長差異,並且以此差異作為群之間的距離,應用引力分群的觀念,達到建立分群樹的目的。 另外我們認為分群時,群數的決定應該參考使用者提供的意見,我們認為使用者需要的是此分群結果最多存在幾群,並且在此上限內由機器決定最適合的群數,增加機器選擇群數的彈性,如此機器便不會需要強迫群做不必要的合併以達到特定群數,而在此想法下我們讓機器在應用引力分群時,找出最適合群數,並且為了確定此分群結果可以避免區域最佳解,我們使用隨機交換做各群的純化,拉大各群在我們定義的距離,最後能得到更好的分群結果。 而實驗結果則顯示我們的方法能有效地決定群數,符合使用者的期望,並且在此群數下能有很好且穩定的分群效果。Data clustering is an important research subject for identifying important properties of data sets. However, existing literature have been focusing mainly on numerical data. In this thesis, we have proposed a method to cluster categorical data. Our method proposes a similarity measurement for categorical data based on information entropy. It measures the difference between the original information entropy and the entropy after merging the data into clusters. We use this measurement as the distance between clusters and apply the gravity theory to obtain the final clusters. To avoid the problem of local optimum, we have incorporated randomizing schemes into our algorithm. The notion of digital search trees is also utilized to speedup the clustering process. Experiments using UCI ML repository data show that our algorithm produces results is quite competitive with those in the existing literature.Abstract........................................................................................................................1 第一章 緒論................................................................................................................5 1.1 資料分群的起由..........................................................................................5 1.2 數值性資料與類別性資料分群演算法的差異..........................................6 1.3 研究動機......................................................................................................6 1.4 研究目的......................................................................................................7 1.5 論文架構......................................................................................................8 第二章 類別性資料分群之相關研究.......................................................................9 2.1 符號與問題定義..........................................................................................9 2.1.1 符號定義..........................................................................................9 2.1.2 問題定義..........................................................................................9 2.2 類別性資料分群演算法之相關研究........................................................11 2.2.1 K-modes algorithm........................................................................11 2.2.2 Monte-Carlo algorithm..................................................................12 2.2.3 ROCK algorithm............................................................................13 2.2.4 CACTUS algorithm.......................................................................14 2.2.5 COOLCAT algorithm.....................................................................14 2.2.6 ACE algorithm...............................................................................15 2.3 相關研究總結............................................................................................16 第三章 混合引力行為與隨機交換之演算法..........................................................17 3.1 聚合型之引力演算法..............................................................................17 3.1.1 引力演算法..................................................................................18 3.1.2 兩群的相似度定義......................................................................19 3.1.3 由相似度延伸成距離定義..........................................................21 3.1.4 引力演算法演算過程..................................................................22 3.1.5 適當群數的選擇..........................................................................23 3.2 快速隨機交換演算法..............................................................................24 3.2.1 隨機演算法之問題......................................................................24 3.2.2 使用數字搜尋樹的區域性加快隨機交換..................................25 3.2.3 定義好的交換結果......................................................................27 3.2.4 快速隨機交換演算法之過程......................................................29 3.3 混合引力行為與隨機交換之演算法流程圖..........................................31 3.4 複雜度分析..............................................................................................32 第四章 實驗結果......................................................................................................33 4.1 真實資料集(Real Datasets)......................................................................35 4.1.1 實驗一(Congressional Vote Dataset).......................................35 4.1.2 實驗二(Mushroom Dataset).....................................................37 4.2 文件資料集(Document Datasets)........................................................39 4 4.2.1 實驗一(Reuters-21578 Dataset)...............................................39 4.2.2 實驗二(20 Newsgroups Dataset)..............................................42 4.3 實驗結果討論............................................................................................44 第五章 結論與未來展望......................................................................................46 5.1 結論............................................................................................................46 5.2 未來工作....................................................................................................47 Reference:.................................................................................................................48 中文文獻參考:..........................................................................................................51778353 bytesapplication/pdfen-US資料分群類別性資料Categorical Data ClusteringGravitational Algorithm混合「引力行為與隨機交換」演算法處理類別性資料分群之研究Categorical Data Clustering Using Gravitational Algorithm and Randomized Techniquethesishttp://ntur.lib.ntu.edu.tw/bitstream/246246/54121/1/ntu-95-R93922082-1.pdf