陳銘憲臺灣大學:電機工程學研究所朱怡虹Chu, Yi-HongYi-HongChu2010-07-012018-07-062010-07-012018-07-062009U0001-2007200912504600http://ntur.lib.ntu.edu.tw//handle/246246/188019資料叢集分析是資料探勘領域中一項非常重要的技術。很多資料叢集演算法採用以密度為量的方式去挖掘資料叢集:它們將資料叢集視為空間中密度很高的區域。然而,我們發現兩個很重要但尚未被充分解決的問題: 如何挖掘隱藏在時間裡的資料叢集,如何挖掘隱在空間裡的資料叢集。我們將依序探討這兩個問題並設計相關的資料叢集技術。先,我們研究具時間感知之資料叢集技術。我們發現之前的資料叢集演算法大多是全部的資料下找尋資料叢集,然而資料的特性會隨著時間做改變,有些資料叢集只會存於某個時間區間裡,而當我們是在整體資料下挖掘資料叢集時這些隱藏在時間區間裡的料叢集反而不會被挖掘出來。因此,能讓使用者可以指定不同的時間區間來挖掘資料叢是非常重要的。有鑒於此,我們設計了Temporal cluster query 這個問題來讓使用者在不時間區間找尋資料叢集。然而,因為欲挖掘資料叢集的時段無法事先知道,若要使用之的演算法就只能等到使用者提出查詢後再執行這些演算法,這樣的方法非常沒效率且不合即時互動式的查詢環境。因此,我們再提出了一個新穎的QEC 架構來有效率的執行emporal cluster query。下來,我們研究具空間感知之資料叢集技術。對於高維度資料,歷史文獻中的研究而去找隱藏在子空間的資料叢集。然而我們發現到之前的子空間叢集挖掘演算法會找出量的子空間資料叢集,但這些叢集有很嚴重的資訊重複的問題。而且因為有些存在於低度叢集的資料並沒有被包含在任何高維度資料叢集內,因此若直接刪除那些與高維度叢有很大資訊重複的低維度叢集,將可能導致遺失了那些存在於被刪除的叢集中但卻沒有包含在高維度叢集中的資料的資訊。為了解決這個問題,我們提出了NORSC演算法來自的找出一組精簡的子空間資料叢集但同時又可以維持一定的資料涵蓋程度。後,我們更進一步的提出一個新的子空間資料叢集探勘模型。因為我們發現到之前子空間叢集挖掘演算法對於不同子空間維度下的資料叢集無法同時得到很好的品質,主的原因是因為他們忽略了一個很重要的現象:不同子空間維度裡的資料叢集會有不同的度。因此我們設計了一個新的子空間資料叢集探勘模型,我們將採用不同的密度參數來不同子空間維度裡的資料叢集。但是這新模型定義下的資料叢集將不再滿足單一性特性Monotonicity property),因此我們無法直接利用之前演算法根據單一性特性所設計的類priori技術來找新模性定義下的資料叢集。有鑒於此,我們更進一步的設計了DENCOS演法來有效率的挖掘新模型定義下的資料叢集。Data clustering has been recognized as an important and valuable technique in data mining field. Mostf clustering works adopt density-based approach to discover the clusters, where clusters are regardeds high-density regions in the space. In this dissertation, we explore two important but not well solvedopics in data clustering, i.e. discovering time-aware clusters and discovering space-aware clusters, andevise innovative clustering techniques to deal with these topics.e first devise time-aware clustering techniques. We note that most of previous clustering worksreat all data as one large segment and execute the clustering task over the entire database. However,he characteristics of the data may change over time. Some dense regions (clusters) may only exist inertain time intervals but will not be discovered if taking all data records into account. Thus, discoveringlusters over different time intervals is very important for users to get the interesting patterns hiddenn data. In view of this, we explore in this dissertation a novel problem, called temporal cluster query,o address the cluster discovery in constrained time intervals, where users can specify varying timentervals to discover the clusters. Since the queried time intervals are unknown in advance, the directxtension of previous clustering works would be to delay the cluster discovery until the user querieshe data set, which is, however, inefficient for an interactive query environment. Thus, we also devisen innovative framework, called QEC, to efficiently execute temporal cluster queries.fter that, we devise space-aware clustering techniques. For high-dimensional data, research advancen the literature turns to discover the clusters hidden in subspaces. However, we find that previousubspace clustering works will discover a large amount of subspace clusters but there exists large informationedundancy in the clustering result. In addition, since some data points in a lower-dimensionalluster may not be members of any higher-dimensional cluster, directly removing a cluster having largenformation overlapping with higher-dimensional clusters may cause the loss of information of thoseata points which are contained in this cluster but cannot be found in higher-dimensional clusters. Toemedy this, we devise the NORSC algorithm to automatically discover a succinct collection of subspacelusters while also maintaining the required degree of data coverage.urthermore, we devise a novel subspace clustering model to discover the subspace clusters. Wexplore that previous works are difficult to discover high qualities of the clusters in all subspaces sincehey lack of considering a critical problem, which is that densities vary in different subspace cardinalities.hus, we devise a new subspace clustering model, where different density thresholds wille utilized to discover the dense regions (clusters) in different subspace cardinalities. However, sincehe monotonicity property of the dense regions no longer exists in our subspace clustering model, thepriori-like generate-and-test scheme adopted in most previous works to constrain the searching ofense regions is infeasible in our subspace clustering model. For this challenge, we also devise thelgorithm DENCOS to efficiently discover the clusters based on this novel clustering model.1 Introduction 1.1 MotivationandOverviewof theDissertation . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 DiscoveringClusters inConstrainedTime Intervals . . . . . . . . . . . . . . . 2.1.2 Reducing Redundancy in Subspace Clustering . . . . . . . . . . . . . . . . . 3.1.3 DiscoveringDensityConsciousSubspaceClusters . . . . . . . . . . . . . . . 5.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Discovering Clusters in Constrained Time Intervals 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3 Temporal Cluster Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.2 Overview of the QEC Framework . . . . . . . . . . . . . . . . . . . . . . . . 12.4 OfflineMaintainingPhase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4.1 UniformRegion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4.2 Algorithm for Constructing the UR-tree . . . . . . . . . . . . . . . . . . . . . 17.5 Online Query Processing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.5.1 Algorithm for Combining the UR-trees . . . . . . . . . . . . . . . . . . . . . 19.5.2 Execute the Temporal Cluster Query . . . . . . . . . . . . . . . . . . . . . . . 23.5.3 Analysis of the Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . 23.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.6.1 QualityofUR-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.6.2 AccuracyofQEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.6.3 Time Efficiency of QEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Reducing Redundancy in Subspace Clustering 30.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30.1.1 SubspaceClusteringinHigh-DimensionalData . . . . . . . . . . . . . . . . . 30.1.2 Motivation of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.2 Relatedwork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36.3 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37.3.1 Redundant Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37.3.2 Identification of Redundant Cluster . . . . . . . . . . . . . . . . . . . . . . . 40.4 The NORSC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.4.1 Mining Maximal Dense Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.4.2 Mining Non-Redundant Cluster . . . . . . . . . . . . . . . . . . . . . . . . . 49.4.3 Discussion on NORSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55.5.1 Study on Data Coverage Problem . . . . . . . . . . . . . . . . . . . . . . . . 56.5.2 AccuracyofNORSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57.5.3 Time Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59.5.4 SpaceEfficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.5.5 Discuss on Redundancy Threshold . . . . . . . . . . . . . . . . . . . . . . . . 62.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Discovering Density Conscious Subspace Clusters 65.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65.2 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68.3 Density Conscious Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 70.3.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70.4 TheDENCOSAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4.1 Preprocessing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.4.2 DiscoveringPhase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85.5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85.5.2 Algorithm Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86.5.3 Algorithm Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Conclusions 971001864 bytesapplication/pdfen-US資料探勘資料叢集時間空間子空間資料叢集Data MiningData ClusteringTimeSpaceSubspace Clustering具時間和空間感知之資料叢集技術Time-Aware and Space-Aware Data Clustering Techniquesthesishttp://ntur.lib.ntu.edu.tw/bitstream/246246/188019/1/ntu-98-F91921022-1.pdf