拾玉 Research Project: Constructing a Probabilistic Distance Measure for Cluster Analysis of Mixed-Type Data

2023-08-01
2024-05-13
https://scholars.lib.ntu.edu.tw/handle/123456789/652166

Clustering is an unsupervised learning method for discovering the group structure inherent in an unlabeled dataset. It has been widely used in areas such as data mining, text mining, bioinformatics, and machine learning. Most existing algorithms are designed for datasets composed solely of numerical attributes or solely of categorical attributes. In practice, however, datasets often mix numerical and categorical attributes measured on different scales; a whole-genome database, for example, contains quantitative gene expression values alongside categorical clinical features such as gender and disease stage. A number of algorithms have been developed to discover the group patterns inherent in such mixed-type data, and one of the most commonly used is the K-prototypes algorithm. Nevertheless, compared with clustering algorithms for numerical data, there is still considerable room for improvement in clustering mixed-type data. The main objective of this study is to develop a data-representation scheme (a data coordinate system) for mixed-type data that maps a set of mixed-attribute data into Euclidean space. The proposed measure expresses the similarity between two mixed-type data points as the probability that the observed attributes can be seen in a random sample of two data points. The study will also consider combining different clustering algorithms with the proposed coordinate system.
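For orientation, the K-prototypes baseline named above is commonly described as scoring a data point against a cluster prototype by adding a squared Euclidean distance over the numerical attributes to a weighted count of mismatches over the categorical attributes. The sketch below illustrates that baseline dissimilarity only; it is not the probabilistic measure proposed in this project, whose details the abstract does not give. The function name, the weight `gamma`, and the sample record are assumptions made for illustration.

```python
from typing import Sequence


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Mixed-type dissimilarity in the style of K-prototypes (illustrative).

    gamma is an assumed tuning weight that balances the categorical
    mismatch count against the numerical squared distance.
    """
    # Squared Euclidean distance over the numerical attributes.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Simple matching: count categorical attributes that differ.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Hypothetical record: two expression values plus gender and disease stage.
patient_num, patient_cat = [2.0, 0.5], ["female", "stage II"]
proto_num, proto_cat = [1.0, 0.5], ["female", "stage I"]

d = k_prototypes_dissimilarity(
    patient_num, patient_cat, proto_num, proto_cat, gamma=0.5
)
print(d)  # 1.0 (numerical) + 0.5 * 1 (one categorical mismatch) = 1.5
```

The result depends on the choice of `gamma` and on the scale of the numerical attributes.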
The second goal of this proposal is to investigate the performance of mixed-type data clustering algorithms, using empirical and simulated data to characterize where the competing methods agree and where they depart. Although many algorithms have been proposed for clustering mixed-type data, no comprehensive comparison of them currently exists. In this study, we will cover a selection of objective function-based clustering algorithms.

Keywords: clustering; unsupervised learning; mixed-type data; K-prototypes algorithm; data-representation scheme; Euclidean space; objective function-based clustering algorithms