Performance Evaluations of Clustering Algorithms for Categorical Variables, Illustrated with SNPs
Date Issued
2016
Date
2016
Author(s)
Chen, Hung-Che
Abstract
Digital data and information are being generated at an ever-increasing rate in modern life, and dealing with such large amounts of information has become an important issue for scientists. One way to reduce large volumes of data is clustering. Clustering algorithms have a long history of development, but most of them target continuous observations, such as age and weight. Few algorithms have been proposed for categorical data, and fewer still for large categorical data sets. In this paper we evaluate the performance of clustering algorithms for categorical variables. Specifically, we compare three algorithms: K-modes, Hamming distance-based clustering (HD cluster), and RObust Clustering using linKs (ROCK). We investigate how their performance is affected by the frequencies of the variables and by the correlation between variables. The evaluation criteria are the Rand Index (RI), the Adjusted Rand Index (ARI), Number in Wrong Clusters, C-impurity, and Normalized Mutual Information (NMI). Simulation studies are conducted to compare the three algorithms. The results show that HD cluster performs at least as well as the other two algorithms in all tested cases. Finally, we discuss limitations of and future directions for the HD cluster algorithm.
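As a rough illustration of two quantities named in the abstract, the sketch below computes the Hamming distance between categorical vectors and the Rand Index between two partitions, using their standard textbook definitions. It does not reproduce the HD cluster algorithm or the thesis's simulations. The function names, the 0/1/2 genotype coding for SNPs, and the toy data are all assumptions made for this example.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two objects
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical SNP genotypes, coded as the number of minor-allele copies (0, 1, 2).
# This coding is an assumption for the example, not taken from the thesis.
genotypes = [
    (0, 0, 1, 2),
    (0, 0, 1, 1),
    (2, 2, 0, 0),
    (2, 1, 0, 0),
]

true_clusters = [0, 0, 1, 1]       # assumed ground-truth grouping
predicted_clusters = [1, 1, 0, 1]  # assumed output of some clustering algorithm

d_near = hamming_distance(genotypes[0], genotypes[1])
d_far = hamming_distance(genotypes[0], genotypes[2])
ri = rand_index(true_clusters, predicted_clusters)
```

Here the first two genotype vectors differ at a single position, while the first and third differ at every position. The Rand Index depends only on which objects are grouped together, so relabeling the clusters (for example, swapping 0 and 1) leaves it unchanged.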
Subjects
large volume
categorical data
Type
thesis
File(s)
Name
ntu-105-R02849031-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):66138f48ffd632e7ec3fe5fe605da991
