Using Hamming distance for SNP sets clustering analysis
Date Issued
2011
Date
2011
Author(s)
Kao, Wen-Hsin
Abstract
Background
With recent advances in laboratory technology, scientists can genotype thousands or millions of markers for genetic association studies of complex diseases. This large number of markers makes analysis difficult, so effectively reducing the dimension for further analysis becomes an important issue. Dimension reduction has another advantage: once SNPs sharing similar features are clustered into one group, the small effect of each single SNP is less likely to be overlooked. Such clustering may help laboratory scientists identify novel associations between markers and disease, and may aid biological interpretation. The aim of this study is to provide a suitable clustering method for SNP observations.
Materials and methods
Hamming distance is a simple and popular dissimilarity measure for string data. Based on Hamming distance, we propose three dissimilarity measures to represent the distance between two SNP clusters. We then use these measures in a clustering algorithm, specifically hierarchical clustering, chosen because it better explains subgroup structures, to create a tree structure called a dendrogram for the data under study. To evaluate the performance of our approaches, we simulate SNP genotypes based on the coronary artery disease (CAD) study from the Wellcome Trust Case Control Consortium (WTCCC). We use accuracy, sensitivity, specificity, the adjusted Rand index, and normalized mutual information as criteria for comparison with existing methods.
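The idea of combining Hamming distance with hierarchical clustering can be illustrated with a minimal sketch. The abstract does not specify the three proposed cluster-level measures, so the mean pairwise Hamming distance used here (`cluster_dissimilarity`) is only an assumed stand-in, and the function names, the 0/1/2 genotype coding, and the toy data are all hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count the individuals at which two SNP genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("SNP vectors must cover the same individuals")
    return sum(x != y for x, y in zip(a, b))


def cluster_dissimilarity(c1, c2, snps):
    """Mean pairwise Hamming distance between two clusters of SNP indices.

    Assumption: this is an illustrative measure, not one of the three
    measures proposed in the thesis.
    """
    total = sum(hamming(snps[i], snps[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(snps):
    """Naive bottom-up clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(snps))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest dissimilarity.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_dissimilarity(pair[0], pair[1], snps),
        )
        height = cluster_dissimilarity(a, b, snps)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Toy data: 4 SNPs (rows) genotyped on 6 individuals (columns), coded 0/1/2.
snps = [
    [0, 1, 2, 0, 1, 0],
    [0, 1, 2, 0, 1, 1],
    [2, 0, 0, 1, 2, 2],
    [2, 0, 0, 1, 2, 0],
]
merges = agglomerate(snps)
for a, b, height in merges:
    print(a, b, height)
```

The recorded merge heights are what a dendrogram would plot on its vertical axis; in this toy data SNPs 0 and 1 merge first, then SNPs 2 and 3, and the two pairs join last at a much larger height.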
Results
We propose a hierarchical clustering method for SNP sets based on Hamming distance. The simulation studies show that our approaches perform as well as or better than those proposed in the literature. When the number of clusters is unknown and needs to be determined, we recommend using the maximum difference between adjacent dissimilarity measures as the threshold.
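One possible reading of the threshold recommendation is sketched below: given the merge heights of a dendrogram in ascending order, cut the tree inside the largest gap between consecutive heights. The abstract gives no further detail, so the function name `cut_by_largest_gap`, the choice of the gap midpoint as the threshold, and the example heights are assumptions.

```python
def cut_by_largest_gap(heights):
    """Choose a cut from ascending merge heights (n - 1 heights for n items).

    Returns (threshold, n_clusters), where the threshold is the midpoint
    of the largest gap between adjacent merge heights.
    """
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    gaps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    threshold = (heights[k] + heights[k + 1]) / 2
    # Merges 0..k fall below the threshold; each merge removes one cluster.
    n_items = len(heights) + 1
    return threshold, n_items - (k + 1)


# Hypothetical merge heights for four SNPs.
heights = [1.0, 1.0, 5.75]
threshold, n_clusters = cut_by_largest_gap(heights)
print(threshold, n_clusters)
```

The sketch assumes the heights never decrease from one merge to the next, which holds for common linkages such as single, complete, and average linkage.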
Discussion
Our proposal uses Hamming distance to measure the similarity between SNP strings. This is similar to linkage disequilibrium (LD) but focuses more on inter-personal similarity. The approaches can be extended to other modes of inheritance by changing the coding of SNP genotypes.
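To illustrate how the genotype coding affects the distance, the sketch below recodes minor-allele counts under three conventional codings and recomputes the Hamming distance. The abstract does not state which codings the thesis uses, so the `CODINGS` table, the function names, and the example vectors are assumptions.

```python
# Conventional recodings of the minor-allele count (0, 1, or 2 copies).
CODINGS = {
    "additive": {0: 0, 1: 1, 2: 2},
    "dominant": {0: 0, 1: 1, 2: 1},   # carrying at least one copy
    "recessive": {0: 0, 1: 0, 2: 1},  # carrying two copies
}


def recode(genotypes, mode):
    """Map allele counts to the coding for the given mode of inheritance."""
    return [CODINGS[mode][g] for g in genotypes]


def hamming(a, b):
    """Count the individuals at which two coded SNP vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical SNPs genotyped on the same five individuals.
snp_a = [0, 1, 2, 1, 0]
snp_b = [0, 2, 2, 1, 1]

distances = {
    mode: hamming(recode(snp_a, mode), recode(snp_b, mode)) for mode in CODINGS
}
print(distances)
```

Two SNPs that differ under one coding can coincide under another, so the resulting clusters depend on the mode of inheritance assumed.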
Subjects
Adjusted Rand index
Clustering
Genetic association study
Hamming distance
Hierarchical clustering
Normalized mutual information
Type
thesis
File(s)
Name
ntu-100-R97842024-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):663126cb35965043c7fb90d1ce34dd64
