A Protein Sequence Clustering Algorithm Based on Statistical Models
Date Issued
2004
Date
2004
Author(s)
Chung, Wen-Chin
DOI
zh-TW
Abstract
Protein sequence clustering groups homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used for this problem because it exploits the transitivity property to identify remote homologues and provides a dendrogram as its clustering result, which is useful for protein family analysis. However, because of the twilight zone embedded in the distribution of pair-wise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families, or for families with noisy relationships to members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of the dendrogram generated by the single-linkage clustering algorithm. First, by creating pair clusters, the algorithm allows a single protein to exist in distinct hierarchical paths of the dendrogram. Next, it employs the skewness and kurtosis indices to control the formation of subclusters, so that highly homologous clusters are generated at the bottom level of the dendrogram. Finally, selecting pivots of a subcluster for the subsequent clustering process avoids the chaining effect that the single-linkage algorithm can cause. The proposed algorithm can thus produce clusters with both high sensitivity and high specificity at the higher levels of the dendrogram. The experimental results in this thesis show that the hierarchy produced by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. The generated hierarchy can therefore provide automatic annotations for new proteins with higher accuracy than previous approaches.
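As an illustration only, the two building blocks named in the abstract can be sketched in a few lines of Python. This is not the thesis's algorithm: the pair clusters, the pivot selection, and the way the skewness and kurtosis indices actually gate subcluster formation are not described in this record. The function names, the similarity threshold, and the toy similarity scores below are all assumptions made for the example.

```python
from statistics import fmean


def single_linkage(n, sims, threshold):
    """Flat single-linkage clusters of items 0..n-1 (hypothetical helper).

    Two items share a cluster when a chain of pairs, each with
    similarity >= threshold, connects them (the transitivity property).
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in sims.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for item in range(n):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


def skewness_kurtosis(values):
    """Population skewness and excess kurtosis of a list of scores.

    Returns (0.0, 0.0) when all values are equal; that is a convention
    chosen for this sketch, since both indices are undefined there.
    """
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    if m2 == 0:
        return 0.0, 0.0
    m3 = fmean([(v - mean) ** 3 for v in values])
    m4 = fmean([(v - mean) ** 4 for v in values])
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Toy pair-wise similarities between five sequences (made-up values).
sims = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.3, (3, 4): 0.85, (2, 3): 0.2}
clusters = single_linkage(5, sims, threshold=0.5)
```

In this toy input, sequences 0 and 2 have a direct similarity below the threshold but still end up in the same cluster through sequence 1. That is the transitivity that helps find remote homologues, and it is also the mechanism behind the chaining effect the abstract mentions. The skewness and kurtosis of the similarity scores inside a candidate cluster could then be inspected with `skewness_kurtosis`, though how the thesis thresholds these indices is not stated here.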
Subjects
statistical models
hierarchy
clustering
protein sequence
Type
thesis
File(s)
Name
ntu-93-R91922093-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):83f70cf8588d2ee71e064ae759bf6fe7
