A Hybrid Protein Sequence Clustering Algorithm Based on Statistical Models

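歐陽彥正
臺灣大學：資訊工程學研究所 (National Taiwan University, Graduate Institute of Computer Science and Information Engineering)
鐘文欽 (Chung, Wen-Chin)
2007-11-26; 2018-07-05; 2004
http://ntur.lib.ntu.edu.tw//handle/246246/53901

With the rapid growth of protein sequence databases, efficient tools for protein sequence analysis are needed. Sequence alignment, the most commonly used technique in protein sequence analysis, cannot effectively detect remote homology between proteins, because a twilight zone lies between sequence similarity and protein homology in which homology is hard to judge. Protein sequence clustering can use sequence similarity together with the properties of protein families to identify sets of homologous proteins. We propose a hierarchical protein sequence clustering algorithm based on statistical models. It refines the single-linkage clustering algorithm and retains its hierarchical nature so as to match the properties of protein families. First, pair clusters are created so that a single protein can appear in multiple paths of the clustering hierarchy. Next, skewness and kurtosis, two commonly used statistical measures, are used to find highly homogeneous protein clusters. Finally, pivots (representative points) are used to build the upper part of the hierarchy, which avoids the chaining effect and finds clusters of remotely related proteins. The algorithm was validated against the SwissProt and InterPro databases, using human proteins as the experimental set. It effectively builds highly homogeneous protein clusters, the final clustering agrees with the hierarchical structure of protein families in InterPro, and the chaining effect of the single-linkage algorithm is avoided. The clustering results reveal relationships between unannotated proteins and proteins with known annotations. With the cluster viewer we developed, the results can be examined from three perspectives: protein, cluster, and family.

Protein sequence clustering can group homologous proteins together based on pairwise sequence similarities. The conventional single-linkage clustering algorithm has been widely used for this problem because it exploits the transitivity property to identify remote homologues and produces a dendrogram as its clustering result, which is useful for protein family analysis. However, due to the twilight zone embedded in the distribution of pairwise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families, or for families with noisy relationships to members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of the dendrogram generated by the single-linkage clustering algorithm. By creating pair clusters, a single protein can exist in distinct hierarchical paths of a dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of a dendrogram. Finally, selecting pivots of a subcluster in the subsequent clustering process avoids the chaining effect that the single-linkage algorithm might cause.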
Thus the proposed algorithm can produce clusters with both high sensitivity and high specificity at the higher levels of a dendrogram. The experimental results in this thesis show that the hierarchy produced by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.

Chapter 1 Introduction
1.1 Needs and Applications of Protein Clustering
1.2 Properties of Protein Families
1.2.1 Homology
1.2.2 Transitivity of Homology
1.2.3 Parent-Child Relationships Between Families
1.2.4 Multi-family Proteins
1.3 Challenges and Solutions in Protein Clustering
1.4 Thesis Organization
Chapter 2 Related Work on Protein Clustering
2.1 The Single-Linkage Clustering Algorithm
2.2 PROTONET (2002)
2.3 GENERAGE (2000)
2.4 PROCLUST (2002)
2.5 Summary
Chapter 3 A Hybrid Protein Sequence Clustering Algorithm Based on Statistical Models
3.1 Protein Data Representation
3.2 The Algorithm
3.2.1 Reading Protein Data
3.2.2 Building Base Clusters (Pair Clusters and Singleton Clusters)
3.2.3 Building the Cluster Hierarchy with Skewness and Kurtosis
3.2.4 Building the Cluster Hierarchy with Pivots
3.3 Key Points of the Algorithm
Chapter 4 Implementation and Experimental Results
4.1 Implementation
4.1.1 Algorithm Implementation
4.1.2 Cluster Viewer Implementation
4.2 Experiments
4.2.1 Data Sets
4.2.2 Evaluation Method
4.2.3 Experimental Results
4.2.3.1 Experiment 1
4.2.3.2 Experiment 2
4.2.3.3 Analysis of Experimental Results
4.2.4 Clustering Examples
4.2.4.1 Example 1
4.2.4.2 Example 2
4.2.4.3 Example 3
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References

599771 bytes
application/pdf
en-US
statistical models; hierarchy; clustering; protein sequence
A Protein Sequence Clustering Algorithm Based on Statistical Models
thesis
http://ntur.lib.ntu.edu.tw/bitstream/246246/53901/1/ntu-93-R91922093-1.pdf
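
The abstract refers to two building blocks that a small example may make more concrete: threshold-based single-linkage clustering (and the chaining effect it can produce) and the skewness and kurtosis indices. The sketch below is only an illustration under stated assumptions and is not the thesis's algorithm. The function names, the toy similarity scores, and the threshold are all hypothetical, and the record does not say how the thesis applies skewness and kurtosis to control subcluster formation, so the code merely computes the two indices.

```python
from itertools import combinations


def single_linkage(items, sim, threshold):
    """Group items transitively: a and b share a cluster if a chain of
    pairs with similarity >= threshold connects them (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())


def skewness_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical toy similarities between four proteins (not thesis data).
SCORES = {
    ("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.7,
    ("A", "C"): 0.3, ("B", "D"): 0.2, ("A", "D"): 0.1,
}


def toy_sim(a, b):
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


# Chaining: A and D have similarity 0.1, yet the links A-B, B-C, C-D
# pull all four proteins into one cluster at threshold 0.6.
chained = single_linkage(["A", "B", "C", "D"], toy_sim, 0.6)
strict = single_linkage(["A", "B", "C", "D"], toy_sim, 0.85)
skew, kurt = skewness_kurtosis(list(SCORES.values()))
print(chained, strict, skew, kurt)
```

At the looser threshold the weakly related pair A and D ends up in the same cluster, which is the chaining effect the abstract says the pivot step is meant to avoid. At the stricter threshold only A and B are grouped.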