Predicting Protein-Protein Interactions with a Network-based Motif Miner (基於結合模序配對探勘之蛋白質交互作用預測)

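Contributor: 歐陽彥正
Institution: 臺灣大學:生醫電子與資訊學研究所 (National Taiwan University, Graduate Institute of Biomedical Electronics and Bioinformatics)
Author: 游棨元 (Yu, Chi-Yuan)
Record dates: 2010-05-26; 2018-07-05
Year: 2009
Identifier: U0001-0307200910243500
URI: http://ntur.lib.ntu.edu.tw//handle/246246/184196

Abstract (Chinese, in translation):

Protein-protein interactions underlie the functions that living organisms carry out. Studying them reveals the basic principles of how cells operate, which in turn supports drug development and design and the treatment of disease, so understanding protein-protein interactions is important in both basic and clinical research. Because verifying protein-protein interactions through biological experiments costs a great deal of time and money, developing computational methods that reduce the resources consumed is one of the foremost tasks in systems biology today.

Among computational methods, motif-based methods use pattern mining algorithms to find binding motifs and then predict protein-protein interactions through pattern matching. Their advantage is that only protein sequences are needed for analysis and prediction, and that the regions where the proteins interact can be identified. Traditional motif-based methods mostly rely on the principle that functional sequences are highly conserved among proteins of a homologous family: they mine patterns from homologous protein sequences in an attempt to uncover binding motifs. However, the conserved sequences found this way are not necessarily related to interaction; they may instead maintain structure or serve other functions. To overcome this problem, researchers have in recent years proposed mining patterns from proteins that share the same interaction mechanism, and collecting the proteins needed for mining by detecting all-versus-all interaction networks. Recently published studies, however, have not applied the binding motifs mined in this way to the prediction of protein-protein interactions, which is needed to assess precisely how well this mechanism works and how the choice of pattern mining algorithm affects it.

This study proposes predicting protein-protein interactions with an advanced protein sequence pattern mining algorithm combined with all-versus-all interaction networks. A distinctive feature of this algorithm is its ability to mine long patterns composed of several short patterns, which fits the fact that a protein binding interface usually consists of many sequence segments. The results show that the pattern mining algorithm significantly affects both binding motif discovery and protein-protein interaction prediction. The advanced algorithm adopted in this study outperforms the other methods in both the correctness of the binding motifs and the accuracy of protein-protein interaction prediction.

Abstract:

Protein-protein interactions (PPIs) are essential to various biological functions in living organisms. Studying PPIs not only provides critical clues for understanding how a cell operates but may also lead to the development of advanced diagnoses and therapies. Because confirming protein-protein interactions with molecular biology experiments requires large amounts of time and resources, the design of computational approaches to predict possible protein-protein interactions is of scientific significance for advances in systems biology. One existing approach to predicting protein-protein interactions is based on binding motifs extracted by pattern mining algorithms. Motif-based approaches are favored by biologists who want to analyze in depth how the proteins of interest interact, rather than only learning whether they interact.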
With respect to motif-based prediction of protein-protein interactions, there are two major categories of approaches. One relies solely on analysis of polypeptide sequences, while the other also draws on the tertiary structures of proteins. Because tertiary structures are still available only for certain groups of proteins, sequence-based approaches are more generally applicable. Conventional motif-based approaches extract binding motifs by identifying evolutionarily conserved regions in polypeptide sequences. However, evolutionary conservation is a necessary but not a sufficient condition for the presence of interaction sites: certain regions in a protein chain may be conserved in order to maintain a conformation. Therefore, in recent years, researchers have proposed identifying protein-protein interaction motifs through analysis of interaction networks. Nevertheless, recent studies have not reported a comprehensive analysis of the quality of the interaction motifs identified, let alone the effects of alternative pattern mining algorithms. The study reported in this thesis follows this recent development and employs a state-of-the-art pattern mining algorithm to deliver superior performance in identifying protein-protein interaction motifs. The most distinctive feature of the pattern mining algorithm employed is its ability to identify patterns composed of several short gapped segments.
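Experimental results show that the predictor designed in this study outperforms predictors that incorporate other pattern mining algorithms.

Table of Contents:

Thesis Oral Defense Committee Certification
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Proteins and Their Interactions
  2.2 Protein-Protein Interaction Databases
  2.3 Computational Methods for Predicting Protein-Protein Interactions
    2.3.1 Genomic Methods
    2.3.2 Evolutionary Relationship Methods
    2.3.3 Protein Structure Methods
    2.3.4 Domain-Based Methods
    2.3.5 Protein Primary Structure Methods
      2.3.5.1 Classifier-Based Methods
      2.3.5.2 Motif-Based Methods
Chapter 3 Methods
  3.1 Datasets
  3.2 Proposed Method
    3.2.1 All-versus-All Interaction Network Mining
    3.2.2 Mining Interacting Binding Motif Pairs
      3.2.2.1 Wildspan
      3.2.2.2 Protomat
      3.2.2.3 Pratt
    3.2.3 Matching Interacting Protein Pairs
Chapter 4 Experimental Results and Discussion
  4.1 Evaluation of Protein-Protein Interaction Prediction Performance and Correctness of the Mined Motifs
    4.1.1 Evaluation of Protein-Protein Interaction Prediction Performance
    4.1.2 Analysis of the Correctness of the Mined Motif Pairs
  4.2 Comparison of Two Training-Set Collection Methods: Homologous Proteins and All-versus-All Interaction Network Detection
    4.2.1 Differences Between the Protein Groups Collected by Homologous Proteins and by All-versus-All Interaction Network Detection
    4.2.2 Comparison of Protein-Protein Interaction Prediction Performance
    4.2.3 Analysis of the Correctness of the Mined Motif Pairs
  4.3 Confidence Analysis of Predicted Interacting Protein Pairs
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References

List of Figures:

Figure 1 All-versus-all interaction network
Figure 2 Central dogma of molecular biology
Figure 3 Genomic methods
Figure 4 Evolutionary relationship methods
Figure 5 Protein structure methods
Figure 6 Domain-based methods
Figure 7 Protein primary structure methods
Figure 8 Workflow of the proposed method
Figure 9 All-versus-all interaction network mining
Figure 10 Frequent itemset mining algorithm (Apriori)
Figure 11 Mining interacting binding motif pairs
Figure 12 Patterns produced by Wildspan
Figure 13 Patterns produced by Protomat
Figure 14 Patterns produced by Pratt
Figure 15 Matching interacting protein pairs
Figure 16 Relationship between protein pairs and all-versus-all interaction networks
Figure 17 Distribution of distance scores for interacting protein pairs not in the training set
Figure 18 Locations of the binding motif pair on protein complex 1U7F
Figure 19 Locations of the motif pair mined from homologous proteins on protein complex 1U7F
Figure 20 Confidence distribution of Gene Ontology scores

List of Tables:

Table 1 The twenty amino acids
Table 2 Experimental methods for verifying interactions
Table 3 Protein-protein interaction databases
Table 4 Datasets
Table 5 Training sets generated with different datasets and parameters
Table 6 Protein-protein interaction prediction performance of different pattern mining algorithms
Table 7 Average within-group sequence similarity
Table 8 Protein-protein interaction prediction performance of training sets collected by homologous proteins and by all-versus-all interaction network detection

Format: application/pdf
Size: 2423131 bytes
Language: en-US
Keywords (Chinese): 蛋白質交互作用; 樣式探勘; 結合模序; 蛋白質序列; 全對全作用網路
Keywords: protein-protein interaction; pattern mining; binding motif; protein sequence; all-versus-all interaction network
Title (English): Predicting Protein-Protein Interactions with a Network-based Motif Miner
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/184196/1/ntu-98-R96945017-1.pdf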
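
The abstract describes predicting interactions by matching mined binding motif pairs against protein sequences, where each motif may consist of several short segments separated by gaps. The following is a minimal Python sketch of that matching step only. It is not Wildspan, Protomat, or Pratt, and it does not reproduce the thesis's scoring. The function names, the `x` wildcard convention, the `max_gap` parameter, and the rule "any matching motif pair implies a predicted interaction" are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def compile_gapped_motif(segments: List[str], max_gap: int) -> "re.Pattern[str]":
    """Compile short segments into one regex, allowing 0..max_gap residues
    between consecutive segments.

    Assumption: a lowercase 'x' inside a segment matches any single residue.
    """
    parts = [
        "".join("." if ch == "x" else re.escape(ch) for ch in seg)
        for seg in segments
    ]
    gap = ".{0,%d}" % max_gap
    return re.compile(gap.join(parts))


def predict_interaction(
    seq_a: str,
    seq_b: str,
    motif_pairs: List[Tuple["re.Pattern[str]", "re.Pattern[str]"]],
) -> bool:
    """Return True if some motif pair matches the two sequences, in either order.

    Assumption: a single matching pair is enough to predict an interaction.
    """
    for left, right in motif_pairs:
        forward = left.search(seq_a) and right.search(seq_b)
        backward = left.search(seq_b) and right.search(seq_a)
        if forward or backward:
            return True
    return False


# Hypothetical motif pair and toy sequences, invented for this sketch.
motif_left = compile_gapped_motif(["HxC", "GK"], max_gap=3)
motif_right = compile_gapped_motif(["WW", "DE"], max_gap=2)
pairs = [(motif_left, motif_right)]

print(predict_interaction("MAHQCAAGKLL", "TTWWSDEPP", pairs))
print(predict_interaction("MAHQCAAGKLL", "MKTLLVAG", pairs))
```

Because each motif is located by a regular-expression search, the matched span also indicates where in the sequence the candidate binding region lies, which is the property the abstract highlights for motif-based approaches.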