Predicting Protein Interface Regions from Protein Sequences (利用蛋白質序列預測蛋白質作用區段)

Contributor: 歐陽彥正
Contributor: 臺灣大學：資訊工程學研究所 (National Taiwan University, Graduate Institute of Computer Science and Information Engineering)
Author: 蕭軍律 (Hsiao, Chun-Lu)
Record dates: 2007-11-26; 2018-07-05
Year: 2006
Record URL: http://ntur.lib.ntu.edu.tw//handle/246246/53603

Abstract:
Protein interactions play an important role in bioinformatics because they mediate the assembly of macromolecular complexes and the sequential transfer of information along signaling pathways. To understand the molecular basis of these functions and of protein networks, it is important to identify the interface of each individual protein. In the post-genomic era, the number of known protein sequences is growing rapidly, but protein structures are solved at a much lower rate. Predicting the protein interface from sequence is therefore a meaningful problem, and under certain conditions the prediction can also help construct the conformation of a protein bound to another protein. In recent years, more and more studies have focused on predicting protein interfaces directly from protein sequences. Many of these methods were developed for specific kinds of protein interface, and few predict the interface without first classifying it into categories. This thesis investigates how to predict interface residues from the protein sequence without classifying the interface. We study the amino acid composition and functional conservation of interface residues, and add information about secondary structure. Finally, we use a radial basis function network, a machine learning classification algorithm, to predict the protein interface, and the experimental results show that these feature sets are useful for the prediction.

Table of contents:
Chapter 1 Introduction
1.1 The rise of protein interface prediction
1.2 Background
1.3 Motivation
1.4 Thesis organization
Chapter 2 Related work
2.1 Interface characteristics
2.1.2 Features that make up protein interfaces
2.1.3 Clustering effect of protein interfaces
2.1.4 Amino acid composition of interfaces
2.2 Overview of interface prediction methods
2.3 Sequence alignment tools
2.3.1 BLAST
2.3.2 BLOcks Substitution Matrix (BLOSUM)
2.3.3 PSI-BLAST
2.4 FSSP dataset
2.5 DSSP
2.6 Types of machine learning algorithms
2.6.1 Radial Basis Function Network
2.6.2 Support vector machine
2.6.3 KNN
2.6.4 Decision tree
2.7 Secondary structure
2.7.1 Secondary structure
Chapter 3 Method for predicting protein interfaces
3.1 Goal
3.2 Machine learning
3.2.2 Basic concepts of classification
3.2.3 Cross validation
3.3 Dataset
3.4 Definition of the interface
3.5 Features
3.5.1 Sequence identification
3.5.2 PSI-BLAST
3.5.3 Non-surface feature
3.5.4 Amino acid properties
3.5.5 Conservation
3.5.6 Secondary structure
3.6 Input data processing
3.6.1 Sampling
3.6.2 Window size
3.7 Feature selection
3.8 Classification method
3.9 Evaluation of prediction results
Chapter 4 Experimental results
4.1 Experiment 1: Predicting protein interfaces (whole protein)
4.2 Experiment 2: Comparison of sequence and R-PSSM
4.3 Experiment 3: Non-surface residue information
4.3.1 Experiment 3-1
4.3.2 Experiment 3-2
4.4 Experiment 4: Amino acid properties
4.4.1 Single-property analysis
4.4.2 Property tendencies
4.4.3 Multi-property analysis
4.4.4 Properties combined with NSR-PSSM
4.5 Amino acid conservation and secondary structure
4.6 Results
Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future work
References

File size: 540457 bytes
Format: application/pdf
Language: en-US
Keywords: 作用區段; interface residues
Title (English): Identification of Interface Residues Based on PSSM Profile and Biochemical Properties
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/53603/1/ntu-95-R93922133-1.pdf
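
Illustrative sketch (not taken from the thesis): the abstract and table of contents describe per-residue features gathered over a sequence window and classified with a radial basis function network. The Python below is a minimal sketch of that general idea under several assumptions: the profile is a list of per-residue score rows (for example PSSM rows), positions past the sequence ends are zero-padded, the hidden units are Gaussian, and the output layer is fit with the simple delta (LMS) rule. The function names, the window size, `gamma`, and the training rule are all hypothetical choices for illustration; the record does not state the settings actually used.

```python
import math


def window_features(profile, window=5):
    """Concatenate the profile rows in a window centred on each residue.

    profile: list of equal-length rows, one per residue (e.g. PSSM scores).
    Positions beyond either end of the sequence are zero-padded (assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    width = len(profile[0])
    pad = [[0.0] * width for _ in range(half)]
    padded = pad + [list(row) for row in profile] + pad
    features = []
    for i in range(len(profile)):
        vector = []
        for row in padded[i:i + window]:
            vector.extend(row)
        features.append(vector)
    return features


def rbf_activations(x, centers, gamma):
    """Gaussian hidden-layer outputs exp(-gamma * ||x - c||^2), one per centre."""
    return [
        math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))
        for c in centers
    ]


def train_output_weights(X, y, centers, gamma, lr=0.1, epochs=200):
    """Fit the linear output layer with the delta (LMS) rule.

    The last weight is a bias term. Centres are taken as given; how to
    choose them is left open in this sketch.
    """
    weights = [0.0] * (len(centers) + 1)
    for _ in range(epochs):
        for x, target in zip(X, y):
            hidden = rbf_activations(x, centers, gamma) + [1.0]
            error = target - sum(w * h for w, h in zip(weights, hidden))
            weights = [w + lr * error * h for w, h in zip(weights, hidden)]
    return weights


def predict(x, centers, gamma, weights, threshold=0.5):
    """Return 1 (interface) if the network output reaches the threshold, else 0."""
    hidden = rbf_activations(x, centers, gamma) + [1.0]
    score = sum(w * h for w, h in zip(weights, hidden))
    return 1 if score >= threshold else 0
```

In use, `window_features` would be applied to each protein's profile, and the resulting vectors, paired with interface or non-interface labels, passed to `train_output_weights`. Using a subset of the training vectors as centres is one common choice, but that too is an assumption here.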