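Contributor: 歐陽彥正
Affiliation: 臺灣大學:資訊工程學研究所 (National Taiwan University, Graduate Institute of Computer Science and Information Engineering)
Author: 蘇中才 (Su, Chung-Tsai)
Record dates: 2007-11-26; 2018-07-05
Year: 2007
URI: http://ntur.lib.ntu.edu.tw//handle/246246/54016

Abstract (translated from the Chinese):
In a wide variety of organisms, a growing number of proteins have been found to contain structurally disordered regions, and most of these regions are related to protein function. Earlier studies proposed that a protein's disordered regions can be predicted directly from its primary sequence (the order of its amino acids), mainly because charged amino acids occur more often in disordered regions, while hydrophobic amino acids usually occur more often in ordered regions. Recent studies further indicate that using evolutionary information, such as position specific scoring matrices, improves the accuracy of predicting disordered regions. As more and more machine learning techniques are applied to this problem, searching for features with more biological meaning is another direction worth studying.
This thesis first analyzes how a condensed position specific scoring matrix grouped by physicochemical property affects prediction accuracy. The condensed matrix is obtained by summing the values of the amino acids in a position specific scoring matrix that share a property into a single feature for that property. Next, using the protein set we collected, we analyze whether each of the twenty amino acids tends to occur in ordered or in disordered regions. Based on this analysis, each conventional physicochemical property can be split into two features, one with a propensity for order and one with a propensity for disorder. The experimental results indicate that features derived this way predict disordered regions better than the conventional physicochemical properties alone. Although this analysis yields more features, some of them contain only one or two amino acids and therefore carry little information. To obtain a more useful feature set, we adopt a hybrid feature selection method that combines univariate analysis with multi-stage feature selection. With the QuickRBF classifier, this feature set gives higher accuracy than feature sets built from position specific scoring matrices or from condensed matrices based on conventional physicochemical properties. Distinguishing ordered from disordered regions using only the protein sequence is useful for studying protein structure and function, and the proposed method identifies most disordered regions effectively. Unfortunately, it also misclassifies too many ordered regions as disordered. We therefore propose a two-stage classification method, called DisPSSMP2, to reduce these misclassifications and thereby improve prediction accuracy.
To study the function and possible structure of disordered regions, the final part of the thesis introduces a web service named iPDA that integrates the proposed method with several protein sequence analysis tools. iPDA displays the results of the various analysis tools graphically and also provides a pattern mining tool for sequence conservation that detects which positions in a protein are potential binding regions, to assist research that discovers functional regions from the protein sequence.

Abstract (English):
More and more disordered regions have been discovered in protein sequences, and many of them are found to be functionally significant. Previous studies reveal that the disordered regions of a protein can be predicted from its primary structure, i.e. the amino acid sequence. One widely accepted observation is that disordered regions have a compositional bias toward charged amino acids, while ordered regions usually have a compositional bias toward hydrophobic amino acids. Recent studies further show that employing evolutionary information such as position specific scoring matrices (PSSMs) improves the prediction accuracy of protein disorder. As more and more machine learning techniques are introduced to protein disorder detection, extracting more useful features with biological insight deserves attention.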
This thesis first studies the effect on prediction accuracy of a position specific scoring matrix condensed with respect to physicochemical properties (PSSMP), where the PSSMP is derived by merging the amino acid columns of a PSSM that belong to a given property into a single column. Next, we decompose each conventional physicochemical property of amino acids into two disjoint groups, one with a propensity for order and one with a propensity for disorder, and show by experiment that some of the new properties perform better than their parent properties in predicting protein disorder. To obtain an effective and compact feature set for this problem, we propose a hybrid feature selection method that combines the efficiency of univariate analysis with the effectiveness of stepwise feature selection, which explores combinations of multiple features. The experimental results show that the selected feature set improves the performance of a classifier built with Radial Basis Function Networks (RBFN), compared with feature sets constructed from PSSMs or from PSSMPs that use only the conventional physicochemical properties. Distinguishing disordered regions from ordered regions in protein sequences facilitates the exploration of protein structures and functions. However, the proposed predictor still suffers from a large number of false positives on real data. Therefore, we introduce a two-stage RBFN classifier, named DisPSSMP2, which improves on DisPSSMP by reducing the number of false positives. Finally, this thesis presents the web server iPDA, which integrates the proposed classifier with several other sequence predictors in order to investigate the functional role of the detected disordered regions.
In iPDA, a pattern mining package for detecting sequence conservation is embedded to discover potential binding regions of the query protein, which helps uncover the relationship between a protein's function and its primary sequence.

Table of contents:
Chapter 1 Introduction
Chapter 2 Related work of protein disorder
  2.1 The extended central dogma of molecular biology
  2.2 Intrinsically unstructured proteins (IUPs)
  2.3 Databases of IUPs
  2.4 Predictors of disorder
  2.5 Overview of IUPs
  2.6 Predictions for Calcineurin
Chapter 3 Methods
  3.1 Datasets
  3.2 Construction of PSSMP
  3.3 Considering propensity for order or disorder
  3.4 Classifier
  3.5 Feature selection
Chapter 4 Results and discussions
  4.1 Evaluation measures
  4.2 Feature selection by cross-validation
  4.3 Suggestion of window size
  4.4 Results on testing data
  4.5 Comparison with existing packages
  4.6 Property-based sequential patterns
Chapter 5 A two-stage RBFN classifier
  5.1 Enlargement of training sets
  5.2 The modification of the classifier
  5.3 Architecture of the two-stage classifier
  5.4 Additional evaluation measures
  5.5 Cross-validation of DisPSSMP series
  5.6 Results and discussions
Chapter 6 An application of protein disorder
  6.1 Introduction
  6.2 Methods
  6.3 Results and discussions
  6.4 Summary
Chapter 7 Conclusions and future work
References

File size: 1422368 bytes
Format: application/pdf
Language: en-US
Keywords: Protein disorder; Protein disorder prediction; Radial basis function network; Feature selection
Title: 蛋白質非穩定區段之預測與特性研究 (Prediction and Characterization of Intrinsically Unstructured Proteins)
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/54016/1/ntu-96-D89922007-1.pdf
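
Illustrative note: the abstract describes the PSSMP as a PSSM whose amino acid columns sharing a physicochemical property are merged (summed) into a single column. The following is a minimal sketch of that idea, not the thesis's implementation. The column order is the one commonly used in PSI-BLAST PSSM output, and the two property groups and the function name `condense_pssm` are hypothetical placeholders; the thesis defines its own property sets.

```python
# Column order commonly used for the 20 standard amino acids in PSI-BLAST PSSMs.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical property groups, for illustration only (not taken from the thesis).
PROPERTY_GROUPS = {
    "hydrophobic": "AILMFVW",
    "charged": "DEKR",
}


def condense_pssm(pssm, groups=PROPERTY_GROUPS):
    """Merge PSSM columns by property.

    pssm: list of rows, one per residue position, each with 20 scores
          in AMINO_ACIDS order.
    Returns one row per position with one summed score per property group,
    in the iteration order of `groups`.
    """
    column = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    condensed = []
    for row in pssm:
        if len(row) != len(AMINO_ACIDS):
            raise ValueError("each PSSM row must contain 20 scores")
        condensed.append(
            [sum(row[column[aa]] for aa in members) for members in groups.values()]
        )
    return condensed
```

Each position's 20 scores shrink to one score per property, so a sliding window over the sequence would produce a much shorter feature vector than the full PSSM; how the thesis windows and normalizes these values is not stated in this record.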