分子間三維子結構之比對與其應用於蛋白質功能分類

歐陽明臺灣大學：資訊工程學研究所陳建成Chen, Chien-ChengChien-ChengChen2007-11-262018-07-052007-11-262018-07-052004http://ntur.lib.ntu.edu.tw//handle/246246/54066藉由實驗的技術越來越多的蛋白質可取得其三維結構的資料，例如，著名的蛋白質資料庫PDB (Protein Data Bank)目前已儲存了超過 25,000 個蛋白質的三維結構，而且數量與日俱增。同時，也有很多分析分子結構與視覺化的工具被開發出來。然而，人們所關心的是：這些蛋白質的功能是什麼呢？是否與其它蛋白質具有類似的活化區域呢？蛋白質的功能與活化區域有著高度的相關性，蛋白質纏繞成複雜的三維結構，藉著其活化區域適當地與其他分子交互作用來實現它的功能。因此，如何藉由蛋白質的三維結構，進而得知蛋白質的功能及活化區域的位置，已變成一個研究上重要而受人矚目的問題。而面對這些課題，我們將它轉換成蛋白質三維部份結構比對的問題來解決。在本篇論文中，我們提出蛋白質三維子結構比對的技術與其應用。首先，我們提出一個兩階段以幾何雜湊 (Geometric Hashing) 為基礎的分子間(不限於蛋白質)相似性比對架構，此第一階段乃運用【幾何雜湊演算法】(Geometric Hashing Algorithm) 在分子間做整體性初步地配對，而第二階段乃運用【疊代式最近點演算法】(Iterative Closest Point Algorithm) 做局部的最佳化微調，直到配對的原子對在某一門檻距離下，不再增加為止。此架構用於分子間的比對，對於蛋白質EFG與非蛋白質EF-tu/tRNA複合物、蛋白質RRF與tRNA等比對的結果都相當好。另外，我們也與目前四個最常用且相互競爭的方法 (Yale, Dali, CE, Lund) 做比較，結果顯示以RMSD和相配對的原子數為標準來衡量，我們的方法是最好的，然而，我們的方法需要較多的計算時間。此外，運用這個比對架構，我們可以預測蛋白質可能的功能及活化區域，其主要的想法是：兩個蛋白質間即使序列不相似，但是，如果兩者之與功能息息相關的活化區域具有極其相似的三維幾何形狀，則它們仍然可能有類似的功能。我們建立了一個活化區域的資料庫，對一個有興趣知道功能的白質，我們將之與活化區域資料庫中的所有活化區域做相似性比對。我們的方法最終會選出一個在蛋白質中有最相像結構的活化區域，而這個活化區域的原始功能類似，即是我們預測這個蛋白質所應該擁有的功能。此外，蛋白質比對出來的最相像子結構，則是活化區域的可能位置。依據實驗結果顯示，藉著一些鑑別器的輔助，保守地估計酵素分類第四類的正確率約為42.12%，第五類約為79.06%，第六類約為60.78%。此外，我們也估算出比較高的正確率上限值。憑藉著這些研究的成果，我們期望能夠幫助生化學家儘快地測知未知蛋白質的功能，甚至發現已知蛋白質的新功能。With the explosion of protein sequences and structures storing into databanks, it is highly desirable to explore feasible alignment and classification methods for newly found protein enzymes classified into their respective enzyme classes by means of an automated procedure. This is indeed important because knowing which class an enzyme belongs to may help deduce its catalytic mechanism and specificity, giving clues to relevant biological functions. Traditional experimental methods are both time-consuming and costly. However, these requirements can be met by transforming them into a 3D common substructure matching problem among proteins. In this dissertation, I have proposed a two-pass geometric hashing based framework, the Geometric Hashing and Iterative Closest Point algorithms, to handle the 3D substructure matching problem of proteins. Two techniques, alpha shape and 3D reference frame schemes for feature extraction, are used to reduce the computation cost. Our methods aligned two molecules (not just proteins) based on 3D structural data. Two main experiments are conducted based on the data from the PDB. The first is to solve the molecular alignment problem, where, for example, the similarity between a protein EFG and a non-protein EF-tu/tRNA complex is calculated. We also compare with four popular tools (Yale, Dali, CE and Lund) with six protein pairs, and the results show that our method improves over the others in terms of RMSD and the number of matched atom pairs. However, our method is computationally more expensive. In the second experiment, which is also an innovative application, active site residues in a protein can be aligned and matched against other proteins with different enzyme classification numbers. With the consideration of atom type and binding surface area as discriminators, this method can reach an accuracy rate of 42.12% for enzymes classified in EC4, and 79.06% for EC5, and over 60.0% for EC6 conservatively, and a higher upper bound for accuracy rate is also evaluated. Both experiments demonstrate that the proposed methods are useful and versatile.Content 中文摘要 1 ABSTRACT 2 CHAPTER 1. INTRODUCTION 3 1.1 MOTIVATION 5 1.2 THESIS STATEMENT 7 1.3 CONTRIBUTIONS 8 1.4 PROTEIN STRUCTURES AND NUCLEIC ACIDS 9 1.5 ORGANIZATION 13 CHAPTER 2. RELATED WORK 14 2.1 SUBSTRUCTURE IDENTIFICATION OF PROTEINS 15 2.2 SEQUENCE ALIGNMENT 17 2.3 STRUCTURAL ALIGNMENT 19 2.4 PUBLIC DOMAIN ALIGNMENT TOOLS 21 CHAPTER 3. ALIGNMENT OF PROTEINS BASED ON 3D STRUCTURAL DATA 23 3.1 OVERVIEW 23 3.2 GEOMETRIC HASHING ALGORITHM 26 3.3 ALPHA SHAPE ALGORITHM 30 3.4 3D REFERENCE FRAME 32 3.5 ITERATIVE CLOSEST POINT ALGORITHM 35 3.6 THE MOLECULAR ALIGNMENT PROBLEM 37 CHAPTER 4. 3D SUBSTRUCTURE MATCHING OF PROTEINS 45 4.1 OVERVIEW 45 4.2 ACTIVE SITE RETRIEVAL 50 4.3 PROTEIN FUNCTION CLASSIFICATION AND PREDICTION 54 4.4 ONE WAY TO IMPROVE THE ACCURACY : THE MINIMAL BINDING SURFACE PROBLEM 62 4.5 DISCRIMINATORS FOR PROTEIN FUNCTION CLASSIFICATION AND PREDICTION 67 CHAPTER 5. CONCLUSION AND FUTURE WORK 73 5.1 CONCLUSION 73 5.2 FUTURE WORK 75 REFERENCES 76 APPENDICES 82 A. THE EXPERIMENTAL DATA OF PROTEIN FUNCTION CLASSIFICATION AND PREDICTION 82 A.1 A Summary of the Numbers of Site Structure and Test Protein 82 A.2 A Detailed List of Test Proteins 86 A.3 A Detailed List of Site Database 913259772 bytesapplication/pdfen-US活化區域幾何雜湊法疊代式最近點演算法結構比對蛋白質功能分類structure alignmentprotein function classificationactive sitegeometric hashingIterative Closest Point分子間三維子結構之比對與其應用於蛋白質功能分類On 3D Substructure Matching of Molecules and Its Applications in Protein Function Classificationthesishttp://ntur.lib.ntu.edu.tw/bitstream/246246/54066/1/ntu-93-D84526008-1.pdf