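Contributor: 歐陽彥正 (Oyang, Yen-Jen)
Affiliation: 臺灣大學:資訊工程學研究所 (National Taiwan University, Graduate Institute of Computer Science and Information Engineering)
Author: 簡廷因 (Chien, Ting-Ying)
Record dates: 2010-05-18; 2018-07-05
Issued: 2008
Identifier: U0001-2608200823040700
URI: http://ntur.lib.ntu.edu.tw//handle/246246/183647

Abstract (translated from Chinese):
Large-scale, non-manual annotation of protein functions and sequence signatures remains a major challenge in the post-genomic era. In this thesis, we use protein sequence signatures to design a method that predicts the catalytic sites of enzyme sequences. Our method generates protein sequence signatures by motif mining. Each signature contains several important residue blocks, also called conserved segments. These conserved segments often occur together in homologous sequences and have been carefully preserved during evolution, which indicates their importance. According to biological experiments, the catalytic residues of an enzyme are usually scattered across different regions of the protein sequence, so a complete prediction of catalytic residues requires signatures that are likewise scattered across different regions of the sequence. In this thesis, we collect catalytic residue information from the Catalytic Site Atlas (CSA) database to evaluate the performance of the proposed method. The test results show that our method identifies catalytic sites and catalytic residues better than the patterns in the PROSITE database. The method is implemented as the E1DS website (http://e1ds.csbb.ntu.edu.tw/). E1DS currently holds 5421 sequence signatures, which together cover 932 four-digit EC numbers. On average, for catalytic site prediction, E1DS reaches a correct rate of 35.5% and a success rate of 49.6%, while PROSITE reaches 18.9% and 33.7%, respectively; E1DS outperforms PROSITE on both measures. For catalytic residue prediction, the sensitivity of E1DS is 30.0%, better than that of PROSITE (16.2%), but its specificity (96.7%) is lower than that of PROSITE (98.6%).

Abstract (English):
Large-scale automatic annotation of protein sequences remains challenging in the post-genomic era. This thesis aims to predict the catalytic sites of enzyme sequences from a repository of protein signatures. The sequence signatures are derived with a motif-based method. The blocks of a signature, also called conserved regions, are composed of the key residues found among homologues. These blocks are conserved during evolution because of their importance to protein function. Biological experiments show that an enzyme catalytic site is usually composed of residues that are widely separated in the sequence. To predict catalytic sites comprehensively, the signatures must therefore contain residues that are widely scattered in the sequence. For this reason, we employ WildSpan, a recently developed pattern mining algorithm, to generate enzyme sequence signatures. WildSpan is designed to discover sequence motifs that span a large number of unimportant positions.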
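To make the idea of a signature whose conserved blocks are separated by many unimportant positions more concrete, the following is a minimal sketch of matching such a signature against a sequence. It is not WildSpan and not the E1DS implementation: the signature format (a list of literal residue blocks), the function names, the `max_gap` parameter, and the toy data are all assumptions made for illustration.

```python
import re

def compile_signature(blocks, max_gap=None):
    """Join conserved blocks with wildcard gaps into one regular expression.

    Each block is captured in its own group so its position can be reported.
    `max_gap` bounds the number of residues allowed between two blocks;
    None leaves the gap unbounded. (Hypothetical format, for illustration.)
    """
    gap = ".*?" if max_gap is None else ".{0,%d}?" % max_gap
    return re.compile(gap.join("(%s)" % re.escape(b) for b in blocks))

def match_blocks(signature, sequence):
    """Return the 0-based (start, end) span of each block, or None if no match."""
    m = signature.search(sequence)
    if m is None:
        return None
    return [m.span(i) for i in range(1, signature.groups + 1)]

# Hypothetical toy data: two conserved blocks separated by unimportant positions.
toy_signature = compile_signature(["GDS", "HGT"], max_gap=20)
toy_sequence = "MKAGDSLTAAPQRVVNHGTEE"
print(match_blocks(toy_signature, toy_sequence))  # [(3, 6), (16, 19)]
```

In a predictor built along these lines, the residues covered by the matched blocks would be the candidate positions to compare against annotated catalytic residues.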
To measure the performance of our method, we collect the annotated catalytic sites of 831 enzymes from the Catalytic Site Atlas (CSA). The results show that our method identifies catalytic sites and catalytic residues more effectively than the patterns derived from the PROSITE database. The proposed method has been implemented as a web server named E1DS (http://e1ds.csbb.ntu.edu.tw/). E1DS currently contains 5421 sequence signatures that together cover 932 four-digit EC numbers. On average, for catalytic site prediction, E1DS achieves a 'correct' rate of 35.5% and a 'success' rate of 49.6%, whereas PROSITE patterns achieve 18.9% and 33.7%, respectively. For catalytic residue prediction, the sensitivity of E1DS is 30.0%, higher than that of PROSITE (16.2%), although its specificity (96.7%) is slightly lower than that of PROSITE (98.6%).

Table of contents:
Acknowledgements
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Predicting Functional Residues
2.2 Sequence Alignment Algorithms
Chapter 3 Methods
3.0 Overview
3.1 Data Collection
3.2 Sequence Signature Construction
3.3 Evaluating Sequence Signatures
3.4 Prediction Method
Chapter 4 Experiments
4.1 Catalytic Residue Dataset
4.2 Performance Evaluation
Chapter 5 Website
5.1 Home Page
5.2 Results Page
5.3 Error Messages
Chapter 6 Conclusion
References

Format: application/pdf, 1043043 bytes
Language: en-US
Keywords (translated from Chinese): protein sequence mining; catalytic site; sequence signature; EC number; enzyme function
Keywords: Sequential pattern mining; Catalytic site; Signature; EC number; Enzyme function
Title: 利用序列特徵探勘預測酵素催化部位 (Prediction of enzyme catalytic sites by sequential pattern mining)
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/183647/1/ntu-97-R95922108-1.pdf