蛋白質最大序列樣式探勘演算法

李瑞庭臺灣大學：資訊管理學研究所凌宇Ling, YuYuLing2007-11-262018-06-292007-11-262018-06-292005http://ntur.lib.ntu.edu.tw//handle/246246/54357蛋白質序列中的序列樣式，和蛋白質所執行的功能有著密不可分的關係。因此，如何從蛋白質序列資料庫中，透過系統化的方法，探勘出重要的序列樣式，已成為一個相當重要的研究課題。針對此一課題，本論文提出了一個以字尾樹(suffix tree)為基礎的演算法。演算法中的所有過程，包括探勘序列樣式中的封閉性高頻子字串(closed frequent substring)，將封閉性子字串組成最大高頻序列樣式(maximal frequent sequential pattern)，以及調整序列樣式中子字串間的間隔(gap)等，皆可利用字尾樹中所記錄的發生資訊(occurrence information)來完成。而為了確保序列樣式的精簡性，我們的演算法刪減了不必要的序列樣式，僅保留最大序列樣式。由實驗的結果顯示，我們的演算法不僅能夠找出PROSITE資料庫中所記錄的序列樣式，並且還發現了其他一些值得提供生物學家進一步研究的結果，例如更長的序列樣式，及分類樣式集合(classifier pattern set)等。另外，我們演算法在實驗中，也展現了較 Chang and Halgamuge 的方法更為優良的結果。Because of the close relationship between sequential patterns and protein function, systematically mining significant sequential patterns in protein databases has become an important research topic. In this thesis, we proposed a suffix-tree-based algorithm to discover patterns in protein databases. We use the occurrence information maintained in the suffix tree to mine closed frequent substrings, generate maximal frequent sequential patterns, and adjust the gaps within the patterns. To ensure the compactness of the patterns we generate, we do not generate all patterns but only maximal patterns. From the experimental results, our proposed algorithm can find not only the patterns recorded in PROSITE database, but also some other patterns worth of further biological studying, such as longer patterns and the classifier pattern set. Besides, our proposed algorithm generates better results than those of Chang and Halgamuge’s method in the experiment.Table of Contents i List of Figures ii List of Tables iii Chapter 1 Introduction 1 Chapter 2 Literature Survey 3 2.1 Sequential mining methods 3 2.2 Feature-based methods 3 2.3 Mathematical methods 6 2.4 MSA-based methods 6 2.5 Chang and Halgamuge’s algorithm 7 2.6 Discussion 8 Chapter 3 Our Proposed Algorithm 10 3.1 Term and notation description 10 3.2 Problem definition and algorithm overview 12 3.3 Finding all closed frequent substrings 14 3.4 Finding all maximal frequent patterns 21 3.5 Gap adjustment 23 Chapter 4 Experiments and Performance Evaluation 28 4.1 Experiments on real data 28 4.2 Comparisons with PROSITE database 33 4.3 Comparisons with Chang and Halgamuge’s algorithm 38 Chapter 5 Conclusions and Future Work 40 References 42689972 bytesapplication/pdfen-US蛋白質最大序列樣式字尾樹proteinmaximal sequential patternsuffix tree蛋白質最大序列樣式探勘演算法Mining Maximal Sequential Patterns in Protein Databasesotherhttp://ntur.lib.ntu.edu.tw/bitstream/246246/54357/1/ntu-94-R92725020-1.pdf