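高成炎 (Kao, Cheng-Yen)
National Taiwan University: Graduate Institute of Computer Science and Information Engineering
張育榮 (Chang, Yu-Jung)
2010-06-02
2018-07-05
2008
U0001-3107200803523200
http://ntur.lib.ntu.edu.tw//handle/246246/184801

Finding and tracing the segments of common evolutionary origin among the genomes of different organisms, known as synteny and orthology mapping, is a fundamental task in comparative genomics. With advances in sequencing technology, more and more large genomes have been completely or nearly completely sequenced. This makes synteny and orthology mapping by whole-genome comparison increasingly important, and it also brings new research challenges. The core research question is how to build comparison engines and methods with high sensitivity, high specificity, and high efficiency for comparing numerous genome sequences that have diverged over time and often span billions of base pairs.

We first developed the UniMarker method for orthology and synteny mapping between closely related large genomes. Taking the human-mouse comparison as an example, the method uses short sequences of length 16 that occur exactly once in each of the two genomes to build occurrence spectra, from which orthologous and syntenic genomic segments are detected. Experimental results show that orthology and synteny mapping of the human and mouse genomes (each about three billion base pairs long) can be completed in a few hours on a single personal computer, and that the resulting map is 99% consistent with the map of the Mouse Genome Sequencing Consortium (MGSC).

Next, for orthology and synteny mapping between large genomes that are not closely related, we propose a new type of seed called maximal α-marker pairs (α-pairs for short), where α is the upper bound on the total number of occurrences of the seed in the two sequences being compared. This selection scheme differs from the common ones that constrain seed length without considering copy number, such as fixed-length k-mers and MEMs with a minimum length. Based on enhanced suffix arrays, we present a linear-time algorithm that generates all α-pairs. In experiments comparing human against mouse, chicken, and pufferfish, the α-marker method achieved both markedly better sensitivity and better efficiency than the length-constrained methods (k-mer, MEM) in orthology seeding with contiguous matching. In addition, we extend this copy number-based approach to orthology seeding with discontiguous matching. Comparisons of ROC curves show that the discontiguous wobble α-pairs clearly outperform other discontiguous seeds without copy number constraints (spaced k-mer seeds).

Motivation: Orthology/synteny mapping, that is, finding orthologous regions among genomes and organizing these evolutionary counterparts into a coherent global picture, is fundamental to studies of comparative genomics. With the increasing number of completely sequenced genomes, and thus of comparisons of massive nucleotide sequences, the need for orthology/synteny mapping methods of high sensitivity, specificity, and efficiency becomes even more compelling.

Results: First, we have developed the UniMarker (UM) method for synteny mapping of large genomes that are closely related, such as human and mouse. In this method, the occurrence spectra of genome-wide unique 16-mer sequences present in both the human and mouse genomes are used to directly detect orthologous genomic segments.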
Being sequence alignment-free, the UM method is very fast, and high-quality human-mouse synteny maps based on DNA comparisons can be completed in a few hours on a single desktop computer. Second, we propose a new type of DNA sequence seed for use in orthology mapping of genomes that are not closely related. We call our seeds α-pairs, where α is an integer equal to or greater than the number of times any qualifying seed can be found in the compared genomes. These copy number-based seeds are thus distinct from the well-known length-based seeds, such as fixed-length k-mer seeds or maximal exact match (MEM) seeds, which have a length of no less than k. We present a linear-time algorithm, based on enhanced suffix arrays, that efficiently retrieves α-pairs in two given genomic sequences. A comparison of α-pairs with length-based seeds in their ability to detect the orthologues annotated by Ensembl and COG, for several vertebrate genomes/chromosomes and for prokaryote genomes of long evolutionary distances, suggested that orthology seeding using copy number can achieve higher sensitivity and better efficiency than orthology seeding using length. Moreover, we extend the α-pair method to generate discontiguous wobble seeds of maximal length with copy number constraints. ROC curves for human chr.15 vs. mouse chr.7, chicken chr.10, and the pufferfish genome showed that the discontiguous wobble α-pairs performed significantly better than the spaced k-mer seeding methods tested.

1 Introduction
1.1 Motivation
1.2 Dissertation organization
2 Background
2.1 Homology and synteny
2.1.1 Homology
2.1.2 Synteny
2.2 Index-based sequence comparison
3 The UniMarker method for synteny mapping
3.1 Introduction
3.2 Methods
3.2.1 pUMp vs. hUMp
3.2.2 Occurrence spectra of UMps and anchoring islands
3.2.3 Overlapped anchoring islands
3.2.4 Bidirectional mapping
3.2.5 Conserved segments and syntenic blocks
3.2.6 Comparison with other maps
3.2.7 BLASTZ evaluation
3.2.8 Software
3.3 Results
3.3.1 Maps from various versions of the human genome
3.3.2 Comparison with maps produced by MGSC and Ensembl
3.3.3 Evaluation with sequence alignment
3.3.4 Evaluation with LIS analysis of UMps
4 Copy number-based orthology seeding using contiguous matches
4.1 Introduction
4.2 Methods
4.2.1 α-markers and α-pairs
4.2.2 A linear time α-pair retrieval algorithm
4.2.3 Evaluation of orthology seeding
4.2.4 Datasets and software
4.3 Results
4.3.1 α-pairs vs. MEM or k-mer in vertebrate sequences
4.3.2 α-pairs vs. MEM or k-mer in prokaryote sequences
4.3.3 α-pairs vs. MUM or MAM
4.3.4 The number of α-pairs increases linearly with α
5 Extending α-markers/α-pairs to discontiguous seeding models
5.1 Introduction
5.2 Methods
5.2.1 Discontiguous α-markers and α-pairs
5.2.2 Evaluation of orthology seeding
5.3 Results
5.3.1 Comparisons of ROC curves for wobble-aware α-pairs/MEMs, spaced k-mer seeds and exact α-pairs/MEMs
5.3.2 Comparisons of colinear identities vs. total number of seeds for wobble-aware α-pairs/MEMs, spaced k-mer seeds and exact α-pairs/MEMs
6 Discussion and conclusions
6.1 Discussion
6.2 Conclusions
Bibliography
List of Publications

application/pdf
2051928 bytes
en-US
comparative genomics; synteny mapping; orthology mapping; sequence alignment; seeding; suffix array
詞頻探索方法用於高效率之基因體同源與同線圖譜對映
Copy Number-Based Seeding Approaches to Efficient Orthology and Synteny Mapping in Genome Comparisons
thesis
http://ntur.lib.ntu.edu.tw/bitstream/246246/184801/1/ntu-97-D90922014-1.pdf
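
The copy number idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the thesis retrieves maximal, variable-length α-pairs in linear time from enhanced suffix arrays, whereas the sketch below uses fixed-length k-mers and a plain hash table, ignores reverse-complement strands, and does not build occurrence spectra. The function names `kmer_positions` and `copy_number_seeds` are hypothetical. With `alpha=2` it keeps only k-mers that occur exactly once in each sequence, which is the uniqueness condition the UniMarker method applies to 16-mers.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each length-k substring of seq to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def copy_number_seeds(seq_a, seq_b, k, alpha=2):
    """Return (kmer, positions_in_a, positions_in_b) for every k-mer that is
    present in both sequences and whose total number of occurrences across
    the two sequences is at most alpha (simplified, fixed-length stand-in
    for copy number-based seeding)."""
    pos_a = kmer_positions(seq_a, k)
    pos_b = kmer_positions(seq_b, k)
    seeds = []
    # Only k-mers found in both sequences can seed a match; sort for stable output.
    for kmer in sorted(pos_a.keys() & pos_b.keys()):
        if len(pos_a[kmer]) + len(pos_b[kmer]) <= alpha:
            seeds.append((kmer, pos_a[kmer], pos_b[kmer]))
    return seeds


if __name__ == "__main__":
    # Toy sequences; real inputs would be whole chromosomes or genomes.
    for seed in copy_number_seeds("ACGTACGGA", "TTACGGACC", k=4):
        print(seed)
```

Raising `alpha` admits seeds that repeat within the compared sequences, while the length `k` stays fixed here; in the thesis the constraint is placed on copy number and the seed length is left to be maximal.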