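Abstract: The launch of Illumina's new product (The HiSeq X 10) in early 2014 made it possible to complete personal whole genome sequencing (WGS) for less than 1,000 USD, signaling that the era of widespread personal genome sequencing has arrived. WGS data analysis is expected to play a key role in realizing personalized medicine and precision medicine. As more and more research institutes release hundreds or thousands of personal genomes (for example, the 1000 Genomes Project, http://www.1000genomes.org/), and as the local Taiwan Biobank prepares to release whole genome sequences (http://www.twbiobank.org.tw/new_web/dna.php), computational biology is entering a new era of big data. The challenges include: how can individual differences be explained by the roughly 0.1% ~ 1% of sequence variants that differ between individuals? How can human diseases be understood from the sequence variants shared by a group of people? More importantly, how can key reference information for clinical diagnosis be produced in a timely manner with limited computing resources? In Taiwan, the cost of personal whole genome sequencing has fallen below NT$100,000. However, some core bioinformatics technologies are still missing between the variant list (Variant Call Format, VCF) produced by a basic analysis pipeline and the key information required for clinical genetic diagnosis. This industry-academia collaboration project aims to combine the clinical genetic diagnosis experience accumulated over many years by Dr. Pei-Lung Chen of the Department of Genomic Medicine, National Taiwan University Hospital, with the advanced bioinformatics algorithms developed over many years of computational biology research by Dr. Chien-Yu Chen of National Taiwan University, and to transfer the key technologies to a local bioinformatics software company through joint development. The project has three main goals:
1. Sub-project 1 (clinical genetic testing), led by Dr. Pei-Lung Chen, aims to solve the most challenging key steps in clinical genetic diagnosis using next-generation sequencing;
2. Sub-project 2 (transcriptional regulation annotation), led by Dr. Chien-Yu Chen, aims to build a knowledge base of transcription factor binding sites and to identify transcriptional regulatory variants that cause differences among populations or individuals;
3. The partner company (亞大基因科技股份有限公司) aims to integrate local Taiwanese reference genome data and improve the variant detection accuracy of standard pipelines.
亞大基因 is a software R&D team with strengths in both bioinformatics and big data analytics. It is dedicated to developing forward-looking computational technologies for human genome sequence analysis based on big data and data mining, thereby helping hospital physicians provide better personalized medical services. The cross-disciplinary research team assembled for this project, together with its planned work and goals, will help accelerate the discovery of gene sequence variants that cause human diseases, and is expected to make a substantive contribution to future personalized medicine.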
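(Sub-project 1)
Precision medicine (also known as personalized medicine) is a new model of medical practice that aims to provide customized healthcare. Of all the individual differences on which precision medicine is based, individual genes and the whole genome provide the most critical information. The classic examples of precision medicine accordingly include genetic diagnosis, genetic counseling (for patients or their family members), and treatment selection guided by genetic test results. Over the past two decades we have witnessed many major breakthroughs in genomic medicine, including the Human Genome Project, the International HapMap Project, the Encyclopedia of DNA Elements (ENCODE) Project, and the 1000 Genomes Project. New technologies such as next-generation sequencing (NGS, also known as massively parallel sequencing), with their high throughput, comprehensive coverage, and low cost, are a revolutionary advance for genomic medicine. However, many difficulties remain before personalized genomic medicine can be realized, especially in Taiwan. First, converting massive amounts of raw data into variants that are meaningful at genomic positions is not easy. Second, annotating these variants with medical or physiological meaning is very difficult, especially for variants outside exons. Third, some variants, such as large structural variations and those in HLA genes and CYP genes, are notoriously difficult both experimentally and in data analysis. Fourth, for non-Caucasian populations (such as ours in Taiwan), the lack of population frequencies for variants from a matched ethnic background makes correct comparison almost impossible. Fifth, for personalized genomic medicine to become truly routine, an automated Chinese-language report generation system is indispensable. To solve these problems, we have assembled a collaborative team that combines expertise in genomic medicine, bioinformatics, big data, and software engineering. This sub-project has seven aims: (1) to ensure sample quality control metrics and sample correctness; (2) to correctly analyze important pharmacogenomic biomarkers (such as HLA genes and CYP genes); (3) to substantially improve the analysis of large structural variations (such as large deletions, large insertions, copy number variations, inversions, and translocations), which has long been very difficult; (4) to assign correct clinical meaning to variants found in exonic regions; (5) to assign correct clinical meaning to variants found in non-exonic regions; (6) to automatically generate reports that medical staff can readily understand; and (7) to allow genetic diagnosis reports to be produced in either Chinese or English. It is worth emphasizing that Taiwan Biobank currently grants access only to applications for academic research. This study will follow that rule when annotating local Taiwanese genomic samples. Only when Taiwan Biobank authorizes commercial use of these data (which is indeed among Taiwan Biobank's planned goals) will we license them to 亞大基因 (and other possible commercial users) for formal commercial use. In the first year of the project, we will produce a software suite that can be sold or subscribed to on the market and used to analyze the most important and most common genetic test results. In the second year, we will further strengthen and polish the software and add the most advanced, in-depth functions and analysis capabilities. We expect this software to dominate the Taiwanese market and to gain a large share of the Chinese-language market. Even in the English-language market, it should be very competitive.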
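(Sub-project 2)
This sub-project aims to predict 3D structure variants from protein sequence variants of transcription factors, and then to predict from those structure variants the resulting loss or gain of protein function, which in turn affects the expression of other genes and leads to disease. Sub-project 2 of this industry-academia collaboration takes an important step toward transferring key technologies, built on our laboratory's long-term research in computational biology, to a local bioinformatics software company. In the first year, we will modify previously developed structure analysis algorithms and energy functions so that they can predict how protein sequence variants change the protein's affinity for DNA. The project will develop computational methods that adjust the protein 3D structure model in response to sequence variants, in order to accurately quantify the functional loss or gain produced by protein structural changes. In the second year, building on our laboratory's previous work on finding transcription factor binding sequence motifs, we will compile lists of target genes for human transcription factors. Together with the affinity changes predicted by the structure models and energy functions above, these lists will be used to evaluate how protein sequence variants of a transcription factor affect the transcriptional regulation of its downstream genes. The project intends to apply computational methods developed over many years to the search for sequence variants related to human diseases. In addition to screening significant sequence variants from the genomes of disease and control groups for affinity-change prediction, we plan to use sequence data from the 1000 Genomes Project (healthy individuals) to examine whether sequence variants in transcription factors co-vary during evolution with sequence variants in the DNA they bind. If such covariation exists, it would explain how organisms use sequence covariation to compensate for functional abnormalities caused by mutation, and its effect on gene expression can then be reflected in the quantitative formula where appropriate. The planned work and goals of this project will help accelerate the discovery of gene sequence variants that cause human diseases, and are expected to make a substantive contribution to future personalized medicine.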
Abstract: The new product (The HiSeq X 10), announced by Illumina at the beginning of 2014, made it possible to complete personal whole genome sequencing at a cost lower than 1,000 USD. This achievement indicates that the era of personal genomes for everyone has started. As more and more personal genomes are released by research institutes (for example, the 1000 Genomes Project, http://www.1000genomes.org/ and Taiwan Biobank, http://www.twbiobank.org.tw/new_web/dna.php), the computational biology community is confronted with a new world of big data. The challenges include: how can the phenotypic differences between two persons be explained by the 0.1% ~ 1% of sequence variants in their genomes? How can variants commonly observed in a group of people be associated with human diseases? An even more important task is to transform raw sequencing data efficiently into valuable information for clinical genetic diagnosis using limited computational resources. This integrated project aims to turn the core technologies developed in the labs of Dr. Pei-Lung Chen and Dr. Chien-Yu Chen over the past ten years into key technologies needed by the bioinformatics products of local bioinformatics companies. The main objective is to incorporate the analysis pipelines developed in the lab of Dr. Pei-Lung Chen into the final stage of annotating personal genomes, for the purpose of generating clinical genetic reports. Meanwhile, the technologies from the lab of Dr. Chien-Yu Chen will provide the key features for non-coding region annotation in interpreting transcriptional regulation. This project plans to use the computational methods developed in past years to discover significant sequence variants related to human diseases. In addition, common variants will be analyzed using genomes from the 1000 Genomes Project and Taiwan Biobank.
This would be a very important step toward translating the Taiwan Biobank data into the information that local bioinformatics software needs in the final stage of generating clinical genetic reports. The proposed work and objectives are expected to accelerate the discovery of disease-related variants in genomes, contributing to personalized medicine in the near future.
(Sub-project 1)
Precision medicine, also known as personalized medicine, is a model of medical practice that provides customized healthcare. Among all the personal variations that precision medicine is based upon, genetics/genomics provides the most critical information. Genetic diagnosis, genetic counseling (for the patient and family members) and genetics-guided treatment are all classical examples of precision medicine. We have witnessed many major advances in genomics during the last two decades that make personalized genetic/genomic medicine possible, including the Human Genome Project, the International HapMap Project, the ENCODE Project and the 1000 Genomes Project. Furthermore, next-generation sequencing (NGS), also known as massively parallel sequencing, is revolutionizing medical genomics because of its high throughput, comprehensive coverage and low cost. However, many obstacles hamper the success of personalized medical genomics, especially in Taiwan. First, converting huge amounts of raw sequencing data into genome-aware variants is not easy. Second, assigning correct medical/physiological meanings to those variants, especially non-coding variants, can be very difficult. Third, some variants, such as structural variations and variants in human leukocyte antigen (HLA) genes and cytochrome P450 (CYP) genes, are notoriously challenging for wet laboratory experiments as well as for dry laboratory bioinformatics analyses. Fourth, the lack of relevant allele frequency information from the same ethnic background for non-Caucasian populations (such as our people in Taiwan) makes reliable reference comparison almost impossible. Fifth, for personalized genomic medicine to become a daily practice, an automatic Chinese-language reporting software suite is indispensable. To tackle the above-mentioned problems, we have formed a team that combines expertise in medical genetics/genomics, bioinformatics, big data and software engineering. This sub-project has seven specific aims: (1) to perform quality control measurement and to secure sample correctness, (2) to analyze pharmacogenomic biomarkers (including HLA genes and CYP genes), (3) to improve the analysis of structural variations (including large deletions, large insertions, copy number variations, inversions, translocations, etc.), (4) to assign clinical/physiological meanings to identified coding variants, (5) to assign clinical/physiological meanings to non-coding variants, (6) to automatically generate physician-friendly medical reports, and (7) to offer the choice of generating reports in either Chinese or English. We would like to emphasize that at the current stage, Taiwan Biobank is open only to applications for academic use; we will strictly abide by this regulation. Commercial use of the data (which is clearly listed on the roadmap of Taiwan Biobank) will be launched only after we obtain permission from Taiwan Biobank. By the end of the first year, we will have a marketable software suite covering the fundamental functions. In the second year of this project, we will polish the software suite and add more advanced functionalities and analyses. We expect this product to dominate Taiwan's market and to have a large share in the Chinese-speaking world (including China). This software should also be competitive in the English-speaking market.
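The fourth obstacle above (missing population-matched allele frequencies) can be illustrated with a minimal Python sketch. It is not the project's software: the `Variant` class, the `filter_rare` function, the 0.01 cutoff, and both frequency tables are hypothetical, and the frequencies are made up for illustration. The sketch only shows how a rarity filter can give different answers depending on which population's frequencies are used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Key identifying a variant: (chromosome, 1-based position, ref allele, alt allele).
VariantKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (one alternate allele per record)."""
    chrom: str
    pos: int
    ref: str
    alt: str

    @property
    def key(self) -> VariantKey:
        return (self.chrom, self.pos, self.ref, self.alt)


def filter_rare(variants: List[Variant],
                population_af: Dict[VariantKey, float],
                max_af: float = 0.01) -> List[Variant]:
    """Keep variants whose allele frequency in the given population is <= max_af.

    Assumption: a variant missing from the table is treated as frequency 0.0.
    """
    return [v for v in variants if population_af.get(v.key, 0.0) <= max_af]


# Made-up data: the second variant is rare in one population but common in another.
variants = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 2000, "C", "T")]
other_population_af = {("chr1", 1000, "A", "G"): 0.001,
                       ("chr2", 2000, "C", "T"): 0.002}
local_population_af = {("chr1", 1000, "A", "G"): 0.001,
                       ("chr2", 2000, "C", "T"): 0.15}

rare_by_other = filter_rare(variants, other_population_af)
rare_by_local = filter_rare(variants, local_population_af)
```

With the unmatched table both variants pass the rarity filter, while with the matched table the second variant is excluded as common, which is the kind of discrepancy a local reference dataset is meant to resolve.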
(Sub-project 2)
This integrated project aims to turn the core technologies developed in our lab over the past ten years into key technologies required by the bioinformatics products of local bioinformatics companies. The main objective is to incorporate tools that predict protein structure changes caused by sequence variants in DNA into the annotation of personal genomes. The structure changes can then be used to predict potential loss or gain of protein function. For a transcription factor (TF), loss or gain of function corresponds to a change in affinity when the TF binds to its binding sites in DNA. Such an affinity change might affect the expression of the TF's target genes, resulting in human diseases. In the first year of the proposed project, we will modify the algorithms we developed in the past few years for predicting protein-DNA binding affinity changes due to DNA sequence variants, so that they can predict the affinity changes upon DNA binding that are due to protein sequence variants. In the second year, we plan to compile the target gene list for each human TF, based on the computational methodologies developed in our lab for predicting TF binding sequence motifs. Along with the structure models and energy function described above, this project aims to develop computational methods for predicting differential expression of the target genes of the TF. This project plans to use the computational methods developed in past years to discover significant sequence variants related to human diseases. Meanwhile, the covariation between TFs and TF binding sites (TFBSs) will be investigated using genomes from the 1000 Genomes Project and Taiwan Biobank, and the energy function and affinity prediction will be adapted to account for covariation between TFs and TFBSs.
The proposed work and objectives are expected to accelerate the discovery of disease-related variants in genomes, contributing to personalized medicine in the near future.
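The structure-based energy function of Sub-project 2 is not specified in this abstract. As a deliberately simpler stand-in, the Python sketch below uses a position weight matrix (PWM) log-odds score, a common way to score TF binding sequence motifs, to show how the effect of a binding-site variant could be quantified as a score difference. The count matrix, pseudocount, uniform background, and the function names `log_odds_pwm` and `score` are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def log_odds_pwm(counts: List[Dict[str, int]],
                 pseudocount: float = 1.0,
                 background: float = 0.25) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum the log-odds of each base of an uppercase A/C/G/T site of motif length."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(column[base] for column, base in zip(pwm, site))


# Made-up counts for a 4-bp motif whose consensus is ACGT.
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
pwm = log_odds_pwm(counts)

reference_site = "ACGT"
variant_site = "ATGT"  # C>T at the second position of the binding site

# A negative difference means the variant site matches the motif less well.
delta = score(pwm, variant_site) - score(pwm, reference_site)
```

A PWM treats positions independently and ignores protein structure, so it cannot capture the protein-side variants or the TF-TFBS covariation described above; it serves only as a baseline against which a structure-based affinity prediction could be compared.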