College: College of Bioresources and Agriculture: Graduate Institute of Agronomy
Advisor: 蔡政安
Author: 曾禹翔 (Zeng, Yu-Shiang)
Record dates: 2017-03-06, 2018-07-11
Year: 2015
URI: http://ntur.lib.ntu.edu.tw//handle/246246/275909
Abstract: With the rapid development and growing maturity of next-generation sequencing (NGS) technology, it has come into wide use in fields such as medicine, agriculture, and biotechnology. NGS makes whole-genome sequencing and de novo sequencing possible and can be used to investigate biological theory; one of its important applications is transcriptome sequencing (RNA-seq). RNA-seq data are commonly used to measure gene expression levels and to test whether a specific gene is differentially expressed, and in recent years they have gradually replaced microarray data as the standard measure of gene expression. However, RNA-seq read counts are discrete, and their variance tends to exceed their mean, a phenomenon known as over-dispersion. The negative binomial model is usually applied to handle over-dispersion, but estimating its parameters raises further statistical issues. Commonly used analysis tools for RNA-seq data, such as DESeq, edgeR, and DSS, rely on point estimates of the parameters and do not take uncertainty into account. In this thesis, we build two models, a log-linear model and a Bayesian hierarchical model.
Here, we use the Markov chain Monte Carlo (MCMC) method to estimate the parameters of interest, which are then used to detect differentially expressed genes. Finally, we compare the performance of DESeq, edgeR, DSS, and our methods on both simulated and real RNA-seq data. Our log-linear model outperforms DESeq, edgeR, and DSS when the numbers of replicates in the groups are close or equal. When the numbers of replicates are extremely unbalanced, we recommend the median estimator for detecting differentially expressed genes.
File size: 4843119 bytes
Format: application/pdf
Public release date: 2015/8/25
Usage rights: paid licensing agreed (royalties returned to the author)
Keywords: RNA-seq data; gene expression; Bayesian inference; log-linear model
Title: Identification of Differentially Expressed Genes of RNA-Seq Data based on Bayesian Approaches (以貝氏分析方法來偵測轉錄體定序資料之顯著基因)
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/275909/1/ntu-104-R02621208-1.pdf