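Systematic data preprocess procedures and factor extraction of multiple phenotypes for one-color microarray (單色微晶片資料之系統化預處理與多重顯型之因子萃取)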

Contributors: 陳正剛; National Taiwan University: Graduate Institute of Industrial Engineering
Author: 林雯婷 (Lin, Wen-Ting)
Dates: 2007-11-26; 2018-06-29
Year: 2004
URI: http://ntur.lib.ntu.edu.tw//handle/246246/51258

Abstract:

Microarrays are widely used to monitor gene expression and to yield information about genomes. Although many methods and mechanisms have been proposed for extracting information from microarray data, the preprocessing of the raw expression data determines the accuracy and reliability of the extracted information. The first objective of this research is to implement a systematic procedure for preprocessing raw intensity readings. The procedure has three steps: rectification of intensity readings, signal normalization, and bad-spot screening. Rectification uses the coefficient of variation (CV) to assess the consistency of the mean and median intensities in the raw readings and so decide which one to employ; it then tests the correlation between foreground and background intensity and corrects for background intensity effects. Signal normalization transforms the rectified data by logarithm transformation, median subtraction, and deviation division to remove chip-to-chip variation in brightness and contrast. After signal normalization, a t-test is used to screen out bad expressions among replicated spots.
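The signal normalization step described above can be sketched in a few lines. This is only an illustrative sketch, not the thesis's implementation: the function name `normalize_chip`, the base-2 logarithm, and the reading of "deviation division" as division by the sample standard deviation are all assumptions, and the intensity values are made up for the example.

```python
import math
import statistics


def normalize_chip(intensities):
    """Normalize one chip's rectified intensities (all values must be > 0).

    Hypothetical helper: log base 2 and the sample standard deviation
    are assumptions, not details taken from the thesis.
    """
    # Logarithm transformation.
    logged = [math.log2(x) for x in intensities]
    # Brightness (location) normalization: subtract the chip median.
    center = statistics.median(logged)
    centered = [v - center for v in logged]
    # Contrast (scale) normalization: divide by the chip's deviation.
    scale = statistics.stdev(centered)
    return [v / scale for v in centered]


# Made-up readings: chip_b is chip_a at 4x brightness,
# chip_c is chip_a with doubled contrast on the log scale.
chip_a = [120.0, 340.0, 560.0, 980.0, 2100.0]
chip_b = [4.0 * x for x in chip_a]
chip_c = [x ** 2 for x in chip_a]

norm_a = normalize_chip(chip_a)
norm_b = normalize_chip(chip_b)
norm_c = normalize_chip(chip_c)
```

Under these assumptions, a multiplicative brightness change becomes an additive shift after the logarithm and is removed by median subtraction, while a contrast change becomes a rescaling and is removed by the division, so the three chips end up with matching normalized values.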
More recently, microarray experiments have been conducted not only to relate genes to a single phenotype but also to investigate relations between gene expression levels and multiple phenotypes. The second objective of this research is to apply Factor Analysis (FA) to extract the underlying co-regulating and independent factors of the multiple phenotypes. Each treated factor can then be taken as an individual phenotype for testing differentially expressed genes. Both objectives aim to prepare experimental readings for accurate and effective biological information mining. Finally, a real microarray experiment investigating gene expression in 24 human blood samples with 19 phenotypes is provided to demonstrate and test the proposed preprocessing procedures.

Table of contents:

Abstract
Contents
Contents of Figures
Contents of Tables
Chapter 1: Introduction
  1.1 Backgrounds and Motivation
Chapter 2: Preprocess of Gene Expression Raw Data and Multiple Phenotypes Analysis
  2.1 Rectification of Intensity Reading
    2.1.1 Selection of intensity reading
    2.1.2 Background intensity correction
  2.2 Signal Normalization
    2.2.1 Logarithm transformation
    2.2.2 Brightness (location) normalization
    2.2.3 Contrast (scale) normalization
  2.3 Bad signal screening
  2.4 Preprocess of Multiple Phenotypes
Chapter 3: Case Study
Chapter 4: Conclusions and Future Researches
References
Appendix A: Spot CV of median intensity and median intensity for selection of intensity in blood dataset.
Appendix B: More Details of Factor Analysis.

File size: 459117 bytes
Format: application/pdf
Language: en-US
Keywords: microarray (微晶片); preprocessing (預處理); factor extraction (因子萃取); preprocess; microarray data analysis; normalization; multiple phenotypes
Title: 單色微晶片資料之系統化預處理與多重顯型之因子萃取 (Systematic data preprocess procedures and factor extraction of multiple phenotypes for one-color microarray)
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/51258/1/ntu-93-R91546029-1.pdf