dc.description.abstract | Unsupervised classification methods such as cluster analysis and factor analysis aim to find meaningful structures among the observed attributes. These structures are usually expressed as groupings of attributes based on their similarities or relationships. A drawback of factor analysis, however, is rank deficiency in the numerical computation. For example, in microarray data analysis, expression levels of 10,000 to 20,000 genes are collected for each array, so the number of genes is usually far larger than the number of microarrays. Cluster analysis, on the other hand, can handle a large number of attributes with few samples. Its drawbacks include misapplication of the correlation coefficient, the difficulty of evaluating cluster quality, and the difficulty of determining the number of clusters.
In this research, we first discuss how to characterize interrelationships among attributes, and then develop clustering methods suited to grouping interrelated attributes. The “R2 with PCA” method emphasizes the linear relationships between two clusters, while the “Variance explanation” method considers not only the interrelations among attributes but also the variation of the attributes. We also propose statistics for evaluating cluster quality that take into account both the interrelationships among clusters and the variance explained by each cluster. Finally, we apply these methods to two cases: 19 blood tests of 24 human subjects, and Down syndrome microarray data. | en |
dc.relation.reference | [1] Anderberg, M. (1973). Cluster Analysis for Applications. Academic Press.
[2] Lewis-Beck, M. S. (1994). Factor Analysis and Related Techniques. London: Sage Publications.
[3] Lin, W. T. (2004). Systematic data preprocess procedures and factor extraction of multiple phenotypes for one-color microarray, National Taiwan University.
[4] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 95, 14863-14868.
[5] Rindflesch, T.C., Libbus, B., Hristovski, D., et al. (2003). Semantic Relations Asserting the Etiology of Genetic Diseases. Proc AMIA Symp, Submitted
[6] Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9, 1106-1115.
[7] Milligan, G. W., Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159-179.
[8] Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a dataset via the Gap statistic.
[9] D’haeseleer, P. (2000). Reconstructing Gene Networks from Large Scale Gene Expression Data. University of New Mexico.
[10] Belsley, D.A. et al. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
[11] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psych., vol. 24, pp. 417-441.
[12] Jolliffe, I. T. (2002). Principal Component Analysis, 2nd Edition. Springer, New York
[13] Dudoit, S., Yang, Y., Callow, M. J., Speed, T. P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin., in press.
[14] Tibshirani, R., Tusher, V., Chu, C. (2001). Significance analysis of microarrays applied to ionizing radiation response. Proceedings of the National Academy of Sciences. First published April 17, 2001, 10.1073/pnas.091062498. | en |