Data Mining for the Classification Problem-The Inspiration of Genetic Programming
Date Issued
2006
Date
2006
Author(s)
Huang, Jih-Jeng
DOI
en-US
Abstract
With the rapid development of storage technology, enterprises widely employ databases and data warehouses to extract useful information for supply chain management (SCM), enterprise resource planning (ERP), and customer relationship management (CRM). To extract the useful knowledge hidden in a database or data warehouse effectively, data mining plays a central role in the process of knowledge discovery in databases (KDD).
Data mining can be considered the core of KDD: an iterative and interactive process that extracts valid, nontrivial, and interesting information and knowledge from large amounts of data. Data mining tasks can be divided into classification, regression, deviation detection, clustering, association rules, and sequential patterns. This dissertation focuses on the problem of data classification.
Three models are developed to address the problems of conventional classification models. They are based on genetic programming (GP) and are proposed to incorporate the advantages of both the discriminant-based and the induction-based methods.
The first model employs GP to build a classification model. We employ GP because it can automatically and heuristically determine adequate discriminant functions and the valid attributes simultaneously. In addition, unlike artificial neural networks (ANNs), which are suited only to large data sets, GP can perform well even on small data sets.
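As a rough illustration of how a single expression tree can encode both a discriminant function and the attributes it uses, the following sketch evaluates a hand-written tree. The tree, the attribute names (`income`, `debt`, `age`), and the sign-based class rule are assumptions made for illustration only; they are not taken from the dissertation, and no evolutionary search is performed here.

```python
import operator

# Binary operators allowed at internal nodes (an assumed, minimal function set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, record):
    """Recursively evaluate an expression tree on one record (a dict of attributes)."""
    if isinstance(tree, tuple):          # internal node: (op, left, right)
        op, left, right = tree
        return OPS[op](evaluate(left, record), evaluate(right, record))
    if isinstance(tree, str):            # leaf: attribute name
        return record[tree]
    return tree                          # leaf: numeric constant

def used_attributes(tree):
    """Collect the attribute names that appear in the tree."""
    if isinstance(tree, tuple):
        return used_attributes(tree[1]) | used_attributes(tree[2])
    return {tree} if isinstance(tree, str) else set()

def classify(tree, record):
    """Assumed decision rule: class 1 if the discriminant score is non-negative."""
    return 1 if evaluate(tree, record) >= 0 else 0

# Hypothetical tree standing in for one that GP might evolve:
# f(x) = income * 0.5 - debt - 10
TREE = ("-", ("-", ("*", "income", 0.5), "debt"), 10)
applicant = {"income": 60, "debt": 15, "age": 40}

score = evaluate(TREE, applicant)
label = classify(TREE, applicant)
attrs = used_attributes(TREE)   # "age" is absent: the tree itself selects attributes
```

Because attributes appear only as leaves, any attribute that a tree does not reference is effectively excluded from the model, which is one way to read the claim that the function and the attribute subset are determined together.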
The second model, IF-THEN ruled genetic programming (IF-THEN GP), is based on the principle of “divide and conquer.” A cut threshold is set, the indiscernible data set is retrained with GP to form the second discriminant function, and further discriminant functions are obtained in the same way.
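One possible reading of this divide-and-conquer scheme is sketched below. It assumes that a record is "indiscernible" when its discriminant score lies within the cut threshold of zero, and it uses hand-written functions `f1` and `f2` as hypothetical stand-ins for GP-evolved discriminant functions. The abstract does not specify these details.

```python
def split_indiscernible(f, records, threshold):
    """Separate records that f scores confidently from those inside the cut band."""
    decided = [r for r in records if abs(f(r)) >= threshold]
    unclear = [r for r in records if abs(f(r)) < threshold]
    return decided, unclear

def cascade_classify(functions, record, threshold):
    """Try each discriminant function in order and accept the first confident score.

    Returns (label, stage). The last function always decides, even if its
    score falls inside the cut band.
    """
    for stage, f in enumerate(functions):
        score = f(record)
        is_last = stage == len(functions) - 1
        if abs(score) >= threshold or is_last:
            return (1 if score >= 0 else 0), stage

# Hypothetical stand-ins for the first and second evolved functions.
f1 = lambda r: r["income"] - 2 * r["debt"]
f2 = lambda r: r["savings"] - 5

records = [
    {"income": 50, "debt": 10, "savings": 0},   # f1 = 30: decided at stage 0
    {"income": 21, "debt": 10, "savings": 2},   # f1 = 1: inside the band, passed on
]
decided, unclear = split_indiscernible(f1, records, threshold=5)
results = [cascade_classify([f1, f2], r, threshold=5) for r in records]
```

In a full implementation, the `unclear` subset would be the training data for the next GP run, so each later function specializes in the cases the earlier ones could not separate.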
To combine the advantages of the discriminant-based and the induction-based methods, the third model we propose is two-stage genetic programming (2SGP), a hybrid model that integrates the function-based and the induction-based methods. First, the IF-THEN rules are derived using GP. Next, the reduced data are fed into GP again to form the discriminant function, which provides the capability of forecasting.
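The two stages can be pictured with the minimal sketch below. It assumes that "reduced data" means the records not covered by any IF-THEN rule, which is only one plausible interpretation; the rules, the discriminant function, and the attribute names are hypothetical and hand-written rather than derived by GP.

```python
def first_matching_rule(rules, record):
    """Return the label of the first rule whose condition holds, or None."""
    for condition, label in rules:
        if condition(record):
            return label
    return None

def reduce_data(rules, records):
    """Keep only the records that no rule covers (assumed meaning of 'reduced data')."""
    return [r for r in records if first_matching_rule(rules, r) is None]

def two_stage_predict(rules, discriminant, record):
    """Stage 1: IF-THEN rules. Stage 2: discriminant function for uncovered records."""
    label = first_matching_rule(rules, record)
    if label is not None:
        return label, "rule"
    return (1 if discriminant(record) >= 0 else 0), "function"

# Hypothetical rules: IF debt > 50 THEN class 0; IF income > 100 THEN class 1.
rules = [
    (lambda r: r["debt"] > 50, 0),
    (lambda r: r["income"] > 100, 1),
]
# Hypothetical discriminant function for the records the rules leave uncovered.
discriminant = lambda r: r["income"] - 2 * r["debt"]

data = [
    {"income": 30, "debt": 60},
    {"income": 120, "debt": 10},
    {"income": 40, "debt": 15},
]
remaining = reduce_data(rules, data)
predictions = [two_stage_predict(rules, discriminant, r) for r in data]
```

The rules give readable decisions for the records they cover, while the discriminant function supplies a prediction for everything else.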
In addition, we used two credit-scoring data sets to test the effectiveness of the proposed models and to compare them with conventional methods, including the multi-layer perceptron (MLP), classification and regression trees (CART), C4.5, rough sets, and logistic regression (LR). On the basis of the numerical results, we conclude that the proposed methods outperform the other models and should be more suitable for real-life classification problems.
Subjects
Knowledge discovery (知識發現)
Data mining (資料挖掘)
Classification model (分類模型)
Genetic programming (基因規劃)
Credit scoring (信用計分)
Classification models
genetic programming
artificial neural networks (ANNs)
decision tree
rough sets
logistic regression
Type
other
File(s)
Name
ntu-95-D91725010-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):05136d5dcab26fc130f8a18c9dcc319a