A Heuristic Data Sampling Approach for Data Mining Preparation under Big Data Environment
Date Issued
2014
Date
2014
Author(s)
Peng, Huai-De
Abstract
Big data has become a highly popular topic in both industry and academia. It has several defining characteristics: extreme scale, varied data sources, and incremental growth. These characteristics make big data harder to analyze, and classic data mining techniques are likely to be infeasible. In particular, the processing time may be too long to produce analytic results in time. Furthermore, the analysis may fail to produce any result at all because the numbers of objects and attributes are too large.
In this study, we use classification based on association rules as our data mining technique. Without changing the existing data mining method, we address the problems of big data through data preparation, integration, and evaluation.
The proposed algorithm consists of two parts. The first part is a heuristic sampling method applied in the initial phase: it samples data that are representative of the big data population and then selects attributes that are important and discriminative. The sampling result can then be passed to subsequent data mining techniques. To handle different class distributions, an undersampling method can be applied to specific rare classes to generate the corresponding rules. The second part deals with the incremental problem. Using the data sampled in the initial phase from both the preliminary data and the incremental data, together with their classifiers, we merge the data and use the merged data to verify the combined classifier. After pruning invalid rules and ranking all remaining rules, we obtain the final modified classifier, which can then be applied to other data in the population.
Applying the proposed algorithm to data mining in a big data environment yields results comparable to those obtained using the whole dataset. Moreover, the processing time is significantly reduced, so the analytic results can be obtained in time for further applications.
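The two parts described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`undersample`, `rule_stats`, `merge_classifiers`), the rule representation (a dictionary of attribute conditions paired with a class label), the `ratio` parameter, and the `min_conf` pruning threshold are all assumptions made for illustration. The sketch uses plain random undersampling, not the heuristic sampling method of the thesis, and treats a rule as "invalid" when its confidence on the merged sample falls below the threshold.

```python
import random


def undersample(rows, label_key, rare_label, ratio=1.0, seed=0):
    """Keep every row of the rare class and a random subset of the other rows.

    `ratio` (an assumed parameter) is the number of non-rare rows kept
    per rare row.
    """
    rng = random.Random(seed)
    rare = [r for r in rows if r[label_key] == rare_label]
    rest = [r for r in rows if r[label_key] != rare_label]
    keep = min(len(rest), int(len(rare) * ratio))
    return rare + rng.sample(rest, keep)


def rule_stats(rule, rows, label_key):
    """Return (support, confidence) of a class association rule on `rows`.

    A rule is a pair (conditions, label), where `conditions` maps
    attribute names to required values.
    """
    conditions, label = rule
    matched = [
        r for r in rows
        if all(r.get(k) == v for k, v in conditions.items())
    ]
    correct = sum(1 for r in matched if r[label_key] == label)
    support = correct / len(rows) if rows else 0.0
    confidence = correct / len(matched) if matched else 0.0
    return support, confidence


def merge_classifiers(rules_a, rules_b, rows, label_key, min_conf=0.6):
    """Combine two rule lists, verify them on the merged sample `rows`,
    prune rules below `min_conf`, and rank the rest by confidence,
    then support (both descending)."""
    seen = set()
    scored = []
    for conditions, label in rules_a + rules_b:
        key = (tuple(sorted(conditions.items())), label)
        if key in seen:
            continue  # skip rules present in both classifiers
        seen.add(key)
        support, confidence = rule_stats((conditions, label), rows, label_key)
        if confidence >= min_conf:
            scored.append((confidence, support, conditions, label))
    scored.sort(key=lambda t: (-t[0], -t[1]))
    return [(conditions, label) for _, _, conditions, label in scored]
```

Here `rows` would be the union of the samples drawn from the preliminary and the incremental data, and `rules_a` and `rules_b` the rule lists of their respective classifiers.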
Subjects
Big data
Incremental data
Data mining
Data classification
Data preparation
Data sampling
Attribute selection
Type
thesis
File(s)
Name
ntu-103-R01725017-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):3bb0a95edb8d74974234c6d376b00090
