A Heuristic Data Sampling Approach for Data Mining Preparation under Big Data Environment

Peng, Huai-De

Date Issued

2014

Author(s)

Peng, Huai-De

URI

http://ntur.lib.ntu.edu.tw//handle/246246/263492

Abstract

Big data has become a highly popular topic in both industry and academia. It has several defining characteristics: extreme scale, varied data sources, and incremental growth. These characteristics make big data harder to analyze, and classic data mining techniques are likely to become infeasible. In particular, processing may be too slow to produce analytic results in time. Furthermore, a technique may fail to generate any result at all because the numbers of objects and attributes are too large. In this study, we use classification based on association rules as our data mining technique. Without changing the existing data mining method, we address the big data problem through data preparation, integration, and evaluation. The proposed algorithm consists of two parts. The first part is a heuristic sampling method applied in the initial phase. It samples data that are representative of the big data population and then selects attributes that are important and discriminative. The sampling result can then be passed to subsequent data mining techniques. To handle different class distributions, an undersampling method can be applied for specific rare classes so that the corresponding rules are generated. The second part deals with the incremental problem. Using the data sampled in the initial phase from both the preliminary data and the incremental data, together with their classifiers, we merge the data and use them to verify the combined classifier. After pruning invalid rules and ranking all rules, we obtain the final modified classifier, which can then be applied to other data in the population. When the proposed algorithm is applied to data mining in a big data environment, it generates results comparable to those obtained using the whole dataset. Moreover, the processing time is significantly reduced, so analytic results can be obtained in time for further applications.
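
The abstract does not specify the sampling heuristic itself, so the following is only an illustrative sketch of the general idea of class-aware sampling and undersampling, not the thesis's method. It swaps in plain stratified random sampling and random undersampling of the non-rare classes. The function names `stratified_sample` and `undersample_majority`, and the parameters `fraction`, `min_per_class`, and `ratio`, are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, label_of, fraction, min_per_class=1, seed=0):
    """Draw about `fraction` of each class, keeping at least
    `min_per_class` rows per class (assumed heuristic, not from the thesis)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    sample = []
    for rows in by_class.values():
        k = max(min_per_class, round(len(rows) * fraction))
        k = min(k, len(rows))  # never ask for more rows than the class has
        sample.extend(rng.sample(rows, k))
    return sample


def undersample_majority(records, label_of, rare_label, ratio=1.0, seed=0):
    """Keep every rare-class row and at most `ratio` times that many
    rows drawn at random from all other classes."""
    rng = random.Random(seed)
    rare = [r for r in records if label_of(r) == rare_label]
    others = [r for r in records if label_of(r) != rare_label]
    k = min(len(others), int(len(rare) * ratio))
    return rare + rng.sample(others, k)
```

With 900 rows of one class and 100 of another, `stratified_sample(..., fraction=0.1)` keeps the 9:1 class proportions in a sample one tenth the size, while `undersample_majority` with `ratio=1.0` returns a balanced set so that a rule learner sees the rare class as often as the rest.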

Subjects

Big data

Incremental data

Data mining

Data classification

Data preparation

Data sampling

Attribute selection

Type

thesis

File(s)

Name

ntu-103-R01725017-1.pdf

Size

23.32 KB

Format

Adobe PDF

Checksum

(MD5):3bb0a95edb8d74974234c6d376b00090
