On Feasibility-Oriented Mining of Frequent Patterns

Chuang, Kun-Ta

On Feasibility-Oriented Mining of Frequent Patterns

Date Issued

2006

Date

2006

Author(s)

Chuang, Kun-Ta

DOI

en-US

URI

http://ntur.lib.ntu.edu.tw//handle/246246/58587

Abstract

Since the early work in algorithm Apriori, a broad spectrum of topics in mining frequent patterns has been studied. While those proposed techniques are important results toward the integration of mining association mining and other real-life requirements, how to provide feasibility-oriented models for mining frequent patterns, to enable easy-use, low-cost, high-efficiency, and realistic mining applications, still remains as a challenging issue. In view of this, we explore in this dissertation a novel algorithm of mining top-k (closed) itemsets in the presence of the memory constraint. As opposed to most previous works that concentrate on improving the mining efficiency or on reducing the memory size by best effort, we first attempt to specify the available upper memory size that can be utilized by mining frequent itemsets. While complying with the requirement of the memory constraint, two efficient algorithms, called MTK and MTK_Close, were thus devised for mining frequent itemsets and closed itemsets, respectively, without specifying the subtle minimum support. Instead, users only need to give a more human-understandable parameter, namely the desired number of frequent (closed) itemsets k. Furthermore, a sampling model, called feature preserved sampling (FPS) that sequentially generates a high-quality sample over sliding windows, is developed. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. FPS has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportional differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ε, while the relative proportion differences of the remaining attribute values are as close to ε as possible, which ensures that the generated sample is of high-quality; (4) the sample rate is close to the user specified rate so that a high-quality sampling result can be obtained without increasing the sample size; (5) FPS can excellently preserve the population proportion of multivariate statistics in the sample; and (6) FPS can be applied to infinite streams and finite datasets equally, and the generated samples can be used for various applications. We next investigate an important characteristic in real datasets, named the itemset support distribution, to provide better understanding on real datasets. The itemset support distribution refers to the distribution of the count of itemsets versus the itemset support. Importantly, from observations on various retail datasets and as validated by our empirical studies later, we find that the power-law relationship indeed appears in the itemset support distribution and we can characterize that as a Zipf distribution. Since it is prohibitively expensive to retrieve lots of itemsets before we identify the characteristics of the itemset support distribution in targeted data, we also propose a valid and cost-effective algorithm, called algorithm PPL, to extract characteristics of the itemset support distribution. Furthermore, to fully explore the advantages of our discovery, we also propose novel mechanisms with the help of PPL to solve two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining the sufficient sample size for mining frequent patterns. In this dissertation, we also attempt to answer an important question: "What patterns will be frequent in the future?" Such a kind of patterns, referred to as prospective frequent patterns, is very informative to end-users, because many cross-selling strategies in real cases rely on the precise prediction of frequent patterns that will appear. Since any naive extension of previous works cannot effectively obtain the desired result, we proposed the framework of PFP, to precisely predict prospective frequent patterns while also predicting their supports.

Subjects

資料探勘

頻繁集

資料取樣

記憶體

Data Mining

Frequent Patterns

Data Sampling

Memory

Type

thesis

File(s)

Name

ntu-95-F89921134-1.pdf

Size

23.31 KB

Format

Adobe PDF

Checksum

(MD5):1206b561cccc2c6c88d6b369d90795e0

On Feasibility-Oriented Mining of Frequent Patterns

關於 (About)

聯絡資訊 (Contact Us)

相關網站 (Useful Links)

關於開放取用 (Open Access, OA)

出版社期刊論文授權政策 (Copyright)

使用說明 (Instructions)

登入說明 (Sign-in)

匯入著作 (Submission)