On Feasibility-Oriented Mining of Frequent Patterns
Date Issued
2006
Date
2006
Author(s)
Chuang, Kun-Ta
DOI
en-US
Abstract
Since the early work on the Apriori algorithm, a broad spectrum of topics in mining frequent patterns has been studied. While the proposed techniques are important steps toward integrating association mining with other real-life requirements, providing feasibility-oriented models for mining frequent patterns, which enable easy-to-use, low-cost, high-efficiency, and realistic mining applications, remains a challenging issue.
In view of this, we explore in this dissertation novel algorithms for mining top-k (closed) itemsets in the presence of a memory constraint. As opposed to most previous works, which concentrate on improving mining efficiency or on reducing memory usage on a best-effort basis, we make a first attempt to specify an upper bound on the memory that can be used for mining frequent itemsets. To comply with this memory constraint, two efficient algorithms, called MTK and MTK_Close, are devised for mining frequent itemsets and closed itemsets, respectively, without requiring the subtle minimum support to be specified. Instead, users only need to give a more human-understandable parameter, namely the desired number of frequent (closed) itemsets k.
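To illustrate what it means to ask for the k most frequent itemsets instead of a minimum support, the following is a minimal brute-force sketch. It is not MTK or MTK_Close and does not enforce any memory bound; the function name `top_k_itemsets`, the `max_len` cap, and the toy transactions are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, max_len=3):
    """Return the k itemsets with the highest support counts.

    Brute-force illustration only: it enumerates every itemset of up to
    max_len items and does NOT respect any memory bound, unlike MTK.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_len + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    # Sort by descending support, then lexicographically for a stable order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


# Hypothetical toy data: the user supplies k, never a minimum support.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
top3 = top_k_itemsets(transactions, k=3)
```

The only user-facing parameter here is k. The support threshold is implied by the k-th ranked itemset rather than chosen in advance.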
Furthermore, a sampling model called feature preserved sampling (FPS), which sequentially generates a high-quality sample over sliding windows, is developed. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. FPS has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportion differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ε, while the relative proportion differences of the remaining attribute values are as close to ε as possible, which ensures that the generated sample is of high quality; (4) the sample rate is close to the user-specified rate, so that a high-quality sampling result can be obtained without increasing the sample size; (5) FPS preserves well the population proportion of multivariate statistics in the sample; and (6) FPS can be applied equally to infinite streams and finite datasets, and the generated samples can be used for various applications.
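The quality measure described above can be sketched as follows. This is only an illustration of checking a given sample against the threshold ε, not the FPS sampling procedure itself. The formula |sample proportion - population proportion| / population proportion is an assumed reading of "relative proportion difference", and the function names are hypothetical.

```python
from collections import Counter


def relative_proportion_differences(population, sample):
    """Relative difference between sample and population proportions.

    Assumed definition: |p_sample - p_population| / p_population,
    computed for every distinct value that occurs in the population.
    """
    if not population or not sample:
        raise ValueError("population and sample must be non-empty")
    pop_counts = Counter(population)
    sample_counts = Counter(sample)
    diffs = {}
    for value, pop_count in pop_counts.items():
        pop_prop = pop_count / len(population)
        sample_prop = sample_counts[value] / len(sample)
        diffs[value] = abs(sample_prop - pop_prop) / pop_prop
    return diffs


def values_within_threshold(diffs, epsilon):
    """Values whose relative proportion difference is below epsilon."""
    return {value for value, diff in diffs.items() if diff < epsilon}


# Hypothetical window of one attribute and a sample drawn from it.
population = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
sample = ["a", "a", "a", "b", "c"]
diffs = relative_proportion_differences(population, sample)
within = values_within_threshold(diffs, epsilon=0.5)
```

In this toy window the rare value "c" is over-represented in the sample, so it falls outside the threshold while "a" and "b" stay within it.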
We next investigate an important characteristic of real datasets, named the itemset support distribution, to provide a better understanding of such datasets. The itemset support distribution refers to the distribution of the count of itemsets versus the itemset support. Importantly, from observations on various retail datasets, and as validated by our empirical studies later, we find that a power-law relationship indeed appears in the itemset support distribution, which we can characterize as a Zipf distribution.
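As a rough illustration of the itemset support distribution, the sketch below counts how many itemsets share each support value and estimates a power-law exponent by ordinary least squares on the log-log points. This is a generic fitting technique chosen for illustration, not the method used in the dissertation; the function names and toy supports are assumptions.

```python
import math
from collections import Counter


def support_distribution(itemset_supports):
    """Map each support count to the number of itemsets having it."""
    return Counter(itemset_supports.values())


def log_log_slope(distribution):
    """Least-squares slope of log(itemset count) versus log(support).

    A slope near a negative constant on the log-log scale is what a
    power-law relationship would look like. Supports must be positive.
    """
    points = [(math.log(s), math.log(n)) for s, n in distribution.items()]
    if len(points) < 2:
        raise ValueError("need at least two distinct support values")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical supports: few itemsets are frequent, many are rare.
itemset_supports = {
    ("a",): 4,
    ("b",): 2,
    ("c",): 2,
    ("a", "b"): 1,
    ("a", "c"): 1,
    ("b", "c"): 1,
    ("a", "b", "c"): 1,
}
dist = support_distribution(itemset_supports)
slope = log_log_slope(dist)
```

In this toy data the number of itemsets doubles each time the support halves, so the fitted log-log slope is close to -1.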
Since it is prohibitively expensive to retrieve a large number of itemsets before we identify the characteristics of the itemset support distribution in the targeted data, we also propose a valid and cost-effective algorithm, called PPL, to extract the characteristics of the itemset support distribution. Furthermore, to fully exploit the advantages of our discovery, we also propose novel mechanisms, built on PPL, to solve two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining a sufficient sample size for mining frequent patterns.
In this dissertation, we also attempt to answer an important question: "What patterns will be frequent in the future?" Such patterns, referred to as prospective frequent patterns, are very informative to end users, because many cross-selling strategies in real cases rely on the precise prediction of the frequent patterns that will appear. Since no naive extension of previous works can effectively obtain the desired result, we propose the PFP framework to precisely predict prospective frequent patterns while also predicting their supports.
Subjects
資料探勘
頻繁集
資料取樣
記憶體
Data Mining
Frequent Patterns
Data Sampling
Memory
Type
thesis
File(s)
Name
ntu-95-F89921134-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):1206b561cccc2c6c88d6b369d90795e0
