Mining of Frequent Temporal Patterns on Data Streams
Date Issued
2004
Date
2004
Author(s)
Teng, Wei-Guang
DOI
en-US
Abstract
In recent years, several query problems and mining capabilities have been explored for a data stream environment. Among various data mining capabilities, the one receiving a significant amount of research attention is on mining frequent patterns over market basket data. In this
dissertation, we first explore the model of frequent itemsets from static transaction databases and generalize relevant concepts to discovering of temporal relationship from online transaction flows. Then, we investigate the resource utilization issues in a data stream environment. Finally, we study the problem of quality guarantees when tracking online data streams.
For the problem of mining frequent itemsets to derive association rules, a new mining capability, called mining of substitution rules, is first developed by extending the concepts of mining of association rules. Substitution refers to the choice made by a customer to replace the purchase of some items with that of others. The discovery of substitution rules, same as that of association rules, will lead to very valuable knowledge in various aspects, including market prediction, user behavior analysis and decision support. Specifically, we first derive theoretical properties for the model of substitution rule mining and devise a technique on the induction of positive itemset supports to improve the efficiency of support counting for negative itemsets. Then, in light of these properties, algorithm SRM (standing for substitution rule mining) is designed and implemented to discover the substitution rules efficiently while attaining good statistical significance.
To mine frequent temporal patterns on data streams, a regression-based algorithm, called algorithm FTP-DS (Frequent Temporal Patterns of Data Streams) is devised. While providing a general framework of pattern frequency counting, algorithm FTP-DS has two major features, namely one data scan for online statistics collection and regression-based compact pattern representation. To attain the feature of one data scan, the data segmentation and the pattern growth scenarios are explored for the frequency counting purpose. Algorithm FTP-DS scans online transaction flows and generates candidate frequent patterns in real time. The second important feature of algorithm FTP-DS is on the regression-based compact pattern representation.
In addition, we develop the techniques of the segmentation tuning and segment relaxation to enhance the functions of FTP-DS. With these features, algorithm FTP-DS is able to not only conduct mining with variable time intervals but also perform trend detection effectively.
The fundamental problem that how the limited resources, e.g., memory space and computation power, can be well utilized to produce accurate estimates in a data stream environment is also addressed. Two important features for tracking mined patterns with properly utilized resources are examined. The first issue is temporal granularity which refers to the phenomenon that as time advances, people are more interested in recent events, meaning that more resources can be utilized to explore more recent data with finer granularities. Second, with the mining task of discovering frequent temporal patterns, more resources are expected to be allocated to the processing of those borderline patterns whose statistics, e.g., occurrence frequencies, are close to the specified threshold so as to have proper frequent itemset identification. This feature is
called mining with support count granularity. Consequently, a wavelet-based algorithm, called algorithm RAM-DS (Resource-Aware Mining for Data Streams) is devised to perform general pattern mining tasks for data streams by exploring both temporal and support count granularities.
Algorithm RAM-DS is designed to not only reduce the memory required for data storage but also retain good approximation of target time series. In addition, algorithm RAM-DS can support a varying number of data streams by allocating memory space adaptively when tracking
patterns generated from online transactions.
For tracking online time series data which is directly collected from sensors or is generated by stream mining algorithms, we explore the energy preservation property of wavelet-based transform. The commonly used L1- and L2-error metrics are theoretically guaranteed when insignificant coefficients are discarded for saving precious resources in our framework. In addition, to handle infinite online data flows, an enhanced data structure RAID-tree which is based on the error tree is proposed for dynamic synopses maintenance over data streams. Specifically, an algorithm RAID with the resolution adaptability for incremental decomposition is developed. Experimental results have shown that the memory required for storing significant features of time series data is very small and the quality of approximation is stable when performing incremental
data updates.
dissertation, we first explore the model of frequent itemsets from static transaction databases and generalize relevant concepts to discovering of temporal relationship from online transaction flows. Then, we investigate the resource utilization issues in a data stream environment. Finally, we study the problem of quality guarantees when tracking online data streams.
For the problem of mining frequent itemsets to derive association rules, a new mining capability, called mining of substitution rules, is first developed by extending the concepts of mining of association rules. Substitution refers to the choice made by a customer to replace the purchase of some items with that of others. The discovery of substitution rules, same as that of association rules, will lead to very valuable knowledge in various aspects, including market prediction, user behavior analysis and decision support. Specifically, we first derive theoretical properties for the model of substitution rule mining and devise a technique on the induction of positive itemset supports to improve the efficiency of support counting for negative itemsets. Then, in light of these properties, algorithm SRM (standing for substitution rule mining) is designed and implemented to discover the substitution rules efficiently while attaining good statistical significance.
To mine frequent temporal patterns on data streams, a regression-based algorithm, called algorithm FTP-DS (Frequent Temporal Patterns of Data Streams) is devised. While providing a general framework of pattern frequency counting, algorithm FTP-DS has two major features, namely one data scan for online statistics collection and regression-based compact pattern representation. To attain the feature of one data scan, the data segmentation and the pattern growth scenarios are explored for the frequency counting purpose. Algorithm FTP-DS scans online transaction flows and generates candidate frequent patterns in real time. The second important feature of algorithm FTP-DS is on the regression-based compact pattern representation.
In addition, we develop the techniques of the segmentation tuning and segment relaxation to enhance the functions of FTP-DS. With these features, algorithm FTP-DS is able to not only conduct mining with variable time intervals but also perform trend detection effectively.
The fundamental problem that how the limited resources, e.g., memory space and computation power, can be well utilized to produce accurate estimates in a data stream environment is also addressed. Two important features for tracking mined patterns with properly utilized resources are examined. The first issue is temporal granularity which refers to the phenomenon that as time advances, people are more interested in recent events, meaning that more resources can be utilized to explore more recent data with finer granularities. Second, with the mining task of discovering frequent temporal patterns, more resources are expected to be allocated to the processing of those borderline patterns whose statistics, e.g., occurrence frequencies, are close to the specified threshold so as to have proper frequent itemset identification. This feature is
called mining with support count granularity. Consequently, a wavelet-based algorithm, called algorithm RAM-DS (Resource-Aware Mining for Data Streams) is devised to perform general pattern mining tasks for data streams by exploring both temporal and support count granularities.
Algorithm RAM-DS is designed to not only reduce the memory required for data storage but also retain good approximation of target time series. In addition, algorithm RAM-DS can support a varying number of data streams by allocating memory space adaptively when tracking
patterns generated from online transactions.
For tracking online time series data which is directly collected from sensors or is generated by stream mining algorithms, we explore the energy preservation property of wavelet-based transform. The commonly used L1- and L2-error metrics are theoretically guaranteed when insignificant coefficients are discarded for saving precious resources in our framework. In addition, to handle infinite online data flows, an enhanced data structure RAID-tree which is based on the error tree is proposed for dynamic synopses maintenance over data streams. Specifically, an algorithm RAID with the resolution adaptability for incremental decomposition is developed. Experimental results have shown that the memory required for storing significant features of time series data is very small and the quality of approximation is stable when performing incremental
data updates.
Subjects
資料串流
小波轉換
頻繁時間樣式
資料探勘
Frequent Temporal Pattern
Data Stream
Data Mining
Wavelet Transform
Type
thesis