Adaptive clustering for multiple evolving streams
Resource
IEEE Transactions on Knowledge and Data Engineering
Journal
IEEE Transactions on Knowledge and Data Engineering
Journal Volume
18
Journal Issue
9
Pages
1166-1180
Date Issued
2006-09
Author(s)
DOI
246246/200611150121715
Abstract
In a data stream environment, the patterns generated at different time instants differ because the data evolve. As time progresses, the behavior and membership of clusters usually change, so clustering continuous data streams allows us to observe changes in group behavior. To support flexible clustering requirements, we devise in this paper a Clustering on Demand framework, abbreviated as the COD framework, to dynamically cluster multiple data streams. While providing a general framework for clustering multiple data streams, the COD framework has two advantageous features, namely, a single data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and space constraints of a data stream environment. The COD framework consists of two phases: the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams at multiple resolutions in time linear in both the number of streams and the number of data points in each stream. For the offline phase, an adaptive clustering algorithm is devised to retrieve approximations of the desired substreams from the summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results, at the cost of slightly more time and twice the storage space of the wavelet-based one. An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model for building the summary hierarchy. With the adaptive COD, we can obtain clustering results of almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.
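To make the idea of a one-scan, multiresolution summary hierarchy concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a Haar-style scheme that keeps only pairwise averages at each level; the class name SummaryHierarchy and its attributes are hypothetical, chosen for illustration.

```python
class SummaryHierarchy:
    """Illustrative multiresolution summary of a single stream.

    Assumption: each level stores pairwise averages of the level below
    (the averaging half of a Haar transform). This is a sketch, not the
    wavelet-based or regression-based structure defined in the paper.
    """

    def __init__(self):
        self.levels = []   # levels[k]: averages over windows of 2**(k+1) points
        self.pending = []  # pending[k]: unpaired value waiting at level k, or None

    def append(self, value):
        """Consume one data point; each point is read exactly once."""
        level = 0
        carry = float(value)
        while True:
            if level == len(self.pending):
                # First time this resolution is reached: create it.
                self.pending.append(None)
                self.levels.append([])
            if self.pending[level] is None:
                # No partner yet: park the value and stop.
                self.pending[level] = carry
                return
            # A partner exists: average the pair and pass it upward.
            carry = (self.pending[level] + carry) / 2.0
            self.pending[level] = None
            self.levels[level].append(carry)
            level += 1


hierarchy = SummaryHierarchy()
for point in [1, 3, 5, 7]:
    hierarchy.append(point)

print(hierarchy.levels[0])  # [2.0, 6.0]
print(hierarchy.levels[1])  # [4.0]
```

In this sketch a point triggers work at a higher level only when a pair completes, so the total update cost over n points stays linear in n, and a clustering query could read a coarser or finer level depending on the resolution it needs.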
Subjects
Data mining
clustering of multiple data streams
time-series clustering
Publisher
Taipei: National Taiwan University, Dept. of Electrical Engineering
Type
journal article
File(s)
Name
687.pdf
Size
2.31 MB
Format
Adobe PDF
Checksum
(MD5):17645df45749e0d22a7b75bf3512018f
