Mining time-changing data streams

Mining time-changing data streams. Hulten, G., Spencer, L., & Domingos, P. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, of KDD '01, pages 97–106, New York, NY, USA, August, 2001. Association for Computing Machinery.

Paper doi abstract bibtex

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.

@inproceedings{hulten_mining_2001,
	address = {New York, NY, USA},
	series = {{KDD} '01},
	title = {Mining time-changing data streams},
	isbn = {978-1-58113-391-2},
	url = {https://doi.org/10.1145/502512.502529},
	doi = {10.1145/502512.502529},
	abstract = {Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.},
	urldate = {2022-03-16},
	booktitle = {Proceedings of the seventh {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},
	publisher = {Association for Computing Machinery},
	author = {Hulten, Geoff and Spencer, Laurie and Domingos, Pedro},
	month = aug,
	year = {2001},
	keywords = {Decision trees, Hoeffding bounds, concept drift, data streams, incremental learning, subsampling},
	pages = {97--106},
}

Downloads: 0

{"_id":"BLQxKycZMLnXreE2k","bibbaseid":"hulten-spencer-domingos-miningtimechangingdatastreams-2001","author_short":["Hulten, G.","Spencer, L.","Domingos, P."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","address":"New York, NY, USA","series":"KDD '01","title":"Mining time-changing data streams","isbn":"978-1-58113-391-2","url":"https://doi.org/10.1145/502512.502529","doi":"10.1145/502512.502529","abstract":"Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.","urldate":"2022-03-16","booktitle":"Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining","publisher":"Association for Computing Machinery","author":[{"propositions":[],"lastnames":["Hulten"],"firstnames":["Geoff"],"suffixes":[]},{"propositions":[],"lastnames":["Spencer"],"firstnames":["Laurie"],"suffixes":[]},{"propositions":[],"lastnames":["Domingos"],"firstnames":["Pedro"],"suffixes":[]}],"month":"August","year":"2001","keywords":"Decision trees, Hoeffding bounds, concept drift, data streams, incremental learning, subsampling","pages":"97–106","bibtex":"@inproceedings{hulten_mining_2001,\n\taddress = {New York, NY, USA},\n\tseries = {{KDD} '01},\n\ttitle = {Mining time-changing data streams},\n\tisbn = {978-1-58113-391-2},\n\turl = {https://doi.org/10.1145/502512.502529},\n\tdoi = {10.1145/502512.502529},\n\tabstract = {Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.},\n\turldate = {2022-03-16},\n\tbooktitle = {Proceedings of the seventh {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},\n\tpublisher = {Association for Computing Machinery},\n\tauthor = {Hulten, Geoff and Spencer, Laurie and Domingos, Pedro},\n\tmonth = aug,\n\tyear = {2001},\n\tkeywords = {Decision trees, Hoeffding bounds, concept drift, data streams, incremental learning, subsampling},\n\tpages = {97--106},\n}\n\n\n\n","author_short":["Hulten, G.","Spencer, L.","Domingos, P."],"key":"hulten_mining_2001","id":"hulten_mining_2001","bibbaseid":"hulten-spencer-domingos-miningtimechangingdatastreams-2001","role":"author","urls":{"Paper":"https://doi.org/10.1145/502512.502529"},"keyword":["Decision trees","Hoeffding bounds","concept drift","data streams","incremental learning","subsampling"],"metadata":{"authorlinks":{}},"html":""},"bibtype":"inproceedings","biburl":"https://bibbase.org/zotero/mh_lenguyen","dataSources":["iwKepCrWBps7ojhDx"],"keywords":["decision trees","hoeffding bounds","concept drift","data streams","incremental learning","subsampling"],"search_terms":["mining","time","changing","data","streams","hulten","spencer","domingos"],"title":"Mining time-changing data streams","year":2001}