Decision Trees for Mining Data Streams Based on the McDiarmid's Bound. Rutkowski, L., Pietruczuk, L., Duda, P., & Jaworski, M. IEEE Transactions on Knowledge and Data Engineering, 25(6):1272–1279, June, 2013. Conference Name: IEEE Transactions on Knowledge and Data Engineering
doi  abstract   bibtex   
In mining data streams the most popular tool is the Hoeffding tree algorithm. It uses the Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature the same Hoeffding's bound was used for any evaluation function (heuristic measure), e.g., information gain or Gini index. In this paper, it is shown that the Hoeffding's inequality is not appropriate to solve the underlying problem. We prove two theorems presenting the McDiarmid's bound for both the information gain, used in ID3 algorithm, and for Gini index, used in Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system, applied to data streams and based on the McDiarmid's bound, has the property that its output is nearly identical to that of a conventional learner. The results of the paper have a great impact on the state of the art of mining data streams and various developed so far methods and algorithms should be reconsidered.
@article{rutkowski_decision_2013,
	title = {Decision {Trees} for {Mining} {Data} {Streams} {Based} on the {McDiarmid}'s {Bound}},
	volume = {25},
	issn = {1558-2191},
	doi = {10.1109/TKDE.2012.66},
	abstract = {In mining data streams the most popular tool is the Hoeffding tree algorithm. It uses the Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature the same Hoeffding's bound was used for any evaluation function (heuristic measure), e.g., information gain or Gini index. In this paper, it is shown that the Hoeffding's inequality is not appropriate to solve the underlying problem. We prove two theorems presenting the McDiarmid's bound for both the information gain, used in ID3 algorithm, and for Gini index, used in Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system, applied to data streams and based on the McDiarmid's bound, has the property that its output is nearly identical to that of a conventional learner. The results of the paper have a great impact on the state of the art of mining data streams and various developed so far methods and algorithms should be reconsidered.},
	number = {6},
	journal = {IEEE Transactions on Knowledge and Data Engineering},
	author = {Rutkowski, Leszek and Pietruczuk, Lena and Duda, Piotr and Jaworski, Maciej},
	month = jun,
	year = {2013},
	note = {Conference Name: IEEE Transactions on Knowledge and Data Engineering},
	keywords = {Data mining, Data streams, Decision trees, Entropy, Gain measurement, Gini index, Hoeffding's bound, Indexes, Learning systems, McDiarmid's bound, Random variables, decision trees, information gain},
	pages = {1272--1279},
}

Downloads: 0