Online Learning in Markov Decision Processes with Changing Cost Sequences. Dick, T.; György, A.; and Szepesvári, C. In ICML, pages 512–520, January 2014.
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full- and bandit-information feedback. We view this problem as an instance of online linear optimization and propose two methods for it: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques together with near-optimal regret bounds; in particular, we account for the computational cost of performing the approximate projections in MD^2. In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback, we consider the online stochastic shortest path problem, a special case of the above MDP problems, and improve on existing results by removing the previously required restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
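To make the online-linear-optimization view concrete, the following is a minimal Python sketch of entropic mirror descent (exponentiated gradient) on the probability simplex. This is not the paper's MD^2 itself: MD^2 runs mirror descent over the occupancy-measure polytope of the MDP and must use approximate projections onto that set, whereas here the "projection" is the exact simplex normalization. The step size eta and the random loss sequence are illustrative assumptions.

import numpy as np

def entropic_mirror_descent(loss_vectors, eta=0.1):
    # Online linear optimization on the probability simplex via mirror
    # descent with the entropic regularizer (exponentiated gradient).
    # MD^2 in the paper instead works over the occupancy-measure polytope
    # and uses approximate projections; this sketch omits that machinery.
    d = len(loss_vectors[0])
    x = np.full(d, 1.0 / d)            # uniform initial point
    total_loss = 0.0
    for loss in loss_vectors:          # loss_t is revealed after we play x_t
        loss = np.asarray(loss)
        total_loss += loss @ x         # linear loss suffered this round
        x = x * np.exp(-eta * loss)    # multiplicative (mirror) update
        x /= x.sum()                   # exact projection back onto the simplex
    return x, total_loss

# Hypothetical usage: 100 rounds of adversarial losses over 5 actions.
rng = np.random.default_rng(0)
losses = rng.random((100, 5))
x_final, cum_loss = entropic_mirror_descent(losses, eta=0.5)
print(x_final, cum_loss)

In the bandit-information setting treated in the paper, the full loss vector would not be observed; only the cost of the chosen trajectory is, and an estimated loss vector takes its place in the update.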
@inproceedings{DiGyoSze14,
	abstract = {In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information.  We propose to view this problem as an instance of online linear optimization.  We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.  We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2).  In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.},
	author = {Dick, T. and Gy{\"o}rgy, A. and Szepesv{\'a}ri, Cs.},
	booktitle = {ICML},
	date = {2014-01},
	date-added = {2014-01-13 10:52:36 -0700},
	date-modified = {2014-07-19 17:47:40 -0600},
	keywords = {online learning, adversarial setting, finite MDPs, recurrent MDPs, shortest path problem, theory},
	month = {January},
	pages = {512--520},
	title = {Online Learning in Markov Decision Processes with Changing Cost Sequences},
	url_paper = {omdp_icml.pdf},
	year = {2014}}