Online Learning in Markov Decision Processes with Changing Cost Sequences

Online Learning in Markov Decision Processes with Changing Cost Sequences. Dick, T., György, A., & Szepesvári, C. In ICML, pages 512–520, 01, 2014.

Paper abstract bibtex 3 downloads

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.

@inproceedings{DiGyoSze14,
	abstract = {In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information.  We propose to view this problem as an instance of online linear optimization.  We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.  We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2).  In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.},
	acceptrate = {310 out of 1238=25\%},
	author = {Dick, T. and Gy{\"o}rgy, A. and Szepesv{\'a}ri, Cs.},
	booktitle = {ICML},
	keywords = {online learning, adversarial setting, finite MDPs, recurrent MDPs, shortest path problem, theory},
	month = {01},
	pages = {512--520},
	title = {Online Learning in Markov Decision Processes with Changing Cost Sequences},
	url_paper = {omdp_icml.pdf},
	year = {2014}}

Downloads: 3