Online Learning in Markov Decision Processes with Changing Cost Sequences. Dick, T.; György, A.; and Szepesvári, C. In ICML, pages 512–520, January, 2014.
Paper abstract bibtex 2 downloads In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
@inproceedings{DiGyoSze14,
abstract = {In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.},
author = {Dick, T. and Gy{\"o}rgy, A. and Szepesv{\'a}ri, Cs.},
booktitle = {ICML},
date = {2014-01},
date-added = {2014-01-13 10:52:36 -0700},
date-modified = {2014-07-19 17:47:40 -0600},
keywords = {online learning, adversarial setting, finite MDPs, recurrent MDPs, shortest path problem, theory},
month = {January},
pages = {512--520},
title = {Online Learning in Markov Decision Processes with Changing Cost Sequences},
url_paper = {omdp_icml.pdf},
year = {2014}}