On the Optimality of Batch Policy Optimization Algorithms

On the Optimality of Batch Policy Optimization Algorithms. Wu, Y., Mei, J., Dai, B., Lattimore, T., Li, L., Szepesvári, C., & Schuurmans, D. In ICML 2021, pages 11362–11371.

Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.
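To make the idea of a confidence-adjusted index concrete, here is a minimal Python sketch. It assumes the index of each arm is its empirical mean plus a signed multiple of a confidence width, so that one parameter switches between pessimistic, neutral, and optimistic selection. The function name, the `alpha` and `delta` parameters, and the Hoeffding-style width are illustrative assumptions and are not taken from the paper.

```python
import math


def confidence_adjusted_index(counts, means, alpha, delta=0.05):
    """Pick the arm that maximizes mean_i + alpha * width_i from batch data.

    counts[i] is the number of logged pulls of arm i, means[i] its empirical
    mean reward. alpha < 0 is pessimistic (lower confidence bound), alpha = 0
    is neutral (greedy), alpha > 0 is optimistic (upper confidence bound).
    The Hoeffding-style width is an assumption made for illustration.
    """
    if len(counts) != len(means) or not counts:
        raise ValueError("counts and means must be non-empty and equal length")
    if any(n < 1 for n in counts):
        raise ValueError("this sketch assumes every arm has at least one sample")

    best_arm, best_index = 0, -math.inf
    for arm, (n, mean) in enumerate(zip(counts, means)):
        # Width shrinks as the arm is observed more often in the batch.
        width = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        index = mean + alpha * width
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


# Hypothetical batch: arm 0 is well covered, arm 1 looks better but has 2 samples.
counts = [100, 2]
means = [0.5, 0.9]
pessimistic_choice = confidence_adjusted_index(counts, means, alpha=-1.0)
greedy_choice = confidence_adjusted_index(counts, means, alpha=0.0)
optimistic_choice = confidence_adjusted_index(counts, means, alpha=1.0)
```

On this hypothetical batch the pessimistic rule selects the well-covered arm 0, while the neutral and optimistic rules select the sparsely sampled arm 1. This is the kind of behavioral difference that a criterion for distinguishing algorithms has to account for.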

@inproceedings{XiaoWMDL0SS21,
  author    = {
               Yifan Wu and
               Jincheng Mei and
               Bo Dai and
               Tor Lattimore and
               Lihong Li and
               Csaba Szepesv{\'{a}}ri and
               Dale Schuurmans},
  title     = {On the Optimality of Batch Policy Optimization Algorithms},
  pages     = {11362--11371},
  crossref  = {ICML2021},
  url_paper = {ICML2021-BatchPO.pdf},
  url_link  = {http://proceedings.mlr.press/v139/xiao21b.html},
  abstract  = {Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.},
}
