Off-Policy Actor-Critic. Degris, T., White, M., & Sutton, R. S. arXiv:1205.4839 [cs], May 2012.
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
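
As a rough illustration of the kind of update the abstract describes (online, incremental, linear in the number of weights, with eligibility traces and an importance-sampling ratio), here is a minimal sketch in Python. It is not the paper's exact Off-PAC algorithm: it uses a linear softmax policy and a plain TD(lambda) critic rather than the gradient-TD critic, and the feature arrays, behavior-policy probability, and step sizes are assumed placeholders.

import numpy as np

def softmax_probs(theta, phi_sa):
    """Action probabilities of a linear softmax policy.
    phi_sa: (num_actions, num_features) state-action features."""
    prefs = phi_sa @ theta
    prefs -= prefs.max()                      # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def off_policy_ac_step(theta, w, e_w, e_theta, phi_s, phi_s_next, phi_sa,
                       a, b_prob_a, reward, gamma=0.99, lam=0.8,
                       alpha_w=0.01, alpha_theta=0.001):
    """One incremental update from a transition generated by a behavior policy.
    Critic: TD(lambda) on state values v(s) = w . phi(s).
    Actor: rho-weighted policy-gradient step with an eligibility trace.
    Both updates are O(number of weights) per time step."""
    pi = softmax_probs(theta, phi_sa)
    rho = pi[a] / b_prob_a                    # importance-sampling ratio pi(a|s)/b(a|s)

    # Critic: TD error and rho-weighted accumulating trace.
    delta = reward + gamma * (w @ phi_s_next) - (w @ phi_s)
    e_w = rho * (gamma * lam * e_w + phi_s)
    w = w + alpha_w * delta * e_w

    # Actor: grad log pi(a|s) for the softmax policy, rho-weighted trace.
    grad_log_pi = phi_sa[a] - pi @ phi_sa
    e_theta = rho * (gamma * lam * e_theta + grad_log_pi)
    theta = theta + alpha_theta * delta * e_theta

    return theta, w, e_w, e_theta

The sketch only shows the shape of the per-step computation; convergence guarantees in the paper rely on the gradient-TD critic and assumptions not reflected here.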
@article{degris_off-policy_2012,
	title = {Off-{Policy} {Actor}-{Critic}},
	url = {http://arxiv.org/abs/1205.4839},
	abstract = {This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.},
	urldate = {2019-05-09},
	journal = {arXiv:1205.4839 [cs]},
	author = {Degris, Thomas and White, Martha and Sutton, Richard S.},
	month = may,
	year = {2012},
	note = {arXiv: 1205.4839},
	keywords = {Computer Science - Machine Learning}
}
