A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP

Created by MG96

External, Public, cs.LG, cs.AI

Statistics

Citations: 0
References: 33
Authors

Tejaram Sangadi, L. A. Prashanth, Krishna Jagannathan
Project Resources

Name                    Type   Source
ArXiv Paper             Paper  arXiv
Semantic Scholar Paper  Paper  Semantic Scholar
Abstract

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee, where $n$ denotes the number of iterations of the SPSA-based actor-critic algorithm. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.
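
As a rough illustration of the two ingredients named in the abstract, the sketch below pairs a tail-averaged TD(0) critic with linear function approximation and a two-point SPSA gradient estimate of the kind used for actor updates. This is a minimal sketch under stated assumptions, not the paper's algorithm: the environment interface (`env_step`, `features`), step size, tail fraction, and the toy objectives are illustrative placeholders, and the paper's mean-variance critic additionally tracks second-moment quantities that are omitted here.

```python
# Minimal sketch (not the paper's exact algorithm): tail-averaged TD(0)
# with linear function approximation (LFA), plus a two-point SPSA gradient
# estimate. All environment/feature/step-size choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)


def td_critic_lfa(env_step, features, theta0, gamma=0.9, alpha=0.1,
                  num_iters=2000, tail_frac=0.5):
    """Tail-averaged TD(0) with LFA.

    env_step(s) -> (reward, next_state) is assumed to sample a transition
    under the policy being evaluated; features(s) returns a feature vector.
    Returns the average of the last `tail_frac` fraction of iterates.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = []
    s = 0  # assumed initial state
    for _ in range(num_iters):
        r, s_next = env_step(s)
        phi, phi_next = features(s), features(s_next)
        td_error = r + gamma * phi_next @ theta - phi @ theta
        theta = theta + alpha * td_error * phi
        iterates.append(theta.copy())
        s = s_next
    tail_start = int((1.0 - tail_frac) * num_iters)
    return np.mean(iterates[tail_start:], axis=0)


def spsa_gradient(objective, policy_param, delta=0.05):
    """Two-point SPSA gradient estimate with a Rademacher perturbation."""
    d = rng.choice([-1.0, 1.0], size=policy_param.shape)
    j_plus = objective(policy_param + delta * d)
    j_minus = objective(policy_param - delta * d)
    return (j_plus - j_minus) / (2.0 * delta * d)


if __name__ == "__main__":
    # Toy 5-state chain, purely to exercise the critic sketch above.
    n_states, dim = 5, 3
    Phi = rng.standard_normal((n_states, dim))

    def features(s):
        return Phi[s]

    def env_step(s):
        s_next = (s + rng.integers(0, 2)) % n_states
        return float(s_next == 0), int(s_next)

    theta_bar = td_critic_lfa(env_step, features, np.zeros(dim))
    print("tail-averaged critic parameters:", theta_bar)

    # Illustrative SPSA call on a quadratic stand-in for the
    # critic-estimated mean-variance objective.
    w = np.array([1.0, -2.0])
    print("SPSA gradient estimate:", spsa_gradient(lambda x: float(x @ x), w))
```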

Note:

No note available for this project.

Contact:

No contact available for this project.
