In this paper, we analyze a Linear Quadratic (LQ) control problem in terms of the average cost and the structure of the value function. We develop a completely model-free reinforcement learning algorithm to solve the LQ problem. Our algorithm is an off-policy routine in which each policy is greedy with respect to all previous value functions. We prove that the algorithm produces stabilizing policies provided the estimation errors remain small. Empirically, our algorithm outperforms classical Q-learning and off-policy learning routines.
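The abstract does not include the algorithm itself, so the following is only a rough illustration of the model-free ingredients it mentions, not the paper's method: a minimal LSPI-style sketch in Python that fits a quadratic Q-function from trajectory data by least squares and improves the policy greedily. The system matrices A and B, the cost weights, the discount factor, and the exploration noise level are all hypothetical; in particular, this sketch is discounted and greedy with respect to only the latest value function, whereas the paper works with the average cost and with policies greedy with respect to all previous value functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical double-integrator system used only inside the simulator;
# the learner never touches A or B, so the loop below stays model-free.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q_cost = np.eye(2)   # state cost weight (assumed)
R_cost = np.eye(1)   # input cost weight (assumed)
gamma = 0.95         # discount factor; the paper analyzes average cost instead

n_x, n_u = 2, 1
n_z = n_x + n_u

def features(z):
    # Upper-triangular entries of z z^T, with off-diagonals doubled,
    # so that theta @ features(z) == z @ H @ z for a symmetric H.
    outer = np.outer(z, z) * (2.0 - np.eye(n_z))
    return outer[np.triu_indices(n_z)]

def theta_to_H(theta):
    # Rebuild the symmetric quadratic-form matrix H from theta.
    H = np.zeros((n_z, n_z))
    H[np.triu_indices(n_z)] = theta
    return H + np.triu(H, 1).T

def greedy_gain(H):
    # Minimize Q(x, u) = [x; u]^T H [x; u] over u: gives u = -K x.
    return np.linalg.solve(H[n_x:, n_x:], H[n_x:, :n_x])

def rollout(K, T=400, noise=0.5):
    # Collect (x, u, cost, x_next) tuples under K plus exploration noise.
    x = rng.standard_normal(n_x)
    data = []
    for _ in range(T):
        u = -K @ x + noise * rng.standard_normal(n_u)
        c = x @ Q_cost @ x + u @ R_cost @ u
        x_next = A @ x + B @ u
        data.append((x, u, c, x_next))
        x = x_next
    return data

def evaluate_q(data, K):
    # Least-squares fit of the quadratic Q-function of policy K using the
    # Bellman equation Q(x, u) = c + gamma * Q(x', -K x').
    Phi, costs = [], []
    for x, u, c, x_next in data:
        z = np.concatenate([x, u])
        z_next = np.concatenate([x_next, -K @ x_next])
        Phi.append(features(z) - gamma * features(z_next))
        costs.append(c)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(costs), rcond=None)
    return theta_to_H(theta)

K = np.zeros((n_u, n_x))      # initial policy
for _ in range(6):            # alternate evaluation and greedy improvement
    H = evaluate_q(rollout(K), K)
    K = greedy_gain(H)        # greedy w.r.t. the latest Q-function only

print("learned feedback gain K =\n", K)
```

Because the simulated dynamics here are deterministic, the per-sample Bellman equation holds exactly and the least-squares fit recovers the Q-function of the current policy; the greedy step then reduces to inverting a small block of H, which is where small estimation errors matter for stability.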
Funding: Vinnova Competence Center LINK-SIC; Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP).