Reinforcement learning algorithms are commonly used to control dynamical systems with measurable state variables. If the dynamical system is only partially observable, reinforcement learning algorithms must be modified to compensate for the effect of partial observability. One common approach is to feed the algorithm a finite history of input-output data instead of the state variable. In this article, we study and quantify the effect of this approach in linear Gaussian systems with quadratic costs. We coin the concept of L-Extra-Sampled-dynamics to formalize the idea of using a finite history of input-output data instead of the state, and we show that this approach increases the average cost.
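The history-based approach described above can be sketched as follows: the controller never sees the state x_k, only a stacked vector of the last L input-output pairs. The system matrices, history length, and placeholder policy below are illustrative assumptions for a generic partially observable linear Gaussian system, not values taken from the article.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Illustrative partially observable linear Gaussian system (assumed values):
# x_{k+1} = A x_k + B u_k + w_k,   y_k = C x_k + v_k,
# where C measures only part of the state.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])

L = 3                      # history length (the "L" in L-Extra-Sampled-dynamics)
history = deque(maxlen=L)  # holds the last L (y_k, u_k) pairs

def history_observation(history, L, ny=1, nu=1):
    """Stack the last L (y, u) pairs into one vector, zero-padded at start-up."""
    pad = L - len(history)
    flat = [np.zeros(ny + nu)] * pad + [np.concatenate([y, u]) for y, u in history]
    return np.concatenate(flat)

x = np.zeros(2)
for k in range(5):
    obs = history_observation(history, L)      # what the learning agent would see
    u = -0.5 * obs[-2:-1]                      # placeholder policy acting on the history
    y = C @ x + 0.01 * rng.standard_normal(1)  # noisy partial measurement
    history.append((y, u))
    x = A @ x + B @ u + 0.01 * rng.standard_normal(2)

print(obs.shape)  # (L * (ny + nu),) = (6,)
```

The stacked history vector replaces the state as the policy input; the article's result is that, relative to a controller with full state access, this substitution incurs an increase in average quadratic cost.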
Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP); Alice Wallenberg Foundation; ZENITH; Excellence Center at Linköping-Lund in Information Technology (ELLIIT); Sensor Informatics and Decision-making for the Digital Transformation (SEDDIT); National Science Foundation [ECCS-2227311]; Vinnova Competence Center LINK-SIC; Scalable Kalman Filters project through the Swedish Research Council