This post studies the discrete-time, infinite-horizon, discounted LQR problem. The reference for this material is Volume 2 of Bertsekas's Dynamic Programming and Optimal Control. I am writing these notes because I had a harder time than I expected finding derivations online for the discounted-cost version of LQR.

\newcommand{\abs}[1]{| #1 |}
\newcommand{\bigabs}[1]{\left| #1 \right|}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\B}{\mathcal{B}}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\T}{\mathsf{T}}
\newcommand{\Tr}{\mathrm{Tr}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\bmattwo}[4]{\begin{bmatrix} #1 & #2 \\ #3 & #4 \end{bmatrix}}
\newcommand{\bmatcol}[2]{\begin{bmatrix} #1 \\ #2 \end{bmatrix}}

Setup

We have a discrete-time LQR system evolving as

x_{k+1} = A x_k + B u_k + w_k \:, \:\: x_0 = 0 \:,

where \E[w_k] = 0, \E[w_k w_k^\T] = \sigma^2 I, and w_k is independent of w_{k'} for all k \neq k'. Our goal is to find a control law u that minimizes

\begin{align}
J(u) = \E\left[ \sum_{k=0}^{\infty} \gamma^k (x_k^\T Q x_k + u_k^\T R u_k) \right] \:, \label{eq:discounted_problem}
\end{align}

with Q positive-semidefinite, R positive-definite, and \gamma \in (0, 1). This is to be contrasted with the infinite-horizon, average-cost LQR problem, which seeks to minimize

J(u) = \limsup_{T \to \infty} \E\left[ \frac{1}{T}\sum_{k=0}^{T-1} (x_k^\T Q x_k + u_k^\T R u_k) \right] \:.
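To make the setup concrete, here is a minimal simulation sketch (assuming numpy; the function name and the user-supplied gain K are illustrative, not part of the derivation) that Monte Carlo estimates the discounted cost J(u) of a linear policy u_k = K x_k, truncating the infinite sum at a finite horizon:

```python
import numpy as np

def estimate_discounted_cost(A, B, K, Q, R, gamma, sigma,
                             horizon=2000, trials=200, seed=0):
    """Monte Carlo estimate of J(u) for the linear policy u_k = K x_k.

    The infinite sum is truncated at `horizon`; since gamma < 1, the
    truncation error decays geometrically.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    costs = np.zeros(trials)
    for t in range(trials):
        x = np.zeros(n)   # x_0 = 0, as in the setup
        discount = 1.0
        for _ in range(horizon):
            u = K @ x
            costs[t] += discount * (x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u + sigma * rng.standard_normal(n)  # w_k has covariance sigma^2 I
            discount *= gamma
    return costs.mean()
```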

Derivation

Given a control policy \pi, let V^\pi(x) denote the value function

V^\pi(x) = \E\left[ \sum_{k=0}^{\infty} \gamma^k (x_k^\T Q x_k + u_k^\T R u_k) \right] \:\: \text{s.t.} \:\: x_0 = x \:, \: u_k = \pi(x_k) \:,

and let V^\ast(x) denote the optimal value function

V^\ast(x) = \min_{\pi} V^\pi(x) \:.

Bellman's optimality principle states that for all x,

\begin{align}
V^\ast(x) = \min_{u} \left\{ x^\T Q x + u^\T R u + \gamma \E_{x' \sim p(\cdot| x, u)}[ V^\ast(x')] \right\} \:. \label{eq:bellman}
\end{align}

Stipulating that V^\ast(x) = x^\T P x + q for a positive-semidefinite P and a scalar q, and plugging into \eqref{eq:bellman} (the constant q appears on both sides, once undiscounted and once multiplied by \gamma, which produces the (1-\gamma)q term),

\begin{align*}
&x^\T P x + (1-\gamma)q \\
&\qquad= \min_u \left\{ x^\T Q x + u^\T R u + \gamma \E_{w}[ (Ax + Bu + w)^\T P (Ax + Bu + w) ] \right\} \\
&\qquad= \min_u \left\{ x^\T Q x + u^\T R u + \gamma (Ax + Bu)^\T P (Ax + Bu) + \gamma\sigma^2\Tr(P) \right\} \\
&\qquad= \min_u \left\{ \bmatcol{x}{u}^\T \left( \bmattwo{Q}{0}{0}{R} + \gamma \bmattwo{A^\T P A}{A^\T P B}{B^\T P A}{B^\T P B} \right) \bmatcol{x}{u} \right\} + \gamma \sigma^2 \Tr(P) \\
&\qquad= x^\T ( Q + \gamma A^\T P A - \gamma^2 A^\T P B (R + \gamma B^\T P B)^{-1} B^\T P A ) x + \gamma \sigma^2 \Tr(P) \:.
\end{align*}

The u which achieves the minimum is u^\ast = -\gamma(R + \gamma B^\T P B)^{-1} B^\T P A x. This fact and the last equality follow from partial minimization of a quadratic. Specifically, if P_{22} is positive-definite, it is straightforward to show that

\min_{x_2} \bmatcol{x_1}{x_2}^\T \bmattwo{P_{11}}{P_{12}}{P_{12}^\T}{P_{22}} \bmatcol{x_1}{x_2} = x_1^\T (P_{11} - P_{12} P_{22}^{-1} P_{12}^\T ) x_1 = x_1^\T (P/P_{22}) x_1 \:,

where x_2 = -P_{22}^{-1} P_{12}^\T x_1 achieves the minimum and P/P_{22} denotes the Schur complement. Here we apply this with P_{11} = Q + \gamma A^\T P A, P_{12} = \gamma A^\T P B, and P_{22} = R + \gamma B^\T P B, which is positive-definite since R is. Matching the quadratic and constant terms on both sides, we can solve for P and q as

\begin{align}
P &= \gamma A^\T P A - \gamma^2 A^\T P B(R + \gamma B^\T P B)^{-1} B^\T P A + Q \:, \label{eq:soln} \\
q &= \sigma^2 \frac{\gamma}{1-\gamma} \Tr(P) \:. \nonumber
\end{align}
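Since \gamma < 1, the Bellman backup is a contraction, so iterating the map on the right-hand side of \eqref{eq:soln} from P = 0 converges to the fixed point. Here is a sketch of that iteration (assuming numpy; the function name and tolerances are illustrative choices):

```python
import numpy as np

def solve_discounted_riccati(A, B, Q, R, gamma, sigma,
                             tol=1e-10, max_iter=100_000):
    """Fixed-point iteration for the discounted Riccati equation.

    Iterates P <- Q + g A'PA - g^2 A'PB (R + g B'PB)^{-1} B'PA,
    then recovers the optimal gain K and the constant offset q.
    """
    P = np.zeros_like(Q)
    for _ in range(max_iter):
        G = np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
        P_next = Q + gamma * A.T @ P @ A - gamma**2 * A.T @ P @ B @ G
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    # u^* = K x with K = -gamma (R + gamma B'PB)^{-1} B'PA.
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    q = sigma**2 * gamma / (1 - gamma) * np.trace(P)
    return P, K, q
```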

Consequences

The canonical discrete algebraic Riccati equation (DARE) associated with (A,B,Q,R) is

P = A^\T P A - A^\T P B(R + B^\T P B)^{-1} B^\T P A + Q \:,

and we write \mathrm{DARE}(A,B,Q,R) for its solution. It is known that if (A,B) is controllable and (A,C) is observable (where Q = C^\T C), then the solution to \mathrm{DARE}(A,B,Q,R) is unique and positive-definite. Furthermore, defining K = -(R + B^\T P B)^{-1} B^\T P A, the resulting closed-loop matrix A+BK is stable.
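For the undiscounted equation there is no need to hand-roll a solver: scipy.linalg.solve_discrete_are solves exactly this canonical DARE. A quick sketch (the example matrices are arbitrary, chosen for illustration only):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)            # Q = C'C with C = I, so (A, C) is observable
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
# The closed-loop matrix A + BK should have spectral radius < 1.
print(np.max(np.abs(np.linalg.eigvals(A + B @ K))))
```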

By a simple change of variables, we see that \eqref{eq:soln} is equivalent to \mathrm{DARE}(\sqrt{\gamma} A, B, Q, R/\gamma). This means that as long as (\sqrt{\gamma}A,B) is controllable and (\sqrt{\gamma}A,C) is observable, the matrix \sqrt{\gamma}(A+BK) is stable, where K = -\gamma(R + \gamma B^\T P B)^{-1} B^\T P A is the optimal gain for the discounted problem. (Indeed, the optimal gain for the transformed problem is \sqrt{\gamma} K, so its stable closed-loop matrix is \sqrt{\gamma} A + B \sqrt{\gamma} K = \sqrt{\gamma}(A+BK).) First, we observe that these controllability and observability requirements are equivalent to (A,B) controllable and (A,C) observable:
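This change of variables also gives a direct way to compute the discounted solution with an off-the-shelf DARE solver; a sketch (assuming numpy and scipy; solve_discounted_lqr is an illustrative name), which can be cross-checked against the fixed-point iteration above:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def solve_discounted_lqr(A, B, Q, R, gamma):
    """Discounted Riccati solution via DARE(sqrt(gamma) A, B, Q, R / gamma)."""
    P = solve_discrete_are(np.sqrt(gamma) * A, B, Q, R / gamma)
    # Optimal discounted gain: K = -gamma (R + gamma B'PB)^{-1} B'PA.
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return P, K

# Sanity check on the example system from above: sqrt(gamma) (A + BK) is stable.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
P, K = solve_discounted_lqr(A, B, np.eye(2), np.array([[1.0]]), gamma=0.9)
print(np.max(np.abs(np.linalg.eigvals(np.sqrt(0.9) * (A + B @ K)))))
```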

Proposition: Let \alpha \neq 0. Then (A,B) is controllable iff (\alpha A, B) is controllable. Similarly, (A,C) is observable iff (\alpha A, C) is observable.

Proof: This follows immediately since each block column satisfies (\alpha A)^k B = \alpha^k A^k B, a nonzero scalar multiple of the corresponding block of the original controllability matrix, and hence

\mathcal{R}\left(\begin{bmatrix} B & A B & \cdots & A^{n-1} B \end{bmatrix}\right) = \mathcal{R}\left(\begin{bmatrix} B & (\alpha A) B & \cdots & (\alpha A)^{n-1} B \end{bmatrix}\right) \:,

where \mathcal{R}(\cdot) denotes the range of a matrix. A similar argument holds for the observability matrix. \square
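A quick numerical sanity check of the proposition (a sketch with random matrices, assuming numpy):

```python
import numpy as np

def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^{n-1} B]."""
    n = A.shape[0]
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

rng = np.random.default_rng(1)
n, m, alpha = 4, 2, 0.7
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))

# Scaling A by a nonzero alpha leaves the rank (and range) unchanged.
print(np.linalg.matrix_rank(ctrb(A, B)),
      np.linalg.matrix_rank(ctrb(alpha * A, B)))
```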

On the other hand, this closed-loop guarantee is weaker than stability of A+BK itself: it only says that the spectral radius of A+BK is less than 1/\sqrt{\gamma}. This means you can have solutions to \eqref{eq:discounted_problem} for which the optimal control law does not stabilize the system, but still yields a finite discounted cost! The simplest example of this comes from Postoyan et al. (2017): take A=2, B=1, and Q=R=1. In this case, whenever \gamma \in (0, 1/3], the closed-loop system is not stable.
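This example is easy to check numerically; a sketch using the change-of-variables solver from above (scalar case, assuming numpy and scipy):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A, B, Q, R = 2.0, 1.0, 1.0, 1.0
for gamma in [0.2, 1/3, 0.35, 0.5, 0.9]:
    P = solve_discrete_are(np.sqrt(gamma) * np.atleast_2d(A), np.atleast_2d(B),
                           np.atleast_2d(Q), np.atleast_2d(R) / gamma)[0, 0]
    K = -gamma * P * A / (R + gamma * P * B**2)  # scalar optimal gain
    print(f"gamma={gamma:.3f}  |A+BK|={abs(A + B * K):.4f}")
# For gamma <= 1/3, |A+BK| >= 1 (the closed loop is not stable), even though
# sqrt(gamma) * (A+BK) is stable and the discounted cost remains finite.
```

In the scalar case the optimal closed loop simplifies to A+BK = 2/(1+\gamma P), and solving the scalar version of \eqref{eq:soln} shows that \gamma P \le 1 (equivalently, |A+BK| \ge 1) exactly when \gamma \le 1/3.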