This post studies the discrete-time, infinite-horizon, discounted LQR problem. The reference for this material is Volume 2 of Bertsekas's Dynamic Programming and Optimal Control. I am writing these notes because I had a harder time than I expected finding derivations online for the discounted-cost version of LQR.

\newcommand{\abs}[1]{| #1 |}
\newcommand{\bigabs}[1]{\left| #1 \right|}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\B}{\mathcal{B}}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\T}{\mathsf{T}}
\newcommand{\Tr}{\mathrm{Tr}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\bmattwo}[4]{\begin{bmatrix} #1 & #2 \\ #3 & #4 \end{bmatrix}}
\newcommand{\bmatcol}[2]{\begin{bmatrix} #1 \\ #2 \end{bmatrix}}

Setup

We have a discrete-time LQR system evolving as

x_{k+1} = A x_k + B u_k + w_k \:, \:\: x_0 = 0 \:,

where \E[w_k] = 0, \E[w_k w_k^\T] = \sigma^2 I, and w_k is independent of w_{k'} for all k \neq k'. Our goal is to find a control law u that minimizes

\begin{align}
J(u) = \E\left[ \sum_{k=0}^{\infty} \gamma^k (x_k^\T Q x_k + u_k^\T R u_k) \right] \:, \label{eq:discounted_problem}
\end{align}

with Q positive-semidefinite, R positive-definite, and \gamma \in (0, 1). This is to be contrasted with the infinite-horizon, average-cost LQR problem, which seeks to minimize

J(u) = \limsup_{T \to \infty} \E\left[ \frac{1}{T}\sum_{k=0}^{T-1} (x_k^\T Q x_k + u_k^\T R u_k) \right] \:.
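To make the setup concrete, here is a minimal simulation sketch (assuming numpy; the function name and the user-supplied gain K are illustrative, not part of the derivation) that Monte Carlo estimates the discounted cost J(u) of a linear policy u_k = K x_k, truncating the infinite sum at a finite horizon:

```python
import numpy as np

def estimate_discounted_cost(A, B, K, Q, R, gamma, sigma,
                             horizon=2000, trials=200, seed=0):
    """Monte Carlo estimate of J(u) for the linear policy u_k = K x_k.

    The infinite sum is truncated at `horizon`; since gamma < 1, the
    truncation error decays geometrically.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    costs = np.zeros(trials)
    for t in range(trials):
        x = np.zeros(n)   # x_0 = 0, as in the setup
        discount = 1.0
        for _ in range(horizon):
            u = K @ x
            costs[t] += discount * (x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u + sigma * rng.standard_normal(n)  # w_k has covariance sigma^2 I
            discount *= gamma
    return costs.mean()
```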

Derivation

Given a control policy \pi, let V^\pi(x) denote the value function

V^\pi(x) = \E\left[ \sum_{k=0}^{\infty} \gamma^k (x_k^\T Q x_k + u_k^\T R u_k) \right] \:\: \text{s.t.} \:\: x_0 = x \:, \: u_k = \pi(x_k) \:,

and let V^\ast(x) denote the optimal value function

V^\ast(x) = \min_{\pi} V^\pi(x) \:.

Bellman's optimality principle states that for all x,

\begin{align}
V^\ast(x) = \min_{u} \left\{ x^\T Q x + u^\T R u + \gamma \E_{x' \sim p(\cdot| x, u)}[ V^\ast(x')] \right\} \:. \label{eq:bellman}
\end{align}

Stipulating that V^\ast(x) = x^\T P x + q for a positive-semidefinite P and a scalar q, and plugging into \eqref{eq:bellman} (the constant q appears on both sides, once undiscounted and once multiplied by \gamma, which produces the (1-\gamma)q term),

\begin{align*}
&x^\T P x + (1-\gamma)q \\
&\qquad= \min_u \left\{ x^\T Q x + u^\T R u + \gamma \E_{w}[ (Ax + Bu + w)^\T P (Ax + Bu + w) ] \right\} \\
&\qquad= \min_u \left\{ x^\T Q x + u^\T R u + \gamma (Ax + Bu)^\T P (Ax + Bu) + \gamma\sigma^2\Tr(P) \right\} \\
&\qquad= \min_u \left\{ \bmatcol{x}{u}^\T \left( \bmattwo{Q}{0}{0}{R} + \gamma \bmattwo{A^\T P A}{A^\T P B}{B^\T P A}{B^\T P B} \right) \bmatcol{x}{u} \right\} + \gamma \sigma^2 \Tr(P) \\
&\qquad= x^\T ( Q + \gamma A^\T P A - \gamma^2 A^\T P B (R + \gamma B^\T P B)^{-1} B^\T P A ) x + \gamma \sigma^2 \Tr(P) \:.
\end{align*}

The u which achieves the minimum is u^\ast = -\gamma(R + \gamma B^\T P B)^{-1} B^\T P A x. This fact and the last equality follow from partial minimization of a quadratic. Specifically, if P_{22} is positive-definite, it is straightforward to show that

\min_{x_2} \bmatcol{x_1}{x_2}^\T \bmattwo{P_{11}}{P_{12}}{P_{12}^\T}{P_{22}} \bmatcol{x_1}{x_2} = x_1^\T (P_{11} - P_{12} P_{22}^{-1} P_{12}^\T ) x_1 = x_1^\T (P/P_{22}) x_1 \:,

where x_2 = -P_{22}^{-1} P_{12}^\T x_1 achieves the minimum and P/P_{22} denotes the Schur complement. Here we apply this with P_{11} = Q + \gamma A^\T P A, P_{12} = \gamma A^\T P B, and P_{22} = R + \gamma B^\T P B, which is positive-definite since R is. Matching the quadratic and constant terms on both sides, we can solve for P and q as

\begin{align}
P &= \gamma A^\T P A - \gamma^2 A^\T P B(R + \gamma B^\T P B)^{-1} B^\T P A + Q \:, \label{eq:soln} \\
q &= \sigma^2 \frac{\gamma}{1-\gamma} \Tr(P) \:. \nonumber
\end{align}
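Since \gamma < 1, the Bellman backup is a contraction, so iterating the map on the right-hand side of \eqref{eq:soln} from P = 0 converges to the fixed point. Here is a sketch of that iteration (assuming numpy; the function name and tolerances are illustrative choices):

```python
import numpy as np

def solve_discounted_riccati(A, B, Q, R, gamma, sigma,
                             tol=1e-10, max_iter=100_000):
    """Fixed-point iteration for the discounted Riccati equation.

    Iterates P <- Q + g A'PA - g^2 A'PB (R + g B'PB)^{-1} B'PA,
    then recovers the optimal gain K and the constant offset q.
    """
    P = np.zeros_like(Q)
    for _ in range(max_iter):
        G = np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
        P_next = Q + gamma * A.T @ P @ A - gamma**2 * A.T @ P @ B @ G
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    # u^* = K x with K = -gamma (R + gamma B'PB)^{-1} B'PA.
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    q = sigma**2 * gamma / (1 - gamma) * np.trace(P)
    return P, K, q
```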

Consequences

The canonical discrete algebraic Riccati equation (DARE) associated with (A,B,Q,R) is

P = A^\T P A - A^\T P B(R + B^\T P B)^{-1} B^\T P A + Q \:,

and we write \mathrm{DARE}(A,B,Q,R) for its solution. It is known that if (A,B) is controllable and (A,C) is observable (where Q = C^\T C), then the solution to \mathrm{DARE}(A,B,Q,R) is unique and positive-definite. Furthermore, defining K = -(R + B^\T P B)^{-1} B^\T P A, the resulting closed-loop matrix A+BK is stable.
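For the undiscounted equation there is no need to hand-roll a solver: scipy.linalg.solve_discrete_are solves exactly this canonical DARE. A quick sketch (the example matrices are arbitrary, chosen for illustration only):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)            # Q = C'C with C = I, so (A, C) is observable
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
# The closed-loop matrix A + BK should have spectral radius < 1.
print(np.max(np.abs(np.linalg.eigvals(A + B @ K))))
```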

By a simple change of variables, we see that \eqref{eq:soln} is equivalent to \mathrm{DARE}(\sqrt{\gamma} A, B, Q, R/\gamma). This means that as long as (\sqrt{\gamma}A,B) is controllable and (\sqrt{\gamma}A,C) is observable, the matrix \sqrt{\gamma}(A+BK) is stable, where K = -\gamma(R + \gamma B^\T P B)^{-1} B^\T P A is the optimal gain for the discounted problem. (Indeed, the optimal gain for the transformed problem is \sqrt{\gamma} K, so its stable closed-loop matrix is \sqrt{\gamma} A + B \sqrt{\gamma} K = \sqrt{\gamma}(A+BK).) First, we observe that these controllability and observability requirements are equivalent to (A,B) controllable and (A,C) observable:
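This change of variables also gives a direct way to compute the discounted solution with an off-the-shelf DARE solver; a sketch (assuming numpy and scipy; solve_discounted_lqr is an illustrative name), which can be cross-checked against the fixed-point iteration above:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def solve_discounted_lqr(A, B, Q, R, gamma):
    """Discounted Riccati solution via DARE(sqrt(gamma) A, B, Q, R / gamma)."""
    P = solve_discrete_are(np.sqrt(gamma) * A, B, Q, R / gamma)
    # Optimal discounted gain: K = -gamma (R + gamma B'PB)^{-1} B'PA.
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
    return P, K

# Sanity check on the example system from above: sqrt(gamma) (A + BK) is stable.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
P, K = solve_discounted_lqr(A, B, np.eye(2), np.array([[1.0]]), gamma=0.9)
print(np.max(np.abs(np.linalg.eigvals(np.sqrt(0.9) * (A + B @ K)))))
```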

Proposition: Let \alpha \neq 0. Then (A,B) is controllable iff (\alpha A, B) is controllable. Similarly, (A,C) is observable iff (\alpha A, C) is observable.

Proof: This follows immediately since each block column satisfies (\alpha A)^k B = \alpha^k A^k B, a nonzero scalar multiple of the corresponding block of the original controllability matrix, and hence

\mathcal{R}\left(\begin{bmatrix} B & A B & \cdots & A^{n-1} B \end{bmatrix}\right) = \mathcal{R}\left(\begin{bmatrix} B & (\alpha A) B & \cdots & (\alpha A)^{n-1} B \end{bmatrix}\right) \:,

where \mathcal{R}(\cdot) denotes the range of a matrix. A similar argument holds for the observability matrix. \square
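A quick numerical sanity check of the proposition (a sketch with random matrices, assuming numpy):

```python
import numpy as np

def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^{n-1} B]."""
    n = A.shape[0]
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

rng = np.random.default_rng(1)
n, m, alpha = 4, 2, 0.7
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))

# Scaling A by a nonzero alpha leaves the rank (and range) unchanged.
print(np.linalg.matrix_rank(ctrb(A, B)),
      np.linalg.matrix_rank(ctrb(alpha * A, B)))
```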

On the other hand, this closed-loop guarantee is weaker than stability of A+BK itself: it only says that the spectral radius of A+BK is less than 1/\sqrt{\gamma}. This means you can have solutions to \eqref{eq:discounted_problem} for which the optimal control law does not stabilize the system, but still yields a finite discounted cost! The simplest example of this comes from Postoyan et al. (2017): take A=2, B=1, and Q=R=1. In this case, whenever \gamma \in (0, 1/3], the closed-loop system is not stable.
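This example is easy to check numerically; a sketch using the change-of-variables solver from above (scalar case, assuming numpy and scipy):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A, B, Q, R = 2.0, 1.0, 1.0, 1.0
for gamma in [0.2, 1/3, 0.35, 0.5, 0.9]:
    P = solve_discrete_are(np.sqrt(gamma) * np.atleast_2d(A), np.atleast_2d(B),
                           np.atleast_2d(Q), np.atleast_2d(R) / gamma)[0, 0]
    K = -gamma * P * A / (R + gamma * P * B**2)  # scalar optimal gain
    print(f"gamma={gamma:.3f}  |A+BK|={abs(A + B * K):.4f}")
# For gamma <= 1/3, |A+BK| >= 1 (the closed loop is not stable), even though
# sqrt(gamma) * (A+BK) is stable and the discounted cost remains finite.
```

In the scalar case the optimal closed loop simplifies to A+BK = 2/(1+\gamma P), and solving the scalar version of \eqref{eq:soln} shows that \gamma P \le 1 (equivalently, |A+BK| \ge 1) exactly when \gamma \le 1/3.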