We are interested in an infinite-horizon optimal control problem, and we ask under what conditions the dynamic programming value iterates converge. We will state and prove a useful convergence theorem, but first we need some definitions.
Firstly, we consider the discrete-time dynamical system
$$x_{t+1} = f(x_t, u_t),$$
with state $x_t \in \mathbb{R}^{n_x}$ and input $u_t \in \mathbb{R}^{n_u}$, for $t \in \mathbb{N}$.
Starting from an initial state $x_0$ and applying an infinite sequence of inputs $u = (u_0, u_1, \ldots)$, we obtain a sequence of states $(x_0, x_1, x_2, \ldots)$, with $x_{t+1} = f(x_t, u_t)$ for all $t \in \mathbb{N}$.
The objective is to determine a sequence $u = (u_0, u_1, \ldots)$ that minimises the cost
$$J(x_0, u) = \sum_{t=0}^{\infty} \ell(x_t, u_t),$$
where $\ell$ is a stage cost function, subject to the constraints
$$x_t \in X, \quad u_t \in U(x_t), \quad t \in \mathbb{N}.$$
We say that a sequence $u = (u_t)_{t \in \mathbb{N}}$ is admissible for $x$ if it is in
$$\mathcal{U}(x) = \left\{ (u_t)_{t \in \mathbb{N}} \;:\; x_0 = x,\; u_t \in U(x_t),\; x_{t+1} = f(x_t, u_t) \in X,\; t \in \mathbb{N} \right\},$$
where $U : X \rightrightarrows \mathbb{R}^{n_u}$ is a set-valued mapping.
The set of feasible states is defined as
$$X_{\infty} = \{ x \in X : \mathcal{U}(x) \neq \emptyset \}.$$
By definition, $X_{\infty} \subseteq X$.
The infinite-horizon optimal control problem can be written as
$$\mathbb{P}(x) \;:\; \operatorname*{minimise}_{u \in \mathcal{U}(x)}\; J(x, u). \tag{1}$$
Define the value function of $\mathbb{P}(x)$ as $V^\star(x) = \inf_{u \in \mathcal{U}(x)} J(x, u)$ and the solution mapping as $U^\star(x) = \operatorname*{argmin}_{u \in \mathcal{U}(x)} J(x, u)$.
Important: $V^\star(x) < \infty$ does not imply that $U^\star(x)$ is nonempty, as the infimum may fail to be attained. Suppose, for example, that the system is $x_{t+1} = u_t x_t$ with $X = \mathbb{R}$, $U(x) = (0, 1)$ and $\ell(x, u) = |x|$ for all $x$; then, for $x_0 \neq 0$, every admissible sequence incurs the cost
$$J(x_0, u) = |x_0| \left( 1 + u_0 + u_0 u_1 + \cdots \right) > |x_0|.$$
Therefore,
$$V^\star(x_0) = \inf_{u \in \mathcal{U}(x_0)} J(x_0, u) = |x_0|,$$
and
$$V^\star(x_0) < \infty,$$
but no admissible sequence attains this infimum, so $U^\star(x_0) = \emptyset$.
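The non-attainment of the infimum can be sketched numerically. The snippet below (a hypothetical illustration: the system $x_{t+1} = u_t x_t$ with $u_t \in (0,1)$ and stage cost $|x_t|$, run with constant inputs $u_t = \varepsilon$) shows the cost approaching $|x_0|$ from above without ever reaching it:

```python
# Illustrative example (not from the original problem data): x_{t+1} = u_t * x_t
# with u_t in (0, 1) and stage cost |x_t|. For the constant policy u_t = eps,
# the infinite-horizon cost from x_0 is |x_0| * sum_{t>=0} eps^t = |x_0|/(1-eps).

def cost(x0, eps):
    """Closed-form infinite-horizon cost for the constant policy u_t = eps."""
    return abs(x0) / (1.0 - eps)

x0 = 2.0
costs = [cost(x0, eps) for eps in (0.5, 0.1, 0.01, 0.001)]

# The costs decrease towards |x0| = 2, yet every admissible sequence pays
# strictly more than |x0|: the infimum V*(x0) = |x0| is not attained.
assert costs == sorted(costs, reverse=True)
assert all(c > abs(x0) for c in costs)
print(costs)
```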
The infinite-horizon constrained OCP can be written as
$$V^\star(x_0) = \inf_{(u_t)_{t \in \mathbb{N}}} \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t),$$
where $\bar{\ell}$ is the stage cost function plus the indicator of the state and input constraints, i.e.,
$$\bar{\ell}(x, u) = \ell(x, u) + \delta\left( (x, u) \mid Z \right),$$
where $Z = \{ (x, u) : x \in X,\ u \in U(x) \}$ and $\delta(\cdot \mid Z)$ is the indicator function of $Z$, that is, $\delta(z \mid Z) = 0$ if $z \in Z$ and $\delta(z \mid Z) = +\infty$ otherwise.
Note that $\bar{\ell} : \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \to \overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ is an extended-real-valued function.
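In computations, the extended-real-valued cost $\bar{\ell}$ is conveniently encoded with floating-point infinity. A minimal sketch, assuming hypothetical box constraints $X = [-1, 1]$ and $U(x) = [-0.5, 0.5]$ and a quadratic stage cost (all illustrative choices):

```python
import math

def ell(x, u):
    """Hypothetical stage cost (illustrative only)."""
    return x**2 + u**2

def ell_bar(x, u, x_lo=-1.0, x_hi=1.0, u_lo=-0.5, u_hi=0.5):
    """Stage cost plus the indicator of the state and input constraints:
    ell(x, u) when (x, u) is in Z, and +infinity otherwise."""
    if x_lo <= x <= x_hi and u_lo <= u <= u_hi:
        return ell(x, u)
    return math.inf

assert abs(ell_bar(0.5, 0.2) - 0.29) < 1e-12   # feasible pair: finite cost
assert math.isinf(ell_bar(2.0, 0.0))           # state constraint violated
assert math.isinf(ell_bar(0.0, 1.0))           # input constraint violated
```

Because `min` in Python propagates `math.inf` correctly, infeasible inputs are automatically ruled out when minimising $\bar{\ell}(x,u) + V(f(x,u))$ over $u$.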
Value Iteration of the Dynamic Programming Procedure
The value iteration for determining $V^\star(x)$, where $x \in X_{\infty}$, is
$$V_{k+1} = T V_k, \tag{2}$$
starting with a function $V_0$ (this is the terminal cost function of a finite-horizon optimal control problem), where for a function $V : \mathbb{R}^{n_x} \to \overline{\mathbb{R}}$ we define the dynamic programming operator
$$(T V)(x) = \inf_{u} \left\{ \bar{\ell}(x, u) + V(f(x, u)) \right\}.$$
Under certain conditions (on $f$, $\ell$, $X$ and $U$), the value iteration of Equation (2) converges (more on that later).
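The iteration $V_{k+1} = T V_k$ can be sketched on a hypothetical toy problem (all problem data below are illustrative): four states $\{0, 1, 2, 3\}$ on a line, inputs $\{-1, 0, +1\}$, clipped dynamics, and nonnegative stage cost $\ell(x, u) = x$, so that state $0$ is absorbing and cost-free:

```python
N = 4                      # hypothetical toy problem: states 0..3
U = (-1, 0, 1)             # inputs: move left, stay, move right

def f(x, u):
    """Dynamics: move along the line, clipped to the state space."""
    return max(0, min(N - 1, x + u))

def ell(x, u):
    """Nonnegative stage cost; state 0 is absorbing and cost-free."""
    return x

def T(V):
    """Dynamic programming operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

V = [0.0] * N              # V_0 = 0
for _ in range(10):        # value iteration V_{k+1} = T V_k
    V = T(V)

# Steering left every step is optimal, so V*(x) = x + (x-1) + ... + 1.
assert V == [0.0, 1.0, 3.0, 6.0]
print(V)
```

After a few iterations the iterates stop changing, and the limit is a fixed point of `T`.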
A notable property of $T$ is the following monotonicity property:
$$V \leq V' \;\Longrightarrow\; T V \leq T V'.$$
As a sidenote, recall that in the above equation, $V$ and $V'$ are extended-real-valued functions, $V, V' : \mathbb{R}^{n_x} \to \overline{\mathbb{R}}$, so the relation $V \leq V'$ means that (i) $V(x) \leq V'(x)$ for all $x \in \operatorname{dom} V'$, and (ii) $\operatorname{dom} V' \subseteq \operatorname{dom} V$.
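The monotonicity property is easy to check numerically. A minimal sketch, on a hypothetical four-state toy problem (illustrative data: states $\{0,\dots,3\}$, inputs $\{-1,0,+1\}$, stage cost $\ell(x,u) = x$):

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

V  = [0.0, 1.0, 2.0, 3.0]
Vp = [0.5, 1.0, 2.5, 4.0]                     # V <= V' pointwise

assert all(a <= b for a, b in zip(V, Vp))
# Monotonicity of the DP operator: TV <= TV' pointwise.
assert all(a <= b for a, b in zip(T(V), T(Vp)))
```

The reason is elementary: for each fixed $u$, $\bar{\ell}(x,u) + V(f(x,u)) \leq \bar{\ell}(x,u) + V'(f(x,u))$, and the inequality survives the infimum over $u$.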
Suppose $V_0$ has a domain $D_0 = \operatorname{dom} V_0$. Then,
$$\operatorname{dom} V_1 = D_1 = \{ x \in X : R(x) \cap D_0 \neq \emptyset \},$$
where
$$R(x) = \{ f(x, u) : u \in U(x) \}.$$
Recursively,
$$\operatorname{dom} V_{k+1} = D_{k+1} = \{ x \in X : R(x) \cap D_k \neq \emptyset \},$$
where $D_k = \operatorname{dom} V_k$.
Note: $R(x)$ is the set of states that $x$ can move to in one step. $D_k$ is the set of states that can be steered to $D_0$ in $k$ steps.
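The recursion for the domains $D_k$ is a plain backward-reachability computation. A sketch on a hypothetical six-state line example (illustrative data; $D_0 = \{0\}$ plays the role of $\operatorname{dom} V_0$):

```python
N = 6
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics

def R(x):
    """One-step reachable set R(x) = { f(x,u) : u in U(x) }."""
    return {f(x, u) for u in U}

def pre(D):
    """States that can be steered into D in one step."""
    return {x for x in range(N) if R(x) & D}

D = {0}                    # D_0 = dom V_0: hypothetical terminal set
domains = [D]
for _ in range(N):
    D = pre(D)
    domains.append(D)      # D_{k+1} = { x : R(x) intersects D_k }

assert domains[1] == {0, 1}
assert domains[2] == {0, 1, 2}
assert domains[5] == set(range(6))           # all states reach D_0 in 5 steps
print(domains)
```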
Hereafter we shall assume that for all $x$ and all $k$ the minimum in the definition of $T V_k$ in Equation (2) is attained.
A Note on Notation
We have denoted the DP iterates by $V_k$. The value function of the optimal control problem has been denoted by $V^\star$ (defined in Equation (1)), which looks disturbingly as if it denotes the limit of $V_k$ as $k \to \infty$. We will denote this limit by $V_\infty$.
Without any assumptions on the problem data we don't know whether the limit $V_\infty$ exists, let alone whether the iteration converges to $V^\star$.
Spoilers: under certain conditions (some weak regularity conditions plus some conditions on $V_0$) we will be able to show that the limit exists and $V_\infty = V^\star$.
Bellman’s Equation
The celebrated Bellman equation states that $V^\star = T V^\star$; this is a functional equation that must be satisfied by $V^\star$; it is a necessary, but not sufficient, condition for optimality. This means that there can be other functions, besides $V^\star$, which are fixed points of $T$ (i.e., they satisfy Bellman's equation).
Under certain conditions, the value iteration converges to a limit, $V_\infty$, which satisfies Bellman's equation. We will state such conditions in the next section.
The following theorem states some important facts about the set of fixed points of the dynamic programming operator, $T$: it shows that $V^\star$ is a "minimal" fixed point of $T$ in the following sense: if $V$ is a nonnegative fixed point of $T$, i.e., $V = T V$, then $V^\star \leq V$.
Theorem 1 (Fixed points of DP operator). Suppose that $V \geq 0$ is a function such that for all $x$ it is
$$V(x) = \min_{u} \left\{ \bar{\ell}(x, u) + V(f(x, u)) \right\},$$
where the minimum is attained and $\kappa(x)$ is a minimiser of the optimisation problem that defines $(T V)(x)$, i.e., $\kappa(x) \in \operatorname*{argmin}_{u} \{ \bar{\ell}(x, u) + V(f(x, u)) \}$. Then
$$V^\star(x) \leq V(x), \text{ for all } x,$$
where $V^\star$ is the value function defined in Equation (1).
Proof. Fix $x_0$ and let $u_t = \kappa(x_t)$, $x_{t+1} = f(x_t, u_t)$, be the closed-loop sequence. Clearly, by the definition of $\kappa$, we have $V(x_t) = \bar{\ell}(x_t, u_t) + V(x_{t+1})$, therefore $V(x_t) - V(x_{t+1}) = \bar{\ell}(x_t, u_t)$, and summing up from $t = 0$ to $t = T-1$ we have
$$V(x_0) - V(x_T) = \sum_{t=0}^{T-1} \bar{\ell}(x_t, u_t),$$
and taking $T \to \infty$, using $V(x_T) \geq 0$, we have
$$V(x_0) \geq \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t).$$
Since the right-hand side is then finite (whenever $V(x_0) < \infty$), the sequence $(u_t)_{t \in \mathbb{N}}$ is admissible, so
$$V(x_0) \geq \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t) = J(x_0, u) \geq V^\star(x_0),$$
which completes the proof.
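Theorem 1 can be illustrated on a hypothetical four-state toy problem (illustrative data as before: clipped line dynamics, stage cost $\ell(x,u) = x$, for which $V^\star = (0, 1, 3, 6)$). In a deterministic problem, shifting $V^\star$ by a positive constant yields another nonnegative fixed point of $T$, and it dominates $V^\star$, consistent with the theorem:

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

Vstar = [0.0, 1.0, 3.0, 6.0]   # value function of this toy problem
assert T(Vstar) == Vstar        # V* satisfies Bellman's equation

# Shifting by a constant gives another nonnegative fixed point of T ...
V = [v + 2.0 for v in Vstar]
assert T(V) == V
# ... and, consistently with Theorem 1, it dominates V*.
assert all(vs <= v for vs, v in zip(Vstar, V))
```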
Convergence of Value Iteration
In this last section we will show that the DP iterates $V_k$ converge to the value function $V^\star$ of the infinite-horizon optimal control problem when $V_0$ is nonnegative and upper bounded by $V^\star$. This is the case, for example, if $V_0 = 0$.
The theorem is proven by leveraging the monotonicity property of $T$ and Theorem 1.
Theorem 2 (Convergence of DP iterations). Suppose that
$$\ell(x, u) \geq 0, \text{ for all } x \in X,\ u \in U(x),$$
where $f$ is continuous and the set $Z = \{(x, u) : x \in X,\ u \in U(x)\}$ is closed. If,
$$0 \leq V_0 \leq V^\star,$$
then $V_\infty = \lim_{k \to \infty} V_k$ exists. If, additionally, $V^\star$ is a fixed point of $T$, then $V_\infty(x) = V^\star(x)$ for all $x$.
Proof. Let us first prove this for $V_0 = 0$ (i.e., $V_0(x) = 0$ for all $x$ and $\operatorname{dom} V_0 = \mathbb{R}^{n_x}$). Then $V_1 = T V_0 \geq V_0$ and recursively $V_{k+1} \geq V_k$. Moreover, $V_k \leq V^\star$ for all $k$: $V_k$ is the optimal cost of a $k$-horizon truncation with zero terminal cost, and $\bar{\ell} \geq 0$. We have an increasing bounded sequence, therefore, it converges, i.e., the limit $V_\infty = \lim_{k \to \infty} V_k$ exists and $V_\infty \leq V^\star$ [for details, see D. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2, Athena Scientific, 2007; Chapter 3].
From Theorem 1, this implies that $V^\star \leq V_\infty$ (this is because $V_\infty$ satisfies Bellman's equation), so $V_\infty = V^\star$. This proves the assertion when $V_0 = 0$.
Let us denote the above sequence of functions by $V_k^{0}$, that is, the sequence generated by Equation (2) with $V_0 = 0$. We use this notation to indicate that the initial function is taken to be equal to zero.
Suppose that $V_0$ is any other function that satisfies the requirements of the theorem, that is, $0 \leq V_0 \leq V^\star$. Then, $V_0^{0} \leq V_0 \leq V^\star$.
From the monotonicity of $T$ we can show that
$$V_k^{0} \leq V_k \leq T^k V^\star = V^\star, \text{ for all } k.$$
Taking the limit as $k \to \infty$ completes the proof (by virtue of the sandwich theorem aka squeeze theorem).
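The sandwich argument can be watched numerically. Below is a sketch on the hypothetical four-state toy problem used earlier (illustrative data; $V^\star = (0, 1, 3, 6)$), iterating from $V_0 = 0$ and from another admissible choice $V_0 = V^\star / 2$:

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

Vstar = [0.0, 1.0, 3.0, 6.0]   # value function of this toy problem
A = [0.0] * N                  # the iterates V_k^0, starting from V_0 = 0
B = [v / 2 for v in Vstar]     # another V_0 with 0 <= V_0 <= V*

for _ in range(10):
    # Sandwich: T^k 0 <= T^k V_0 <= T^k V* = V* at every iteration.
    assert all(a <= b <= v for a, b, v in zip(A, B, Vstar))
    A, B = T(A), T(B)

# Both sequences converge to V*, as the squeeze argument predicts.
assert A == Vstar and B == Vstar
```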