Convergence of dynamic programming iterates

We are interested in the following infinite-horizon optimal control problem

\begin{aligned}\mathbb{P}_\infty(x): \mathrm{Minimise}_{\{u_t\}_{t=0}^{\infty}, \{x_t\}_{t=0}^{\infty}} & \sum_{t=0}^{\infty}\ell(x_t, u_t)\\ \text{s.t.}\, & x_{t+1} = f(x_t, u_t), \forall t\in\mathbb{N},\\ & x_t \in X, u_t \in U, \forall t\in\mathbb{N},\\ & x_{0} = x.\end{aligned}

We ask under what conditions the dynamic programming value iterates converge. We will state and prove a useful convergence theorem, but first we need a few definitions.

Firstly, we consider the discrete-time dynamical system

\begin{aligned}x_{t+1} = f(x_t, u_t),\end{aligned}

with state x_t\in\mathbb{R}^n and input u_t\in\mathbb{R}^m, for t\in\mathbb{N}.

Starting from an initial state x_0=x and applying an infinite sequence of inputs \mathbf{u} = (u_0, u_1, \ldots), we obtain a sequence of states \mathbf{x} = (x_0, x_1, \ldots), with

\begin{aligned}x_t = \phi(t; x, \mathbf{u}).\end{aligned}
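To fix ideas, here is a minimal sketch in Python of the trajectory map \phi: it simply rolls the dynamics forward from x_0 = x under a given input sequence. The scalar dynamics used below are a hypothetical placeholder, not part of the problem above.

```python
# A minimal sketch of the trajectory map phi(t; x, u): roll the dynamics
# forward from x_0 = x. The scalar dynamics f below are purely illustrative.

def f(x, u):
    return 0.9 * x + 0.5 * u  # hypothetical dynamics x_{t+1} = 0.9 x_t + 0.5 u_t

def phi(t, x, u_seq):
    """Return x_t obtained by applying u_seq[0], ..., u_seq[t-1] starting from x_0 = x."""
    for k in range(t):
        x = f(x, u_seq[k])
    return x

print(phi(3, 1.0, [0.0, -0.1, 0.2]))  # the state reached after three steps
```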

The objective is to determine a sequence \mathbf{u}^\star that minimises the cost

\begin{aligned}V_{\infty}(\mathbf{u}, x) {}={} \sum_{t=0}^{\infty} \ell(x_t, u_t),\end{aligned}

where \ell:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}_+ is a stage cost function, subject to the constraints

\begin{aligned}x_t {}\in{} X, \text{ and } u_t {}\in{} U,\ \forall t\in\mathbb{N}.\end{aligned}

We say that a sequence \mathbf{u} is admissible for \mathbb{P}_\infty(x) if it is in

\begin{aligned}\mathcal{U}_{\infty}(x) = \left\{ \mathbf{u} = \{u_t\}_{t\in\mathbb{N}} \left| \begin{array}{l} \phi(t; x, \mathbf{u}) \in X, \\ u_t \in U, \forall t \in \mathbb{N} \end{array} \hspace{-0.5em} \right. \right\},\end{aligned}

where \mathcal{U}_\infty is a set-valued mapping.

The set of feasible states is defined as

\begin{aligned}X_\infty = \mathbf{dom} \mathcal{U}_\infty = \{x\in\mathbb{R}^n {}:{} \mathcal{U}_\infty(x) \neq \emptyset \}.\end{aligned}

By definition, X_\infty \subseteq X.

The infinite-horizon optimal control problem can be written as

\begin{aligned}\mathbb{P}_\infty(x): \mathrm{Minimise}_{\mathbf{u}} \, & \, V_\infty(\mathbf{u}, x) \\ \text{s.t.}\, & \,\mathbf{u} \in \mathcal{U}_\infty(x).\end{aligned}

Define the value function of \mathbb{P}_\infty(x) as V_\infty^\star(x) = \inf_{\mathbf{u}} \{ V_\infty(\mathbf{u}, x) {}:{} \mathbf{u} \in \mathcal{U}_\infty(x)\} and the solution mapping as \mathcal{U}_\infty^\star(x) = \mathrm{argmin}_{\mathbf{u}} \{ V_\infty(\mathbf{u}, x) {}:{} \mathbf{u} \in \mathcal{U}_\infty(x)\}.

Important: \mathbf{u} \in \mathcal{U}_\infty(x) does not imply that V_\infty(\mathbf{u}, x) is finite: suppose that \mathbf{u} = (u_0, u_1, \ldots) is such that u_t \in U and x_t \in X for all t\in\mathbb{N}, but \ell(x_t, u_t) \not\to 0; then

\begin{aligned}V_\infty(\mathbf{u}, x) = \sum_{t=0}^{\infty}\ell(x_t, u_t) \text{ diverges}\end{aligned}

Therefore,

\begin{aligned}\mathbf{u}: \text{admissible} \not\Rightarrow V_\infty(\mathbf{u}, x) \text{ exists (converges)}\end{aligned}

and

\begin{aligned}x \in X_\infty \not\Rightarrow x \in \mathbf{dom} V_\infty^\star.\end{aligned}

but

\begin{aligned}\underbrace{\mathbf{dom} \mathcal{U}_\infty^\star \ \subseteq \ \underbrace{\mathbf{dom} V_\infty^\star}_{X_\infty^\star}}_{\text{equal under regularity assum.}} \subseteq \ \underbrace{\mathbf{dom} \mathcal{U}_\infty}_{X_\infty}\end{aligned}


The infinite-horizon constrained OCP can be written as

\begin{aligned}\mathbb{P}_\infty(x): \mathrm{Minimise}_{\{u_t\}_{t=0}^{\infty}, \{x_t\}_{t=0}^{\infty}} & \sum_{t=0}^{\infty}\bar\ell(x_t, u_t) \\ \text{s.t.}\, & x_{t+1} = f(x_t, u_t), \forall t\in\mathbb{N}, \\ & x_{0} = x,\end{aligned}

where \bar\ell is the stage cost function plus the indicator of the state and input constraints, i.e.,

\begin{aligned}\bar\ell(x, u) = \ell(x, u) + \delta_{X\times U}(x, u),\end{aligned}

where

\begin{aligned}\delta_{X\times U}(x, u) = \begin{cases}0, &\text{ for } x\in X \text{ and } u \in U\\+\infty,&\text{ otherwise}\end{cases}\end{aligned}

Note that \bar\ell is an extended-real-valued function.
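As a quick illustration, here is a sketch of how \bar\ell can be implemented numerically: the indicator \delta_{X\times U} becomes a branch that returns +\infty outside the constraints. The quadratic stage cost and the box constraints X = U = [-1, 1] are assumptions made only for this example.

```python
import numpy as np

def ell(x, u):
    return x**2 + u**2                # hypothetical stage cost ell(x, u)

def ell_bar(x, u, x_bound=1.0, u_bound=1.0):
    # bar-ell(x, u) = ell(x, u) + delta_{X x U}(x, u), with X = U = [-x_bound, x_bound]
    if abs(x) <= x_bound and abs(u) <= u_bound:
        return ell(x, u)              # the indicator vanishes inside the constraints
    return np.inf                     # and is +inf outside

print(ell_bar(0.5, 0.1), ell_bar(1.5, 0.0))  # finite inside the constraints, inf outside
```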

Value Iteration of the Dynamic Programming Procedure

The Value Iteration for determining

\begin{aligned}V_\infty^\star(x) = \inf_{\mathbf{u}}\left\{\sum_{t=0}^{\infty}\bar{\ell}(x_t, u_t) {}:{} \mathbf{u} \in \mathcal{U}_{\infty}(x)\right\},\qquad\qquad(1)\end{aligned}

where x_t = \phi(t; x, \mathbf{u}), is

\begin{aligned}\boxed{V_{t+1}^{\star}(x) = \mathbf{T} V_{t}^{\star}(x)}\qquad\qquad(2)\end{aligned}

starting from a function V_0^\star (which plays the role of the terminal cost function of a finite-horizon optimal control problem), where, for a function V:\mathbb{R}^n\to[0, \infty], the dynamic programming operator is defined as

\begin{aligned}(\mathbf{T} V)(x) = \inf_u \{\bar\ell(x, u) + V(f(x, u))\}.\qquad\qquad(3)\end{aligned}

Under certain conditions (on \bar\ell, f and V_0^\star), the value iteration of Equation (2) converges (more on that later).
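To make Equations (2)–(3) concrete, here is a minimal sketch of one application of \mathbf{T} on a gridded scalar problem. The dynamics, stage cost, box constraints, grid resolution and the nearest-neighbour evaluation of V(f(x, u)) are all assumptions of this example, not prescriptions of the text; the gridded operator only approximates the exact one.

```python
import numpy as np

x_grid = np.linspace(-1.0, 1.0, 101)   # discretised state constraint set X
u_grid = np.linspace(-1.0, 1.0, 51)    # discretised input constraint set U

def f(x, u):
    return 0.9 * x + 0.5 * u           # hypothetical dynamics

def ell_bar(x, u):
    # stage cost plus the indicator of X x U (here X = U = [-1, 1])
    return x**2 + u**2 if abs(x) <= 1.0 and abs(u) <= 1.0 else np.inf

def V_eval(V_values, x):
    # evaluate a grid function at x: nearest node inside the grid, +inf outside
    if x < x_grid[0] or x > x_grid[-1]:
        return np.inf
    return V_values[np.argmin(np.abs(x_grid - x))]

def T(V_values):
    """One application of the DP operator: (TV)(x) = inf_u { ell_bar(x, u) + V(f(x, u)) }."""
    return np.array([min(ell_bar(x, u) + V_eval(V_values, f(x, u)) for u in u_grid)
                     for x in x_grid])

V1 = T(np.zeros_like(x_grid))          # one value-iteration step starting from V_0 = delta_X
```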

A notable property of \mathbf{T} is the following monotonicity property:

\begin{aligned}V \leq V' \Rightarrow \mathbf{T}V \leq \mathbf{T}V'.\end{aligned}

As a side note, recall that in the above equation, V and V' are extended-real-valued functions, V,V':\mathbb{R}^n\to\bar{\mathbb{R}}_+, so the relation V \leq V' means that (i) V(x) \leq V'(x) for all x\in\mathbf{dom} V', and (ii) \mathbf{dom}V \supseteq \mathbf{dom}V'.

Suppose V_0^\star:\mathbb{R}^n\to\bar{\mathbb{R}}_+ has a domain X_0. Then,

\begin{aligned}V_1^\star(x) = \inf_{u \in U}\{\ell(x, u) + V_0^\star(f(x, u))\},\ x \in X_1,\end{aligned}

where

\begin{aligned}X_1 = \{x\in X{}:{} \exists u \in U, \text{ such that } f(x, u) \in X_0 \}.\end{aligned}

Recursively,

\begin{aligned}V_{t+1}^\star(x) = \inf_{u \in U}\{\ell(x, u) + V_t^\star(f(x, u))\},\ x \in X_{t+1},\end{aligned}

where

\begin{aligned}X_{t+1} = \{x\in X{}:{} \exists u \in U, \text{ such that } f(x, u) \in X_t \}.\end{aligned}

Note: X_{t+1} is the set of states that can move to X_t in one step. X_t is the set of states that can be steered to X_0 in t steps.
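In the same spirit, the recursion for X_{t+1} can be approximated on a grid: X_{t+1} collects the grid points of X from which some gridded input sends the state (within a tolerance) into the current approximation of X_t. The dynamics, grids, tolerance and the choice of X_0 below are assumptions made only for this sketch.

```python
import numpy as np

x_grid = np.linspace(-1.0, 1.0, 201)        # gridded X
u_grid = np.linspace(-1.0, 1.0, 51)         # gridded U

def f(x, u):
    return 0.9 * x + 0.5 * u                # hypothetical dynamics

def member(y, points, tol=1.5e-2):
    # approximate membership test: y belongs to the set if it is close to one of its grid points
    return points.size > 0 and np.min(np.abs(points - y)) <= tol

def backward_step(X_t):
    """X_{t+1} = { x in X : exists u in U such that f(x, u) in X_t } (grid approximation)."""
    return np.array([x for x in x_grid
                     if any(member(f(x, u), X_t) for u in u_grid)])

X_sets = [x_grid[np.abs(x_grid) <= 0.2]]    # X_0 = dom V_0 (a hypothetical terminal set)
for _ in range(4):
    X_sets.append(backward_step(X_sets[-1]))
print([len(X) for X in X_sets])             # the gridded sets X_0, ..., X_4 grow with t
```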

Hereafter we shall assume that, for all t\in\mathbb{N} and all x \in X_{t+1}, the infimum in the definition of \mathbf{T}V_t^\star in Equation (2) is attained.

A Note on Notation

We have denoted the DP iterates by V_0^\star, V_1^\star, \ldots. The value function of the optimal control problem has been denoted by V_\infty^\star (defined in Equation (1)), which looks disturbingly as if it denotes the limit of V_t^\star as t\to\infty. We will denote this limit by \lim_t V_t^\star.

Without any assumptions on the problem data we do not know whether the limit \lim_t V_t^\star exists, let alone whether it equals V_\infty^\star.

Spoilers: under certain conditions (some weak regularity conditions plus some conditions on V_0^\star) we will be able to show that the limit exists and \lim_t V_t^\star = V_\infty^\star.

Bellman’s Equation

The celebrated Bellman’s equation states that V_\infty^\star = \mathbf{T}V_\infty^\star; this is a functional equation that must be satisfied by V_\infty^\star; it is a necessary, but not sufficient condition for optimality. This means that there can be other functions, which are fixed points of \mathbf{T} (i.e., they satisfy Bellman’s equation).

Under certain conditions, the value iteration converges to a limit, \lim_t V_t^\star, which satisfies Bellman’s equation. We will state such conditions in the next section.

The following theorem states some important facts about the set of fixed points of the dynamic programming operator, \mathbf{T}: it shows that V_\infty^\star is a “minimal” fixed point of \mathbf{T} in the following sense: if V is a fixed point of \mathbf{T}, i.e., \mathbf{T}V = V, then V \geq V_\infty^\star.

Theorem 1 (Fixed points of DP operator). Suppose that V:\mathbb{R}^n\to\bar{\mathbb{R}}_+ is a function such that, for all x \in X,

\begin{aligned}V(x) \geq (\mathbf{T}V)(x),\end{aligned}

where the infimum in the definition of \mathbf{T}V is attained and \kappa(x) denotes a minimiser, i.e., \kappa(x) \in \mathbf{argmin}_u \{\bar\ell(x, u) + V(f(x, u))\}. Then

\begin{aligned}V_\infty^\star(x) \leq \sum_{t=0}^{\infty}\bar{\ell}(x_t, \kappa(x_t)) \leq V(x),\end{aligned}

where x_t is the closed-loop trajectory generated by \kappa, i.e., x_{t+1} = f(x_t, \kappa(x_t)) with x_0 = x.

Proof. By the definition of \kappa, we have V(x) \geq \bar{\ell}(x, \kappa(x)) + V(f(x, \kappa(x))); therefore, V(x_t) \geq \bar{\ell}(x_t, \kappa(x_t)) + V(x_{t+1}), and summing up from t=0 to t=N-1 (the sum telescopes) we have

\begin{aligned}\sum_{t=0}^{N-1}\bar{\ell}(x_t, \kappa(x_t)) \leq V(x) - V(x_N) \leq V(x),\end{aligned}

and taking N \to \infty, we have

\begin{aligned}\sum_{t=0}^{\infty}\bar{\ell}(x_t, \kappa(x_t)) \leq V(x).\end{aligned}

Since the sequence \mathbf{u} = (\kappa(x_0), \kappa(x_1), \kappa(x_2), \ldots) is admissible whenever the above sum is finite (and the claimed inequality is trivial otherwise), we have

\begin{aligned}V_\infty^\star(x) \leq \sum_{t=0}^{\infty}\bar{\ell}(x_t, \kappa(x_t)),\end{aligned}

which completes the proof. \blacksquare

Convergence of Value Iteration

In this last section we will show that the DP iterates converge to the value function of the infinite-horizon optimal control problem when V_0^\star is upper bounded by V_\infty^\star. This is the case, for example, if V_0^\star(x) = \delta_X(x).

The theorem is proven by leveraging the monotonicity properties of \mathbf{T} and Theorem 1.

Theorem 2 (Convergence of DP iterations). Suppose that

\begin{aligned}V_0^\star(x) = V_f(x) + \delta_{X_f}(x),\end{aligned}

where V_f:\mathbb{R}^n\to \bar{\mathbb{R}}_+ is continuous and X_f \subseteq X is closed. If,

\begin{aligned}V_0^\star(x) \leq V_\infty^\star(x),\end{aligned}

then \lim_{t\to\infty} V_t^\star(x) exists. If, additionally, \lim_{t\to\infty} V_t^\star is a fixed point of \mathbf{T}, then \lim_{t\to\infty} V_t^\star(x) = V_\infty^\star(x) for all x\in X.

Proof. Let us first prove this for V_0^\star = \delta_X (i.e., V_f=0 and X_f=X). Then 0 = V_0^\star(x) \leq (\mathbf{T} V_0^\star)(x) = V_1^\star(x) for all x \in X, and, recursively, V_0^\star \leq V_1^\star \leq V_2^\star \leq \ldots \leq V_\infty^\star (the last inequality follows from V_0^\star \leq V_\infty^\star and the monotonicity of \mathbf{T}, since \mathbf{T}V_\infty^\star = V_\infty^\star). For every x, the sequence (V_t^\star(x))_t is nondecreasing and bounded above by V_\infty^\star(x), therefore it converges in the extended reals, i.e., the limit \lim_t V_t^\star exists and \lim_t V_t^\star = \mathbf{T}(\lim_t V_t^\star) [for details, see D. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2, Athena Scientific, 2007; Chapter 3].

From Theorem 1, this implies that \lim_t V_t^\star \geq V_\infty^\star (this is because \lim_t V_t^\star satisfies Bellman’s equation). Combined with \lim_t V_t^\star \leq V_\infty^\star, this yields \lim_t V_t^\star = V_\infty^\star, which proves the assertion when V_0^\star = \delta_X.

Let us denote this sequence of functions by (V_t^\star(\delta_X))_t, with V_0^\star(\delta_X) = \delta_X. We use this notation to indicate that V_0^\star is taken to be equal to \delta_X.

Suppose that V_0^\star is any other function that satisfies the requirements of the theorem. Then, V_0^\star \geq V_0^\star(\delta_X) = \delta_X.

From the monotonicity of \mathbf{T} we can show that

\begin{aligned}V_t^\star(\delta_X) \leq V_t^\star \leq V_{\infty}^\star.\end{aligned}

Taking the limit as t\to\infty completes the proof (by virtue of the sandwich theorem aka squeeze theorem). \blacksquare
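As a numerical sanity check of Theorem 2, here is a sketch (on the same kind of hypothetical gridded problem used earlier, so an approximation rather than part of the proof) showing that, starting from V_0^\star = \delta_X, the iterates are monotonically nondecreasing and the updates eventually become negligible.

```python
import numpy as np

x_grid = np.linspace(-1.0, 1.0, 101)
u_grid = np.linspace(-1.0, 1.0, 51)

def f(x, u):
    return 0.9 * x + 0.5 * u                       # hypothetical dynamics

def ell_bar(x, u):
    return x**2 + u**2 if abs(x) <= 1.0 and abs(u) <= 1.0 else np.inf

def V_eval(V, x):
    return np.inf if x < -1.0 or x > 1.0 else V[np.argmin(np.abs(x_grid - x))]

def T(V):
    return np.array([min(ell_bar(x, u) + V_eval(V, f(x, u)) for u in u_grid)
                     for x in x_grid])

V = np.zeros_like(x_grid)                          # V_0 = delta_X: zero on the gridded X
for t in range(25):
    V_next = T(V)
    assert np.all(V_next >= V - 1e-12)             # monotone nondecreasing iterates
    print(t, float(np.max(V_next - V)))            # the size of the update shrinks with t
    V = V_next
```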
