We are interested in an infinite-horizon optimal control problem, and we ask under what conditions the dynamic programming value iterates converge. We will state and prove a useful convergence theorem, but first we need some definitions.
Firstly, we consider the discrete-time dynamical system
$$x_{t+1} = f(x_t, u_t),$$
with state $x_t \in \mathbb{R}^{n_x}$ and input $u_t \in \mathbb{R}^{n_u}$, for $t \in \mathbb{N}$.
Starting from an initial state $x_0$ and applying an infinite sequence of inputs $u = (u_0, u_1, \ldots)$, we obtain a sequence of states $(x_0, x_1, x_2, \ldots)$, with $x_{t+1} = f(x_t, u_t)$ for all $t \in \mathbb{N}$.
The objective is to determine a sequence $u = (u_0, u_1, \ldots)$ that minimises the cost
$$J(x_0, u) = \sum_{t=0}^{\infty} \ell(x_t, u_t),$$
where $\ell$ is a stage cost function, subject to the constraints
$$x_t \in X, \quad u_t \in U(x_t), \quad t \in \mathbb{N}.$$
We say that a sequence $u = (u_t)_{t \in \mathbb{N}}$ is admissible for $x$ if it is in
$$\mathcal{U}(x) = \left\{ (u_t)_{t \in \mathbb{N}} \;:\; x_0 = x,\; u_t \in U(x_t),\; x_{t+1} = f(x_t, u_t) \in X,\; t \in \mathbb{N} \right\},$$
where $U : X \rightrightarrows \mathbb{R}^{n_u}$ is a set-valued mapping.
The set of feasible states is defined as
$$X_{\infty} = \{ x \in X : \mathcal{U}(x) \neq \emptyset \}.$$
By definition, $X_{\infty} \subseteq X$.
The infinite-horizon optimal control problem can be written as
$$\mathbb{P}(x) \;:\; \operatorname*{minimise}_{u \in \mathcal{U}(x)}\; J(x, u). \tag{1}$$
Define the value function of $\mathbb{P}(x)$ as $V^\star(x) = \inf_{u \in \mathcal{U}(x)} J(x, u)$ and the solution mapping as $U^\star(x) = \operatorname*{argmin}_{u \in \mathcal{U}(x)} J(x, u)$.
Important: $V^\star(x) < \infty$ does not imply that $U^\star(x)$ is nonempty, as the infimum may fail to be attained. Suppose, for example, that the system is $x_{t+1} = u_t x_t$ with $X = \mathbb{R}$, $U(x) = (0, 1)$ and $\ell(x, u) = |x|$ for all $x$; then, for $x_0 \neq 0$, every admissible sequence incurs the cost
$$J(x_0, u) = |x_0| \left( 1 + u_0 + u_0 u_1 + \cdots \right) > |x_0|.$$
Therefore,
$$V^\star(x_0) = \inf_{u \in \mathcal{U}(x_0)} J(x_0, u) = |x_0|,$$
and
$$V^\star(x_0) < \infty,$$
but no admissible sequence attains this infimum, so $U^\star(x_0) = \emptyset$.
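The non-attainment of the infimum can be sketched numerically. The snippet below (a hypothetical illustration: the system $x_{t+1} = u_t x_t$ with $u_t \in (0,1)$ and stage cost $|x_t|$, run with constant inputs $u_t = \varepsilon$) shows the cost approaching $|x_0|$ from above without ever reaching it:

```python
# Illustrative example (not from the original problem data): x_{t+1} = u_t * x_t
# with u_t in (0, 1) and stage cost |x_t|. For the constant policy u_t = eps,
# the infinite-horizon cost from x_0 is |x_0| * sum_{t>=0} eps^t = |x_0|/(1-eps).

def cost(x0, eps):
    """Closed-form infinite-horizon cost for the constant policy u_t = eps."""
    return abs(x0) / (1.0 - eps)

x0 = 2.0
costs = [cost(x0, eps) for eps in (0.5, 0.1, 0.01, 0.001)]

# The costs decrease towards |x0| = 2, yet every admissible sequence pays
# strictly more than |x0|: the infimum V*(x0) = |x0| is not attained.
assert costs == sorted(costs, reverse=True)
assert all(c > abs(x0) for c in costs)
print(costs)
```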
The infinite-horizon constrained OCP can be written as
$$V^\star(x_0) = \inf_{(u_t)_{t \in \mathbb{N}}} \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t),$$
where $\bar{\ell}$ is the stage cost function plus the indicator of the state and input constraints, i.e.,
$$\bar{\ell}(x, u) = \ell(x, u) + \delta\left( (x, u) \mid Z \right),$$
where $Z = \{ (x, u) : x \in X,\ u \in U(x) \}$ and $\delta(\cdot \mid Z)$ is the indicator function of $Z$, that is, $\delta(z \mid Z) = 0$ if $z \in Z$ and $\delta(z \mid Z) = +\infty$ otherwise.
Note that $\bar{\ell} : \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \to \overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ is an extended-real-valued function.
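In computations, the extended-real-valued cost $\bar{\ell}$ is conveniently encoded with floating-point infinity. A minimal sketch, assuming hypothetical box constraints $X = [-1, 1]$ and $U(x) = [-0.5, 0.5]$ and a quadratic stage cost (all illustrative choices):

```python
import math

def ell(x, u):
    """Hypothetical stage cost (illustrative only)."""
    return x**2 + u**2

def ell_bar(x, u, x_lo=-1.0, x_hi=1.0, u_lo=-0.5, u_hi=0.5):
    """Stage cost plus the indicator of the state and input constraints:
    ell(x, u) when (x, u) is in Z, and +infinity otherwise."""
    if x_lo <= x <= x_hi and u_lo <= u <= u_hi:
        return ell(x, u)
    return math.inf

assert abs(ell_bar(0.5, 0.2) - 0.29) < 1e-12   # feasible pair: finite cost
assert math.isinf(ell_bar(2.0, 0.0))           # state constraint violated
assert math.isinf(ell_bar(0.0, 1.0))           # input constraint violated
```

Because `min` in Python propagates `math.inf` correctly, infeasible inputs are automatically ruled out when minimising $\bar{\ell}(x,u) + V(f(x,u))$ over $u$.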
Value Iteration of the Dynamic Programming Procedure
The value iteration for determining $V^\star(x)$, where $x \in X_{\infty}$, is
$$V_{k+1} = T V_k, \tag{2}$$
starting with a function $V_0$ (this is the terminal cost function of a finite-horizon optimal control problem), where for a function $V : \mathbb{R}^{n_x} \to \overline{\mathbb{R}}$ we define the dynamic programming operator
$$(T V)(x) = \inf_{u} \left\{ \bar{\ell}(x, u) + V(f(x, u)) \right\}.$$
Under certain conditions (on $f$, $\ell$, $X$ and $U$), the value iteration of Equation (2) converges (more on that later).
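The iteration $V_{k+1} = T V_k$ can be sketched on a hypothetical toy problem (all problem data below are illustrative): four states $\{0, 1, 2, 3\}$ on a line, inputs $\{-1, 0, +1\}$, clipped dynamics, and nonnegative stage cost $\ell(x, u) = x$, so that state $0$ is absorbing and cost-free:

```python
N = 4                      # hypothetical toy problem: states 0..3
U = (-1, 0, 1)             # inputs: move left, stay, move right

def f(x, u):
    """Dynamics: move along the line, clipped to the state space."""
    return max(0, min(N - 1, x + u))

def ell(x, u):
    """Nonnegative stage cost; state 0 is absorbing and cost-free."""
    return x

def T(V):
    """Dynamic programming operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

V = [0.0] * N              # V_0 = 0
for _ in range(10):        # value iteration V_{k+1} = T V_k
    V = T(V)

# Steering left every step is optimal, so V*(x) = x + (x-1) + ... + 1.
assert V == [0.0, 1.0, 3.0, 6.0]
print(V)
```

After a few iterations the iterates stop changing, and the limit is a fixed point of `T`.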
A notable property of $T$ is the following monotonicity property:
$$V \leq V' \;\Longrightarrow\; T V \leq T V'.$$
As a sidenote, recall that in the above equation, $V$ and $V'$ are extended-real-valued functions, $V, V' : \mathbb{R}^{n_x} \to \overline{\mathbb{R}}$, so the relation $V \leq V'$ means that (i) $V(x) \leq V'(x)$ for all $x \in \operatorname{dom} V'$, and (ii) $\operatorname{dom} V' \subseteq \operatorname{dom} V$.
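The monotonicity property is easy to check numerically. A minimal sketch, on a hypothetical four-state toy problem (illustrative data: states $\{0,\dots,3\}$, inputs $\{-1,0,+1\}$, stage cost $\ell(x,u) = x$):

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

V  = [0.0, 1.0, 2.0, 3.0]
Vp = [0.5, 1.0, 2.5, 4.0]                     # V <= V' pointwise

assert all(a <= b for a, b in zip(V, Vp))
# Monotonicity of the DP operator: TV <= TV' pointwise.
assert all(a <= b for a, b in zip(T(V), T(Vp)))
```

The reason is elementary: for each fixed $u$, $\bar{\ell}(x,u) + V(f(x,u)) \leq \bar{\ell}(x,u) + V'(f(x,u))$, and the inequality survives the infimum over $u$.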
Suppose $V_0$ has a domain $D_0 = \operatorname{dom} V_0$. Then,
$$\operatorname{dom} V_1 = D_1 = \{ x \in X : R(x) \cap D_0 \neq \emptyset \},$$
where
$$R(x) = \{ f(x, u) : u \in U(x) \}.$$
Recursively,
$$\operatorname{dom} V_{k+1} = D_{k+1} = \{ x \in X : R(x) \cap D_k \neq \emptyset \},$$
where $D_k = \operatorname{dom} V_k$.
Note: $R(x)$ is the set of states that $x$ can move to in one step. $D_k$ is the set of states that can be steered to $D_0$ in $k$ steps.
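The recursion for the domains $D_k$ is a plain backward-reachability computation. A sketch on a hypothetical six-state line example (illustrative data; $D_0 = \{0\}$ plays the role of $\operatorname{dom} V_0$):

```python
N = 6
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics

def R(x):
    """One-step reachable set R(x) = { f(x,u) : u in U(x) }."""
    return {f(x, u) for u in U}

def pre(D):
    """States that can be steered into D in one step."""
    return {x for x in range(N) if R(x) & D}

D = {0}                    # D_0 = dom V_0: hypothetical terminal set
domains = [D]
for _ in range(N):
    D = pre(D)
    domains.append(D)      # D_{k+1} = { x : R(x) intersects D_k }

assert domains[1] == {0, 1}
assert domains[2] == {0, 1, 2}
assert domains[5] == set(range(6))           # all states reach D_0 in 5 steps
print(domains)
```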
Hereafter we shall assume that for all $x$ and all $k$ the minimum in the definition of $T V_k$ in Equation (2) is attained.
A Note on Notation
We have denoted the DP iterates by $V_k$. The value function of the optimal control problem has been denoted by $V^\star$ (defined in Equation (1)), which looks disturbingly as if it denotes the limit of $V_k$ as $k \to \infty$. We will denote this limit by $V_\infty$.
Without any assumptions on the problem data we don't know whether the limit $V_\infty$ exists, let alone whether the iteration converges to $V^\star$.
Spoilers: under certain conditions (some weak regularity conditions plus some conditions on $V_0$) we will be able to show that the limit exists and $V_\infty = V^\star$.
Bellman’s Equation
The celebrated Bellman equation states that $V^\star = T V^\star$; this is a functional equation that must be satisfied by $V^\star$; it is a necessary, but not sufficient, condition for optimality. This means that there can be other functions, besides $V^\star$, which are fixed points of $T$ (i.e., they satisfy Bellman's equation).
Under certain conditions, the value iteration converges to a limit, $V_\infty$, which satisfies Bellman's equation. We will state such conditions in the next section.
The following theorem states some important facts about the set of fixed points of the dynamic programming operator, $T$: it shows that $V^\star$ is a "minimal" fixed point of $T$ in the following sense: if $V$ is a nonnegative fixed point of $T$, i.e., $V = T V$, then $V^\star \leq V$.
Theorem 1 (Fixed points of DP operator). Suppose that $V \geq 0$ is a function such that for all $x$ it is
$$V(x) = \min_{u} \left\{ \bar{\ell}(x, u) + V(f(x, u)) \right\},$$
where the minimum is attained and $\kappa(x)$ is a minimiser of the optimisation problem that defines $(T V)(x)$, i.e., $\kappa(x) \in \operatorname*{argmin}_{u} \{ \bar{\ell}(x, u) + V(f(x, u)) \}$. Then
$$V^\star(x) \leq V(x), \text{ for all } x,$$
where $V^\star$ is the value function defined in Equation (1).
Proof. Fix $x_0$ and let $u_t = \kappa(x_t)$, $x_{t+1} = f(x_t, u_t)$, be the closed-loop sequence. Clearly, by the definition of $\kappa$, we have $V(x_t) = \bar{\ell}(x_t, u_t) + V(x_{t+1})$, therefore $V(x_t) - V(x_{t+1}) = \bar{\ell}(x_t, u_t)$, and summing up from $t = 0$ to $t = T-1$ we have
$$V(x_0) - V(x_T) = \sum_{t=0}^{T-1} \bar{\ell}(x_t, u_t),$$
and taking $T \to \infty$, using $V(x_T) \geq 0$, we have
$$V(x_0) \geq \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t).$$
Since the right-hand side is then finite (whenever $V(x_0) < \infty$), the sequence $(u_t)_{t \in \mathbb{N}}$ is admissible, so
$$V(x_0) \geq \sum_{t=0}^{\infty} \bar{\ell}(x_t, u_t) = J(x_0, u) \geq V^\star(x_0),$$
which completes the proof.
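Theorem 1 can be illustrated on a hypothetical four-state toy problem (illustrative data as before: clipped line dynamics, stage cost $\ell(x,u) = x$, for which $V^\star = (0, 1, 3, 6)$). In a deterministic problem, shifting $V^\star$ by a positive constant yields another nonnegative fixed point of $T$, and it dominates $V^\star$, consistent with the theorem:

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

Vstar = [0.0, 1.0, 3.0, 6.0]   # value function of this toy problem
assert T(Vstar) == Vstar        # V* satisfies Bellman's equation

# Shifting by a constant gives another nonnegative fixed point of T ...
V = [v + 2.0 for v in Vstar]
assert T(V) == V
# ... and, consistently with Theorem 1, it dominates V*.
assert all(vs <= v for vs, v in zip(Vstar, V))
```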
Convergence of Value Iteration
In this last section we will show that the DP iterates $V_k$ converge to the value function $V^\star$ of the infinite-horizon optimal control problem when $V_0$ is nonnegative and upper bounded by $V^\star$. This is the case, for example, if $V_0 = 0$.
The theorem is proven by leveraging the monotonicity property of $T$ and Theorem 1.
Theorem 2 (Convergence of DP iterations). Suppose that
$$\ell(x, u) \geq 0, \text{ for all } x \in X,\ u \in U(x),$$
where $f$ is continuous and the set $Z = \{(x, u) : x \in X,\ u \in U(x)\}$ is closed. If,
$$0 \leq V_0 \leq V^\star,$$
then $V_\infty = \lim_{k \to \infty} V_k$ exists. If, additionally, $V^\star$ is a fixed point of $T$, then $V_\infty(x) = V^\star(x)$ for all $x$.
Proof. Let us first prove this for $V_0 = 0$ (i.e., $V_0(x) = 0$ for all $x$ and $\operatorname{dom} V_0 = \mathbb{R}^{n_x}$). Then $V_1 = T V_0 \geq V_0$ and recursively $V_{k+1} \geq V_k$. Moreover, $V_k \leq V^\star$ for all $k$: $V_k$ is the optimal cost of a $k$-horizon truncation with zero terminal cost, and $\bar{\ell} \geq 0$. We have an increasing bounded sequence, therefore, it converges, i.e., the limit $V_\infty = \lim_{k \to \infty} V_k$ exists and $V_\infty \leq V^\star$ [for details, see D. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2, Athena Scientific, 2007; Chapter 3].
From Theorem 1, this implies that $V^\star \leq V_\infty$ (this is because $V_\infty$ satisfies Bellman's equation), so $V_\infty = V^\star$. This proves the assertion when $V_0 = 0$.
Let us denote the above sequence of functions by $V_k^{0}$, that is, the sequence generated by Equation (2) with $V_0 = 0$. We use this notation to indicate that the initial function is taken to be equal to zero.
Suppose that $V_0$ is any other function that satisfies the requirements of the theorem, that is, $0 \leq V_0 \leq V^\star$. Then, $V_0^{0} \leq V_0 \leq V^\star$.
From the monotonicity of $T$ we can show that
$$V_k^{0} \leq V_k \leq T^k V^\star = V^\star, \text{ for all } k.$$
Taking the limit as $k \to \infty$ completes the proof (by virtue of the sandwich theorem aka squeeze theorem).
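The sandwich argument can be watched numerically. Below is a sketch on the hypothetical four-state toy problem used earlier (illustrative data; $V^\star = (0, 1, 3, 6)$), iterating from $V_0 = 0$ and from another admissible choice $V_0 = V^\star / 2$:

```python
N = 4
U = (-1, 0, 1)
f = lambda x, u: max(0, min(N - 1, x + u))   # clipped line dynamics
ell = lambda x, u: x                          # nonnegative stage cost

def T(V):
    """DP operator (TV)(x) = min_u { ell(x,u) + V(f(x,u)) }."""
    return [min(ell(x, u) + V[f(x, u)] for u in U) for x in range(N)]

Vstar = [0.0, 1.0, 3.0, 6.0]   # value function of this toy problem
A = [0.0] * N                  # the iterates V_k^0, starting from V_0 = 0
B = [v / 2 for v in Vstar]     # another V_0 with 0 <= V_0 <= V*

for _ in range(10):
    # Sandwich: T^k 0 <= T^k V_0 <= T^k V* = V* at every iteration.
    assert all(a <= b <= v for a, b, v in zip(A, B, Vstar))
    A, B = T(A), T(B)

# Both sequences converge to V*, as the squeeze argument predicts.
assert A == Vstar and B == Vstar
```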