Convergence of the iterates of the gradient method with constant stepsize

The gradient method with constant step size is the simplest method for solving unconstrained optimisation problems involving a continuously differentiable function with Lipschitz-continuous gradient. The motivation for this post came from reading this Wikipedia article, where it is stated that under certain assumptions the sequence \{x_k\} converges to a local optimum, but no further discussion is provided.
The gradient method with constant step size is defined as

\begin{aligned} x_{k+1} = x_{k} - \frac{1}{L}\nabla f(x_k) \end{aligned}

where L is a Lipschitz constant of \nabla f. For this method, the following facts are well known (a numerical sanity check of both is given right after the list):

  1. If f is continuously differentiable with L-Lipschitz gradient (but not necessarily convex), then
    \begin{aligned} \min_{k=0,\ldots, N}\|\nabla f(x_k)\|^2 \leq \frac{2L}{N+1}(f(x_0)-f^*) \end{aligned}
    where f^* is the infimum of f (the function may not even have a minimiser).
  2. If, in addition, f is convex and its set of minimisers is nonempty, then
    \begin{aligned} f(x_k) - f^* \leq \frac{L}{2k}\|x_0 - x^*\|^2. \end{aligned}
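Before discussing what these bounds do and do not imply, here is a quick numerical sanity check of both of them. This is only an illustrative sketch of my own (the strongly convex quadratic test problem and all names in the code are arbitrary choices), not part of any argument in this post.

```python
import numpy as np

# Convex quadratic f(x) = 0.5 x'Ax - b'x; grad f(x) = Ax - b, and the largest
# eigenvalue of A is a Lipschitz constant of the gradient.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 10))
A = M.T @ M                                  # positive definite Hessian (a.s.)
b = rng.standard_normal(10)

def f(x): return 0.5 * x @ A @ x - b @ x
def grad(x): return A @ x - b

L = np.linalg.eigvalsh(A).max()
x_star = np.linalg.solve(A, b)               # the unique minimiser
f_star = f(x_star)

x0 = np.zeros(10)
x, best_grad_sq = x0, np.inf
for k in range(1, 201):
    best_grad_sq = min(best_grad_sq, np.linalg.norm(grad(x)) ** 2)
    x = x - grad(x) / L                      # x_k = x_{k-1} - (1/L) grad f(x_{k-1})
    # Fact 1 with N = k-1: min_{j<k} ||grad f(x_j)||^2 <= 2L (f(x_0) - f^*) / k
    assert best_grad_sq <= 2 * L * (f(x0) - f_star) / k + 1e-9
    # Fact 2 (convex case): f(x_k) - f^* <= L ||x_0 - x^*||^2 / (2k)
    assert f(x) - f_star <= L * np.linalg.norm(x0 - x_star) ** 2 / (2 * k) + 1e-9
```

Neither assertion fires, i.e. both bounds hold at every iteration on this example (as they must).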

Regarding the first bound, we should underline that it only yields \liminf_{k\to\infty} \|\nabla f(x_k)\|^2 = 0, which does not mean that \|\nabla f(x_k)\| converges to zero. In the second case, where f is convex, we similarly should not deduce that \{x_k\} converges to a point x^*\in\mathrm{argmin} f, nor should we conclude that \{x_k\} approaches the set \mathrm{argmin} f in the sense that d(x_k, \mathrm{argmin} f) \to 0 (where d is the distance-to-set function).

Nevertheless, we may show that the sequence \{x_k\} is Fejér monotone with respect to \mathrm{argmin} f, which will allow us to prove that, indeed, d(x_k, \mathrm{argmin} f) \to 0 as k\to\infty.

Definition (Fejér sequence). We say that a sequence \{x_k\} is Fejér monotone with respect to a set C if

\begin{aligned} \|x_{k+1}-x\|^2 \leq \|x_{k} - x\|^2, \end{aligned}

for all x \in C and k \in \mathbb{N}.

Fejér monotone sequences possess several favourable properties: they are bounded, \{\|x_k-x\|\} converges for every x\in C, and d(x_k, C) is nonincreasing and converges.
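Indeed, all three follow directly from the definition: boundedness because \|x_k - x\| \leq \|x_0 - x\| for any fixed x \in C, convergence of \{\|x_k-x\|\} because it is nonincreasing and bounded below, and, taking the infimum over x\in C in the Fejér inequality,

\begin{aligned} d(x_{k+1}, C) = \inf_{x\in C}\|x_{k+1}-x\| \leq \inf_{x\in C}\|x_{k}-x\| = d(x_k, C). \end{aligned}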

Proposition (Fejér monotonicity). Suppose that f is convex, continuously differentiable with L-Lipschitz gradient, and that its set of minimisers is nonempty. Then the sequence \{x_k\} produced by the gradient method with constant step size 1/L is Fejér monotone with respect to \mathrm{argmin} f.

Proof. Take any minimiser x^*. Then

\begin{aligned} \|x_{k+1}-x^*\|^2 &= \left\|x_k - \frac{1}{L} \nabla f(x_k) - x^*\right\|^2 \\ &= \|x_k-x^*\|^2 + \frac{1}{L^2}\|\nabla f(x_k)\|^2 - \frac{2}{L} \langle \nabla f(x_k), x_k-x^*\rangle. \end{aligned}

Since f is convex and \nabla f(x^*) = 0,

\begin{aligned} \langle \nabla f(x_k), x_k -x^*\rangle = \langle \nabla f(x_k) - \nabla f(x^*), x_k -x^*\rangle \geq 0, \end{aligned}

and since f is convex with L-Lipschitz gradient, \nabla f is 1/L-cocoercive (Baillon–Haddad theorem), so

\begin{aligned} \|\nabla f(x_k)\|^2 &= \| \nabla f(x_k) - \nabla f(x^*)\|^2\\ &\leq L \langle \nabla f(x_k) - \nabla f(x^*), x_k - x^*\rangle = L \langle \nabla f(x_k), x_k - x^*\rangle, \end{aligned}

therefore,

\begin{aligned} \frac{1}{L^2}\|\nabla f(x_k)\|^2 \leq \frac{1}{L} \langle \nabla f(x_k), x_k - x^*\rangle. \end{aligned}

So,

\begin{aligned} \|x_{k+1}-x^*\|^2 \leq \|x_k - x^*\|^2 - \frac{1}{L} \langle \nabla f(x_k), x_k - x^*\rangle \leq \|x_k - x^*\|^2. \quad \Box \end{aligned}

The most important property of Fejér monotone sequences with respect to a set C is that if a cluster point of the sequence belongs to C, then the whole sequence converges to that point. In our case, \{x_k\} is bounded, so it has cluster points; these cluster points are stationary (the descent lemma gives f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2, so \|\nabla f(x_k)\| \to 0), and because f is convex they are minimisers of our function. As a result, the whole sequence converges to a minimiser x^*.
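To see this convergence in action, here is a small numerical sketch (again only illustrative; the consistent underdetermined least-squares problem below, whose set of minimisers is a whole affine subspace, is just a convenient choice of mine):

```python
import numpy as np

# Convex least squares f(x) = 0.5 ||Mx - b||^2 with a wide matrix M and a
# consistent right-hand side, so argmin f = {x : Mx = b} is an affine subspace.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 12))                 # 5 equations, 12 unknowns
b = M @ rng.standard_normal(12)                  # guarantees consistency
M_pinv = np.linalg.pinv(M)

def grad(x): return M.T @ (M @ x - b)            # gradient of f

def dist_to_argmin(x):                           # d(x, argmin f), via projection onto {Mx = b}
    return np.linalg.norm(M_pinv @ (M @ x - b))

L = np.linalg.eigvalsh(M.T @ M).max()            # Lipschitz constant of grad f
x_star = M_pinv @ b                              # one particular minimiser

x = rng.standard_normal(12)
prev_norm = prev_dist = np.inf
for k in range(500):
    # Fejér monotonicity: ||x_k - x^*|| is nonincreasing for this minimiser...
    assert np.linalg.norm(x - x_star) <= prev_norm + 1e-12
    # ...and so is the distance to the whole set of minimisers.
    assert dist_to_argmin(x) <= prev_dist + 1e-12
    prev_norm, prev_dist = np.linalg.norm(x - x_star), dist_to_argmin(x)
    x = x - grad(x) / L                          # constant step size 1/L

print(dist_to_argmin(x))                         # numerically zero: x_k approaches argmin f
```

Note that the iterates converge to a minimiser, but not necessarily to x^* = M^+ b; Fejér monotonicity holds with respect to every point of \mathrm{argmin} f simultaneously.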

The Fejér condition can be weakened. We may take \{x_k\} to be a quasi-Fejér sequence, that is

\begin{aligned} \|x_{k+1}-x\|^2 \leq \|x_{k} - x\|^2 + \epsilon_k, \end{aligned}

where \{\epsilon_k\} is a summable sequence of nonnegative scalars.

We may also generalise the above result to the case where, instead of the constant step size 1/L, we use a variable step size h_k which satisfies the Armijo condition.
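As a rough illustration of such a variant, here is a sketch of the gradient method with a backtracking line search enforcing one standard form of the Armijo condition (the parameter values, function names and test problem are illustrative choices of mine, not taken from the references below):

```python
import numpy as np

def gradient_descent_armijo(f, grad, x0, sigma=1e-4, beta=0.5, h0=1.0, iters=100):
    """Gradient method with a backtracking (Armijo) step size: accept h as soon as
    f(x - h g) <= f(x) - sigma * h * ||g||^2, otherwise shrink h by the factor beta."""
    x = x0
    for _ in range(iters):
        g = grad(x)
        h = h0
        while f(x - h * g) > f(x) - sigma * h * np.dot(g, g):
            h *= beta                            # backtrack until the Armijo condition holds
        x = x - h * g
    return x

# Usage on a small convex quadratic (illustrative only): minimiser is A^{-1} b = [1, -0.2].
A = np.diag([1.0, 10.0])
b = np.array([1.0, -2.0])
x_min = gradient_descent_armijo(lambda x: 0.5 * x @ A @ x - b @ x,
                                lambda x: A @ x - b,
                                np.zeros(2))
print(x_min)
```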

Further reading:

  • H.H. Bauschke and P.L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Springer, 2011
  • A.N. Iusem, On the convergence properties of the projected gradient method for convex optimization, Computational and Applied Mathematics 22(1):37-52, 2003.
  • L.M. Graña Drummond and B.F. Svaiter, A steepest descent method for vector optimization, Journal of Computational and Applied Mathematics 175(2):395-414, 2005.