Dennis' Optimization Notes

Notes on various riffs on Gradient Descent, from the perspective of neural networks.

A review of standard Gradient Descent

The goal of Gradient Descent is to minimize a loss function $L$. To be more specific, if $L \colon \mathbb{R}^n \to \mathbb{R}$ is a differentiable multivariate function, we want to find the vector $w$ that minimizes $L(w)$.

Given an initial vector $w_0$, we want to “move” in a direction $v$ for which $L(w_0 + v)$ is minimized (suppose the magnitude of $v$ is fixed). To first order, $L(w_0 + v) \approx L(w_0) + \langle \nabla L(w_0), v \rangle$, and by Cauchy’s Inequality this inner product is minimized precisely when $v$ is in the direction of $-\nabla L(w_0)$.

So given some $w_t$, we want to update in the direction of $-\nabla L(w_t)$. This motivates setting $w_{t+1} = w_t - \eta \nabla L(w_t)$, where $\eta > 0$ is a scalar factor. We call $\eta$ the “learning rate” because it affects how fast the sequence $(w_t)$ converges to the optimum. The main trouble in machine learning is tweaking $\eta$ to what “works best” in ensuring convergence, and that is one of the considerations that the remaining algorithms try to address.
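
To make this concrete, here is a minimal sketch of the update rule in Python with NumPy. The function name and the toy loss $L(w) = \lVert w \rVert^2$ are illustrative choices for this page, not part of the algorithm itself:

    import numpy as np

    def gradient_descent(grad_L, w0, eta=0.1, steps=100):
        """Iterate w_{t+1} = w_t - eta * grad_L(w_t) for a fixed number of steps."""
        w = np.asarray(w0, dtype=float)
        for _ in range(steps):
            w = w - eta * grad_L(w)
        return w

    # Toy example: L(w) = ||w||^2 has gradient 2w, with its minimum at the origin.
    w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])

If $\eta$ is too large here, the iterates oscillate or diverge; if too small, convergence is needlessly slow. That is exactly the tuning problem described above.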

Stochastic Gradient Descent

In practice we don’t actually know the “true gradient” $\nabla L$. So instead we take some datasets, say datasets $1$ through $n$, and for dataset $i$ we derive an estimated gradient $\nabla L_i$. Then we may estimate $\nabla L$ as

$$\nabla L \approx \frac{1}{n} \sum_{i=1}^{n} \nabla L_i.$$

If the $\nabla L_i$ are easy to compute in general then we are golden: this is the best estimate of $\nabla L$ we can get. But what if the $\nabla L_i$ are computationally expensive to compute? Then there is a tradeoff between variance and computational cost when evaluating our estimate of $\nabla L$.

A very low-cost (but low-accuracy) way to estimate $\nabla L$ is just via $\nabla L_1$ (or any other single $\nabla L_i$). But this is obviously problematic: we aren’t even using most of our data! A better balance can be struck as follows: to evaluate the estimate at step $t$, select $m$ functions at random from $L_1, \dots, L_n$. Then estimate $\nabla L(w_t)$ as the average of those $m$ gradients only at that step.
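
A sketch of this minibatch scheme, assuming we are handed the per-dataset gradients as a list of callables (the name grad_Ls and the batch size $m = 32$ are illustrative defaults, not prescribed here):

    import random
    import numpy as np

    def sgd(grad_Ls, w0, eta=0.01, m=32, steps=1000):
        """Each step averages m randomly chosen per-dataset gradients."""
        w = np.asarray(w0, dtype=float)
        for _ in range(steps):
            batch = random.sample(grad_Ls, min(m, len(grad_Ls)))
            g = sum(grad_Li(w) for grad_Li in batch) / len(batch)
            w = w - eta * g  # same update as before, with the estimated gradient
        return w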

Riffs on stochastic gradient descent

Momentum

See also “Momentum” on Distill.

In typical stochastic gradient descent, the next step we take is based solely on the gradient at the current point. This completely ignores the past gradients. However, it often makes sense to take the past gradients into account. Of course, if we are at $w_t$, we should care about $\nabla L(w_t)$ much more heavily than $\nabla L(w_0)$. So we should weight $\nabla L(w_t)$ much more than $\nabla L(w_0)$.

The simplest way is to weight the past gradients geometrically. So when we iterate, instead of taking $w_{t+1}$ to satisfy

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

like in standard gradient descent, we instead want to take $w_{t+1}$ to satisfy

$$w_{t+1} = w_t - \eta \sum_{s=0}^{t} \beta^{t-s} \nabla L(w_s),$$

where $0 < \beta < 1$ is a fixed decay parameter.

But this raises a concern: are we really going to be storing all of these terms, especially as $t$ grows? Fortunately, we do not need to. For we may notice that

$$\sum_{s=0}^{t} \beta^{t-s} \nabla L(w_s) = \beta \sum_{s=0}^{t-1} \beta^{(t-1)-s} \nabla L(w_s) + \nabla L(w_t).$$

To put it another way, if we write $z_{t+1} = \frac{w_t - w_{t+1}}{\eta}$, i.e. how much $w_{t+1}$ differs from $w_t$ by (scaled by $\eta$), we may rewrite this equation as

$$z_{t+1} = \beta z_t + \nabla L(w_t), \qquad w_{t+1} = w_t - \eta z_{t+1}.$$
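
In code, this is why momentum is cheap: we only ever store one extra vector $z$. A minimal sketch in the same style as the snippets above ($\beta = 0.9$ is a conventional but tunable default):

    import numpy as np

    def momentum_descent(grad_L, w0, eta=0.01, beta=0.9, steps=1000):
        """z_{t+1} = beta * z_t + grad_L(w_t);  w_{t+1} = w_t - eta * z_{t+1}."""
        w = np.asarray(w0, dtype=float)
        z = np.zeros_like(w)  # z_0 = 0, so the first step is plain gradient descent
        for _ in range(steps):
            z = beta * z + grad_L(w)
            w = w - eta * z
        return w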

Some of the benefits of using a momentum-based approach:

  • most importantly, it can dramatically speed up convergence to a local minimum;
  • it makes convergence more likely in general;
  • it can help escape local minima, saddle points, and plateaus (though its importance here is possibly contested? See this reddit thread).

RMSProp

Gradient descent also often suffers from diminishing step sizes. In order to counter this, we very broadly want to:

  • track the magnitudes of the past gradients,
  • and if they have been low, multiply by a scalar to increase the effective learning rate.
  • (As a side effect, if our past gradients have been quite large, we will temper the learning rate.)

While performing our gradient descent to get $w_{t+1}$, we create and store an auxiliary parameter $v_t$ as follows:

$$v_{t+1} = \gamma v_t + (1 - \gamma) \nabla L(w_t)^{2}$$

and define

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_{t+1}} + \epsilon} \nabla L(w_t),$$

where as usual $\eta$ is the learning rate, $\gamma$ is the decay rate of $v_t$, $\epsilon$ is a small constant that also needs to be fine-tuned, and the square and square root are taken elementwise.

We include the constant term $\epsilon$ in order to ensure that the sequence actually converges and to ensure numerical stability (we never divide by zero). If we are near the minimum, then $v_{t+1}$ will be quite small, meaning the denominator will essentially just become $\epsilon$. But because $w_t$ will converge when $\nabla L(w_t)$ is just multiplied by a constant (this is the underlying assumption of standard gradient descent, after all), we will achieve convergence near a minimum.

Side note: in order to get RMSProp to interoperate with stochastic gradient descent, we instead compute the sequence $v_t$ using the sampled gradients $\nabla L_i(w_t)$ of each approximated loss function $L_i$ in place of $\nabla L(w_t)$.
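
A sketch of the full RMSProp loop under the same conventions as the earlier snippets ($\gamma = 0.9$ and $\epsilon = 10^{-8}$ are common defaults, but as noted above they still need tuning):

    import numpy as np

    def rmsprop(grad_L, w0, eta=0.001, gamma=0.9, eps=1e-8, steps=1000):
        """Keep an elementwise running average v of squared gradients."""
        w = np.asarray(w0, dtype=float)
        v = np.zeros_like(w)
        for _ in range(steps):
            g = grad_L(w)
            v = gamma * v + (1 - gamma) * g**2    # v_{t+1}
            w = w - eta / (np.sqrt(v) + eps) * g  # per-coordinate step size
        return w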

Adam

Adam (Adaptive Moment Estimation) is a gradient descent modification that combines Momentum and RMSProp. We create two auxiliary variables while iterating (where $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are decay parameters that need to be fine-tuned, and $\epsilon$ is a parameter serving the same purpose as in RMSProp):

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla L(w_t)$$
$$v_{t+1} = \beta_2 v_t + (1 - \beta_2) \nabla L(w_t)^{2}$$

For notational convenience, we will define the bias-corrected versions

$$\widehat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{\,t+1}}, \qquad \widehat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{\,t+1}}.$$

Then our update to get $w_{t+1}$ is

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\widehat{v}_{t+1}} + \epsilon} \, \widehat{m}_{t+1}.$$

It is worth noting that though this formula does not explicitly include $\nabla L(w_t)$, it is accounted for in the $\widehat{m}_{t+1}$ term through $m_{t+1}$.
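
Putting the pieces together, here is a sketch of the Adam loop in the same style as the earlier snippets ($\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ are the defaults suggested in the original Adam paper, and remain tunable):

    import numpy as np

    def adam(grad_L, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
        """Momentum on the gradient (m) combined with RMSProp on its square (v)."""
        w = np.asarray(w0, dtype=float)
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        for t in range(1, steps + 1):
            g = grad_L(w)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
            v_hat = v / (1 - beta2**t)
            w = w - eta / (np.sqrt(v_hat) + eps) * m_hat
        return w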