Dennis' Optimization Notes
Notes on various riffs on Gradient Descent from the perspective of neural networks.
A review of standard Gradient Descent
The goal of Gradient Descent is to minimize a loss function $L$. To be more specific, if $L \colon \mathbb{R}^n \to \mathbb{R}$ is a differentiable multivariate function, we want to find the vector $\theta$ that minimizes $L(\theta)$.
Given an initial vector $\theta_0$, we want to "move" in a direction $\delta$ where $L(\theta_0 + \delta)$ is minimized (suppose the magnitude of $\delta$ is fixed). To first order, $L(\theta_0 + \delta) \approx L(\theta_0) + \delta \cdot \nabla L(\theta_0)$, and by Cauchy's Inequality this is minimized precisely when $\delta$ is in the direction of $-\nabla L(\theta_0)$.
So given some $\theta_t$, we want to update in the direction of $-\nabla L(\theta_t)$. This motivates setting

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t),$$

where $\eta$ is a scalar factor. We call $\eta$ the "learning rate" because it affects how fast the sequence converges to the optimum. The main trouble in machine learning is tuning $\eta$ to whatever "works best" in ensuring convergence, and that is one of the considerations that the remaining algorithms try to address.
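As a concrete illustration, here is a minimal sketch of this update rule in Python (the function names and the toy quadratic loss are my own, purely for demonstration):

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, num_steps=100):
    """Iterate theta_{t+1} = theta_t - eta * grad_L(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Toy example: L(theta) = ||theta - c||^2 has gradient 2 * (theta - c),
# so the iterates should converge to c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda th: 2 * (th - c), theta0=np.zeros(2)))  # ~[3, -1]
```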
Stochastic Gradient Descent
In practice we don't actually know the "true gradient" $\nabla L$. So instead we take some datasets, say datasets $1$ through $n$, and from dataset $i$ we derive an estimated gradient $\nabla L_i$. Then we may estimate $\nabla L$ as

$$\nabla L \approx \frac{1}{n} \sum_{i=1}^{n} \nabla L_i.$$
If it is easy to compute the $\nabla L_i$ in general then we are golden: this is the best estimate of $\nabla L$ we can get. But what if the $\nabla L_i$ are computationally expensive to compute? Then there is a tradeoff between variance and computational cost when evaluating our estimate of $\nabla L$.
A very low-cost (but low-accuracy) way to estimate $\nabla L$ is just via $\nabla L_1$ (or any other single $\nabla L_i$). But this is obviously problematic: we aren't even using most of our data! A better balance can be struck as follows: to evaluate $\theta_{t+1}$, select $k$ of the functions at random from $\nabla L_1, \dots, \nabla L_n$. Then estimate $\nabla L$ as the average of those $k$ functions only at that step.
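A minimal sketch of this sampling scheme in Python (the names are again my own; `per_example_grads[i]` stands in for $\nabla L_i$):

```python
import numpy as np

def sgd(per_example_grads, theta0, eta=0.1, k=32, num_steps=1000, seed=0):
    """Stochastic gradient descent: at each step, estimate the true gradient
    by averaging k randomly chosen per-dataset gradients."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(per_example_grads)
    for _ in range(num_steps):
        batch = rng.choice(n, size=min(k, n), replace=False)
        grad_est = np.mean([per_example_grads[i](theta) for i in batch], axis=0)
        theta = theta - eta * grad_est
    return theta
```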
Riffs on stochastic gradient descent
Momentum
See also “Momentum” on Distill.
In typical stochastic gradient descent, the next step we take is based solely on the gradient at the current point. This completely ignores the past gradients. However, many times it makes sense to take the past gradients into account. Of course, if we are at $\theta_t$, we should care about $\nabla L(\theta_t)$ much more heavily than, say, $\nabla L(\theta_0)$. So we should weight recent gradients much more than older ones.
The simplest way is to weight them geometrically. So when we iterate, instead of taking $\theta_{t+1}$ to satisfy

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

like in standard gradient descent, we instead want to take $\theta_{t+1}$ to satisfy

$$\theta_{t+1} = \theta_t - \eta \sum_{s=0}^{t} \beta^{t-s} \nabla L(\theta_s)$$

for some decay factor $0 < \beta < 1$.
But this raises a concern: are we really going to be storing all of these gradient terms, especially as $t$ grows? Fortunately, we do not need to. For we may notice that

$$\sum_{s=0}^{t} \beta^{t-s} \nabla L(\theta_s) = \nabla L(\theta_t) + \beta \sum_{s=0}^{t-1} \beta^{(t-1)-s} \nabla L(\theta_s).$$
To put it another way, if we write $v_t = \theta_t - \theta_{t+1}$, i.e. how much $\theta_{t+1}$ differs from $\theta_t$ by, we may rewrite this equation as

$$v_t = \eta \nabla L(\theta_t) + \beta v_{t-1},$$

so each update only needs the current gradient and the previous step $v_{t-1}$.
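A minimal sketch of this recurrence in Python (assuming the same toy setup as before; `beta` is the geometric decay factor above):

```python
import numpy as np

def gd_momentum(grad_L, theta0, eta=0.01, beta=0.9, num_steps=200):
    """Momentum update: v_t = eta * grad_L(theta_t) + beta * v_{t-1},
    then theta_{t+1} = theta_t - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        v = eta * grad_L(theta) + beta * v  # geometrically weighted past gradients
        theta = theta - v
    return theta
```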
Some of the benefits of using a momentum-based approach:
- most importantly, it can dramatically speed up convergence to a local minimum;
- it makes convergence more likely in general;
- it can help escape local minima, saddle points, and plateaus (though its importance here is possibly contested; see this reddit thread).
RMSProp
Gradient descent also often suffers from diminishing effective learning rates. In order to counter this, we very broadly want to:
- track the magnitudes of the past gradients,
- and if they have been low, multiply by a scalar to increase the effective learning rate.
- (As a side effect, if our past gradients have been quite large, we will temper the learning rate.)
While performing our gradient descent to get $\theta_t$, we create and store an auxiliary parameter $v_t$ as follows:

$$v_t = \gamma v_{t-1} + (1 - \gamma) \left( \nabla L(\theta_t) \right)^2$$

(where the square is taken elementwise), and define

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \nabla L(\theta_t),$$

where as usual $\eta$ is the learning rate, $\gamma$ is the decay rate of $v_t$, and $\epsilon$ is a constant that also needs to be fine-tuned.
We include the constant term $\epsilon$ in order to ensure that the sequence $(\theta_t)$ actually converges and to ensure numerical stability (we never divide by zero). If we are near the minimum, then $v_t$ will be quite small, meaning the denominator will essentially just become $\epsilon$. But because $\theta_t$ will converge when $\nabla L(\theta_t)$ is just multiplied by a constant (this is the underlying assumption of standard gradient descent, after all), we will achieve convergence when near a minimum.
Side note: in order to get RMSProp to interoperate with stochastic gradient descent, we instead compute the sequence $v_t$ from the gradients of the approximated loss functions $\nabla L_i$ sampled at each step.
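A minimal sketch of RMSProp in Python under the formulation above (the default hyperparameter values here are common choices, not prescribed by these notes):

```python
import numpy as np

def rmsprop(grad_L, theta0, eta=0.001, gamma=0.9, eps=1e-8, num_steps=1000):
    """RMSProp: scale each coordinate's step by a running average
    of that coordinate's squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_L(theta)
        v = gamma * v + (1 - gamma) * g**2    # v_t = gamma*v_{t-1} + (1-gamma)*g^2
        theta = theta - eta / (np.sqrt(v) + eps) * g
    return theta
```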
Adam
Adam (Adaptive Moment Estimation) is a gradient descent modification that combines Momentum and RMSProp. We create two auxiliary variables while iterating (where $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are decay parameters that need to be fine-tuned, and $\epsilon$ is a parameter serving the same purpose as in RMSProp):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta_t)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \nabla L(\theta_t) \right)^2$$
For notational convenience, we will define the bias-corrected variables

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
Then our update rule to get $\theta_{t+1}$ is

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t.$$
It is worth noting that though this formula does not explicitly include $\nabla L(\theta_t)$, it is accounted for in the $\hat{m}_t$ term through $m_t$.
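Putting it together, a minimal Python sketch of Adam as described above (the default hyperparameters are the commonly used values, an assumption on my part):

```python
import numpy as np

def adam(grad_L, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
         num_steps=1000):
    """Adam: momentum-style first moment m, RMSProp-style second moment v,
    both bias-corrected before the update."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, num_steps + 1):
        g = grad_L(theta)
        m = beta1 * m + (1 - beta1) * g       # Momentum part
        v = beta2 * v + (1 - beta2) * g**2    # RMSProp part
        m_hat = m / (1 - beta1**t)            # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta
```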