Dennis' Optimization Notes

Notes on various riffs on Gradient Descent, from a neural-networks perspective.

A review of standard Gradient Descent

The goal of Gradient Descent is to minimize a loss function $L$. To be more specific, if $L$ is a differentiable multivariate function, we want to find the vector $w$ that minimizes $L(w)$.

Given an initial vector $w_0$, we want to “move” in the direction $\Delta w$ for which $L(w_0 + \Delta w)$ is minimized (suppose the magnitude of $\Delta w$ is fixed). To first order, $L(w_0 + \Delta w) \approx L(w_0) + \nabla L(w_0) \cdot \Delta w$, and by Cauchy’s Inequality this dot product is minimized precisely when $\Delta w$ points in the direction of $-\nabla L(w_0)$.

So given some $w_n$, we want to update in the direction of $-\nabla L(w_n)$. This motivates setting $w_{n+1} = w_n - \eta \nabla L(w_n)$, where $\eta$ is a scalar factor. We call $\eta$ the “learning rate” because it affects how fast the sequence $(w_n)$ converges to the optimum. Much of the trouble in machine learning is tweaking $\eta$ to whatever “works best” for ensuring convergence, and that is one of the considerations the remaining algorithms try to address.
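
As a concrete sketch, here is the update rule in NumPy (the quadratic example loss and the hyperparameter defaults are illustrative choices, not anything prescribed above):

```python
import numpy as np

def gradient_descent(grad_L, w0, eta=0.1, steps=100):
    """Vanilla gradient descent: w_{n+1} = w_n - eta * grad_L(w_n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad_L(w)
    return w

# Example: L(w) = ||w||^2 has gradient 2w, so the iterates shrink toward 0.
print(gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0]))  # ~[0. 0.]
```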

Stochastic Gradient Descent

In practice we don’t actually know the “true gradient”. So instead we take some datasets, say datasets $1$ through $k$, and for dataset $i$ we derive an estimated gradient $\nabla L_i$. Then we may estimate $\nabla L$ as

$$\nabla L \approx \frac{1}{k} \sum_{i=1}^{k} \nabla L_i.$$

If it is easy to compute the $\nabla L_i$ in general, then we are golden: this is the best estimate of $\nabla L$ we can get. But what if the $\nabla L_i$ are computationally expensive? Then there is a tradeoff between variance and computational cost when evaluating our estimate of $\nabla L$.

A very low-cost (but low-accuracy) way to estimate $\nabla L$ is just via $\nabla L_1$ (or any other single $\nabla L_i$). But this is obviously problematic: we aren’t even using most of our data! A better balance can be struck as follows: to evaluate $\nabla L(w_n)$, select $m$ of the functions $L_1, \dots, L_k$ at random, and estimate $\nabla L(w_n)$ as the average of those $m$ gradients at that step only.
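
A sketch of this sampling scheme, assuming the per-dataset gradients $\nabla L_i$ are available as a list of callables (`grad_fns` is a hypothetical name) and that $m \le k$:

```python
import numpy as np

def minibatch_sgd(grad_fns, w0, eta=0.01, m=8, steps=1000, seed=0):
    """Each step averages the gradients of m randomly chosen L_i
    as a cheap, noisy estimate of the full gradient of L."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        batch = rng.choice(len(grad_fns), size=m, replace=False)
        g = sum(grad_fns[i](w) for i in batch) / m  # average over the batch
        w = w - eta * g
    return w
```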

Riffs on stochastic gradient descent

Momentum

See also “Momentum” on Distill.

In typical stochastic gradient descent, the next step we take is based solely on the gradient at the current point; this completely ignores the past gradients. However, it often makes sense to take the past gradients into account. Of course, if we are at $w_n$, we should care about $\nabla L(w_n)$ much more than $\nabla L(w_0)$, so we should weight $\nabla L(w_n)$ much more heavily than $\nabla L(w_0)$.

The simplest way is to weight the past gradients geometrically, with some decay factor $\beta \in (0, 1)$. So when we iterate, instead of taking $w_{n+1}$ to satisfy

$$w_{n+1} = w_n - \eta \nabla L(w_n)$$

like in standard gradient descent, we instead want to take $w_{n+1}$ to satisfy

$$w_{n+1} = w_n - \eta \sum_{i=0}^{n} \beta^{n-i} \nabla L(w_i).$$

But this raises a concern: are we really going to store all of these terms, especially as $n$ grows? Fortunately, we do not need to, for we may notice that

$$\sum_{i=0}^{n} \beta^{n-i} \nabla L(w_i) = \nabla L(w_n) + \beta \sum_{i=0}^{n-1} \beta^{(n-1)-i} \nabla L(w_i).$$

To put it another way, if we write $v_n = \frac{w_n - w_{n+1}}{\eta}$, i.e. the step from $w_n$ to $w_{n+1}$ rescaled by $\eta$, we may rewrite this equation as

$$v_n = \nabla L(w_n) + \beta v_{n-1}.$$
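
In code, this recurrence needs only one extra vector of state. A minimal NumPy sketch (hyperparameter defaults are illustrative):

```python
import numpy as np

def sgd_momentum(grad_L, w0, eta=0.01, beta=0.9, steps=1000):
    """Momentum: v_n = grad_L(w_n) + beta * v_{n-1}; w_{n+1} = w_n - eta * v_n."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)  # v_{-1} = 0
    for _ in range(steps):
        v = grad_L(w) + beta * v  # the recurrence above
        w = w - eta * v
    return w
```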

Some of the benefits of using a momentum-based approach:

  • most importantly, it can dramatically speed up convergence to a local minimum;
  • it makes convergence more likely in general;
  • it helps with escaping local minima, saddle points, and plateaus (though its importance here is possibly contested; see this reddit thread).

RMSProp

Gradient descent also often suffers from diminishing effective learning rates. In order to counter this, we very broadly want to:

  • track the magnitudes of the past gradients,
  • and if they have been low, multiply $\eta$ by a scalar to increase the effective learning rate,
  • (as a side effect, if the past gradients have been quite large, we will temper the learning rate instead).

While performing our gradient descent to get $w_n$, we create and store an auxiliary parameter $v_n$ as follows:

$$v_n = \gamma v_{n-1} + (1 - \gamma)\,(\nabla L(w_n))^2$$

and define

$$w_{n+1} = w_n - \frac{\eta}{\sqrt{v_n + \epsilon}}\,\nabla L(w_n)$$

where $\eta$ as usual is the learning rate, $\gamma$ is the decay rate of $v_n$, and $\epsilon$ is a small constant that also needs to be fine-tuned.

We include the constant term $\epsilon$ in order to ensure that the sequence $(w_n)$ actually converges and to ensure numerical stability. If we are near the minimum, then $\nabla L(w_n)$ will be quite small, meaning the denominator $\sqrt{v_n + \epsilon}$ will essentially just become $\sqrt{\epsilon}$. But because gradient descent converges when $\eta$ is merely multiplied by a constant (this is the underlying assumption of standard gradient descent, after all), we will achieve convergence near a minimum.

Side note: in order to get RMSProp to interoperate with stochastic gradient descent, we instead compute the sequence $v_n$ from the sampled gradients of the approximated loss functions $L_i$.
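
A minimal sketch of the update (here `grad_L` may equally be a sampled minibatch gradient, per the side note above; defaults are illustrative):

```python
import numpy as np

def rmsprop(grad_L, w0, eta=0.001, gamma=0.9, eps=1e-8, steps=1000):
    """RMSProp: divide the step by the running RMS of past gradients."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_L(w)
        v = gamma * v + (1 - gamma) * g**2  # running average of squared gradients
        w = w - eta / np.sqrt(v + eps) * g  # epsilon inside the root, as above
    return w
```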

Adam

Adam (Adaptive Moment Estimation) is a gradient descent modification that combines Momentum and RMSProp. We create two auxiliary variables, $m_n$ and $v_n$ (both initialized to zero, i.e. $m_0 = v_0 = 0$), while iterating $w_n$ (where $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are decay parameters that need to be fine-tuned, and $\epsilon$ is a parameter serving the same purpose as in RMSProp):

$$m_n = \beta_1 m_{n-1} + (1 - \beta_1)\,\nabla L(w_n)$$

$$v_n = \beta_2 v_{n-1} + (1 - \beta_2)\,(\nabla L(w_n))^2$$

These raw estimates are biased toward zero by the zero initialization, so we define the bias-corrected versions

$$\hat{m}_n = \frac{m_n}{1 - \beta_1^n}$$

$$\hat{v}_n = \frac{v_n}{1 - \beta_2^n}$$
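
A quick check of why this correction is needed: since $m_0 = 0$, the first estimate is $m_1 = (1 - \beta_1)\,\nabla L(w_1)$, which underestimates the gradient by a factor of $1 - \beta_1$, and dividing by $1 - \beta_1^1$ cancels this exactly. More generally, if the gradient were a constant $g$, then $m_n = (1 - \beta_1^n)\,g$, so $\hat{m}_n = g$, as desired.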

Then our update rule to get $w_{n+1}$ is

$$w_{n+1} = w_n - \frac{\eta}{\sqrt{\hat{v}_n} + \epsilon}\,\hat{m}_n.$$

It is worth noting that though this formula does not explicitly include $\nabla L(w_n)$, it is accounted for in the $\hat{m}_n$ term through $m_n$.
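
Putting the pieces together, a minimal sketch of the full Adam loop (the hyperparameter defaults follow common convention and are illustrative):

```python
import numpy as np

def adam(grad_L, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: momentum-style first moment plus RMSProp-style second moment,
    both bias-corrected before the update."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)  # m_0 = 0
    v = np.zeros_like(w)  # v_0 = 0
    for n in range(1, steps + 1):
        g = grad_L(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (Momentum)
        v = beta2 * v + (1 - beta2) * g**2   # second moment (RMSProp)
        m_hat = m / (1 - beta1**n)           # bias corrections
        v_hat = v / (1 - beta2**n)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```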