Notes on various riffs on Gradient Descent from the perspective of neural networks.
[[Category: Gradient Descent]]
<span id="a-review-of-standard-gradient-descent"></span>
== A review of standard Gradient Descent ==
The goal of Gradient Descent is to minimize a loss function <math display="inline">L</math>. To be more specific, if <math display="inline">L : \mathbb R^n \to \mathbb R</math> is a differentiable multivariate function, we want to find the vector <math display="inline">w</math> that minimizes <math display="inline">L(w)</math>.
Given an initial vector <math display="inline">w_0</math>, we want to “move” in a direction <math display="inline">\Delta w</math> for which the decrease <math display="inline">L(w_0) - L(w_0 + \Delta w)</math> is maximized (suppose the magnitude of <math display="inline">\Delta w</math> is fixed). By the Cauchy–Schwarz inequality applied to the first-order approximation of <math display="inline">L</math>, this is precisely when <math display="inline">\Delta w</math> points in the direction of <math display="inline">-\nabla L(w_0)</math>.
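To spell out that step: using the first-order approximation and Cauchy–Schwarz,
<math display="block">L(w_0 + \Delta w) \approx L(w_0) + \nabla L(w_0) \cdot \Delta w \ge L(w_0) - \|\nabla L(w_0)\|\,\|\Delta w\|,</math>
with equality exactly when <math display="inline">\Delta w</math> is a negative scalar multiple of <math display="inline">\nabla L(w_0)</math>, i.e. when it points along <math display="inline">-\nabla L(w_0)</math>.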
So given some <math display="inline">w_n</math>, we want to update in the direction of <math display="inline">-\alpha \nabla L(w_n)</math>. This motivates setting <math display="inline">w_{n+1} = w_n - \alpha \nabla L(w_n)</math>, where <math display="inline">\alpha</math> is a scalar factor. We call <math display="inline">\alpha</math> the “learning rate” because it governs how fast the sequence <math display="inline">w_n</math> converges to the optimum. Much of the practical difficulty is tuning <math display="inline">\alpha</math> so that the iteration converges quickly and reliably, and that is one of the considerations the remaining algorithms try to address.
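As a concrete illustration, here is a minimal Python sketch of this iteration; the names <code>grad_L</code> and <code>gradient_descent</code> and the quadratic example loss are assumptions for the example, not part of any particular library.
<syntaxhighlight lang="python">
import numpy as np

def gradient_descent(grad_L, w0, alpha=0.1, num_steps=100):
    """Iterate w_{n+1} = w_n - alpha * grad_L(w_n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - alpha * grad_L(w)
    return w

# Example: minimize L(w) = ||w||^2, whose gradient is 2w; the minimizer is the origin.
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
</syntaxhighlight>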
<span id="stochastic-gradient-descent"></span>
== Stochastic Gradient Descent ==
In practice we don’t actually know the “true gradient”. Instead, we take some datasets, say datasets <math display="inline">1</math> through <math display="inline">N</math>, and for dataset <math display="inline">i</math> we derive an estimated gradient <math display="inline">\nabla L_i</math>. Then we may estimate <math display="inline">\nabla L</math> as
<math display="block">\frac{\nabla L_1 + \cdots + \nabla L_N}{N}.</math>
If it is easy to compute <math display="inline">\nabla L_i(w)</math> in general then we are golden: this is the best estimate of <math display="inline">\nabla L</math> we can get. But what if the <math display="inline">\nabla L_i</math> are computationally expensive to compute? Then there is a tradeoff between variance and computational cost when evaluating our estimate of <math display="inline">\nabla L</math>.
A very low-cost (but high-variance) way to estimate <math display="inline">\nabla L</math> is just via <math display="inline">\nabla L_1</math> (or any other single <math display="inline">\nabla L_i</math>). But this is obviously problematic: we aren’t even using most of our data! A better balance can be struck as follows: to evaluate <math display="inline">\nabla L(w_n)</math>, select <math display="inline">k</math> functions at random from <math display="inline">\{\nabla L_1, \ldots, \nabla L_N\}</math>, and estimate <math display="inline">\nabla L</math> as the average of those <math display="inline">k</math> functions ''only at that step''. This is mini-batch stochastic gradient descent, with <math display="inline">k</math> the batch size.
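A minimal Python sketch of this scheme, assuming the per-dataset gradients are available as a list of callables <code>grad_Ls</code> (a hypothetical name for this example):
<syntaxhighlight lang="python">
import numpy as np

def sgd(grad_Ls, w0, alpha=0.1, k=4, num_steps=100, seed=0):
    """At each step, estimate grad L as the average of k randomly chosen grad L_i."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        # Pick k of the N per-dataset gradients (assumes k <= len(grad_Ls)).
        batch = rng.choice(len(grad_Ls), size=k, replace=False)
        grad_est = np.mean([grad_Ls[i](w) for i in batch], axis=0)
        w = w - alpha * grad_est
    return w
</syntaxhighlight>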
<span id="riffs-on-stochastic-gradient-descent"></span>
== Riffs on stochastic gradient descent ==
<span id="momentum"></span>
=== Momentum ===
See also [https://distill.pub/2017/momentum/ “Momentum” on Distill].
In typical stochastic gradient descent, the next step we take is based solely on the gradient at the current point, which completely ignores the past gradients. However, it often makes sense to take the past gradients into account. Of course, if we are at <math display="inline">w_{100}</math>, we should care about <math display="inline">\nabla L(w_{99})</math> much more than <math display="inline">\nabla L(w_1)</math>, so recent gradients should be weighted more heavily than old ones.
The simplest way is to weight them geometrically. So when we iterate, instead of taking <math display="inline">w_{n+1}</math> to satisfy
<math display="block">w_{n+1} - w_n = -\alpha \nabla L(w_n)</math>
like in standard gradient descent, we instead want to take <math display="inline">w_{n+1}</math> to satisfy
<math display="block">w_{n+1} - w_n = -\alpha \nabla L(w_n) - \beta \alpha \nabla L(w_{n-1}) - \cdots - \beta^n \alpha \nabla L(w_0).</math>
But this raises a concern: are we really going to be storing all of these terms, especially as <math display="inline">n</math> grows? Fortunately, we do not need to. For we may notice that
<math display="block">w_{n+1} - w_n = -\alpha \nabla L(w_n) - \beta \left( \alpha \nabla L(w_{n-1}) + \beta \alpha \nabla L(w_{n-2}) + \cdots + \beta^{n-1} \alpha \nabla L(w_0) \right) = -\alpha \nabla L(w_n) + \beta(w_n - w_{n-1}).</math>
To put it another way, if we write <math display="inline">\Delta w_n = w_n - w_{n-1}</math>, i.e. the amount by which <math display="inline">w_n</math> differs from <math display="inline">w_{n-1}</math>, we may rewrite this recurrence as
<math display="block">\Delta w_{n+1} = -\alpha \nabla L(w_n) + \beta \Delta w_n.</math>
Some of the benefits of using a momentum-based approach:
* Most importantly, ''it can dramatically speed up convergence to a local minimum''.
* It makes convergence more likely in general.
* It can help escape local minima, saddle points, and plateaus (though how much this matters is contested; see [https://www.reddit.com/r/MachineLearning/comments/dqbp9g/d_momentum_methods_helps_to_escape_local_minima/ this reddit thread]).
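A minimal Python sketch of the recurrence <math display="inline">\Delta w_{n+1} = -\alpha \nabla L(w_n) + \beta \Delta w_n</math>; the function and argument names are hypothetical, chosen only for illustration.
<syntaxhighlight lang="python">
import numpy as np

def momentum_gd(grad_L, w0, alpha=0.1, beta=0.9, num_steps=100):
    """Gradient descent with momentum: delta = -alpha * grad + beta * previous delta."""
    w = np.asarray(w0, dtype=float)
    delta = np.zeros_like(w)   # Delta w_0 starts at zero
    for _ in range(num_steps):
        delta = -alpha * grad_L(w) + beta * delta
        w = w + delta
    return w
</syntaxhighlight>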
<span id="rmsprop"></span>
=== RMSProp ===
Gradient descent also often suffers from effectively diminishing step sizes: as the gradient shrinks, so does each update. To counter this, we very broadly want to:
* track the magnitudes of past gradients,
* if they have been small, scale <math display="inline">\Delta w_{n+1}</math> up to increase the effective learning rate,
* and (as a side effect) if they have been large, temper the effective learning rate.
While performing our gradient descent to get <math display="inline">w_n \to w_{n+1}</math>, we create and store an auxiliary parameter <math display="inline">v_{n+1}</math> as follows:
<math display="block">v_{n+1} = \beta v_n + (1 - \beta) \nabla L(w_n)^2</math>
and define
<math display="block">w_{n+1} = w_n - \frac{\alpha}{\sqrt{v_{n+1}} + \epsilon} \nabla L(w_n),</math>
where <math display="inline">\alpha</math> as usual is the learning rate, <math display="inline">\beta</math> is the decay rate of <math display="inline">v_n</math>, and <math display="inline">\epsilon</math> is a small constant that also needs to be chosen. (The square of the gradient and the division by <math display="inline">\sqrt{v_{n+1}} + \epsilon</math> are taken elementwise.)
We include the constant term <math display="inline">\epsilon</math> to ensure numerical stability and that the sequence <math display="inline">w_n</math> can actually converge. If we are near the minimum, then <math display="inline">v_n</math> will be quite small, so without <math display="inline">\epsilon</math> the denominator <math display="inline">\sqrt{v_n}</math> would approach zero and the steps could blow up. With <math display="inline">\epsilon</math> included, the denominator is bounded below, so near a minimum the update is essentially the gradient multiplied by a bounded constant, and (this being the underlying assumption of standard gradient descent) we retain convergence when near a minimum.
Side note: in order to get RMSProp to interoperate with stochastic gradient descent, we compute the sequence <math display="inline">v_n</math> using the stochastic (mini-batch) gradient estimates in place of <math display="inline">\nabla L(w_n)</math>.
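A minimal Python sketch of these updates; the names and default constants are illustrative assumptions, not a reference implementation.
<syntaxhighlight lang="python">
import numpy as np

def rmsprop(grad_L, w0, alpha=0.01, beta=0.9, eps=1e-8, num_steps=100):
    """RMSProp: keep a decaying average of squared gradients, then scale each step by it."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad_L(w)
        v = beta * v + (1 - beta) * g**2         # elementwise squared gradient
        w = w - alpha * g / (np.sqrt(v) + eps)   # elementwise scaling of the step
    return w
</syntaxhighlight>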
<span id="adam"></span>
=== Adam ===
Adam ('''Ada'''ptive '''M'''oment Estimation) is a gradient descent modification that combines Momentum and RMSProp. We create two auxiliary variables while iterating <math display="inline">w_n</math> (where <math display="inline">\alpha</math> is the learning rate, <math display="inline">\beta_1</math> and <math display="inline">\beta_2</math> are decay parameters that need to be fine-tuned, and <math display="inline">\epsilon</math> is a constant serving the same purpose as in RMSProp):
<math display="block">m_{n+1} = \beta_1 m_n + (1 - \beta_1) \nabla L(w_n)</math>
<math display="block">v_{n+1} = \beta_2 v_n + (1 - \beta_2) \nabla L(w_n)^2.</math>
For notational convenience, we also define the bias-corrected moments
<math display="block">\widehat{m}_n = \frac{m_n}{1 - \beta_1^n}</math>
<math display="block">\widehat{v}_n = \frac{v_n}{1 - \beta_2^n}.</math>
Then our update function to get <math display="inline">w_{n+1}</math> is
<math display="block">w_{n+1} = w_n - \alpha \frac{\widehat{m}_{n+1}}{\sqrt{\widehat{v}_{n+1}} + \epsilon}.</math>
It is worth noting that though this formula does not explicitly include <math display="inline">\nabla L(w_n)</math>, it is accounted for in the <math display="inline">\widehat{m}_{n+1}</math> term through <math display="inline">m_{n+1}</math>.
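A minimal Python sketch putting these pieces together; the names and default constants are illustrative assumptions rather than a reference implementation.
<syntaxhighlight lang="python">
import numpy as np

def adam(grad_L, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=100):
    """Adam: momentum-style first moment plus RMSProp-style second moment, with bias correction."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for n in range(1, num_steps + 1):
        g = grad_L(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g**2       # second moment (RMSProp-style)
        m_hat = m / (1 - beta1**n)               # bias-corrected first moment
        v_hat = v / (1 - beta2**n)               # bias-corrected second moment
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w
</syntaxhighlight>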