Optimization: Stochastic Gradient Descent and Backpropagation

Recap

We have a score function: $s=f(x ; W) \stackrel{\text { e.g. }}{=} W x$
We have a loss function:
- Softmax: $L_{i}=-\log \left(\frac{e^{s_{y_{i}}}}{\sum_{j} e^{s_{j}}}\right)$
- SVM: $L_{i}=\sum_{j \neq y_{i}} \max \left(0, s_{j}-s_{y_{i}}+1\right)$
Thus, the full loss is $L=\frac{1}{N} \sum_{i=1}^{N} L_{i}+R(W)$
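To make the two per-example losses and the full loss concrete, here is a minimal sketch in plain Python. The function names (`softmax_loss`, `svm_loss`, `full_loss`) and the `reg_term` argument are illustrative choices, not part of the original notes; `reg_term` is assumed to be the already-computed value of $R(W)$.

```python
import math

def softmax_loss(scores, y):
    """L_i = -log(e^{s_y} / sum_j e^{s_j}) for one example with true class y."""
    shift = max(scores)  # subtracting the max leaves the loss unchanged but avoids overflow
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - (scores[y] - shift)

def svm_loss(scores, y, margin=1.0):
    """L_i = sum_{j != y} max(0, s_j - s_y + margin) for one example."""
    return sum(max(0.0, s - scores[y] + margin)
               for j, s in enumerate(scores) if j != y)

def full_loss(all_scores, labels, loss_fn, reg_term=0.0):
    """L = (1/N) * sum_i L_i + R(W); reg_term stands in for R(W) (assumption)."""
    n = len(labels)
    data_loss = sum(loss_fn(s, y) for s, y in zip(all_scores, labels)) / n
    return data_loss + reg_term
```

For example, `svm_loss([3.2, 5.1, -1.7], 0)` only penalizes the second class, because the third class already scores lower than the true class by more than the margin.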

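In practice the regularization term is written $\lambda R(W)$, where $\lambda$ controls the strength of the regularization penalty. As $\lambda$ increases we get a smoother decision boundary, basically by limiting the flexibility of the $W$ matrix. We do not want $W$ to be too flexible, because too much flexibility makes it easy to overfit the training set. Training is the process of molding the shape of the network so that it fits the training data, much like shaping modeling clay: if the shape matches the training data too closely, the model loses its ability to generalize. $\lambda$ also controls how strongly the gradient of the regularization term contributes during backpropagation.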
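As a small illustration of how $\lambda$ scales both the penalty and its gradient, here is a sketch that assumes L2 regularization, $R(W)=\sum_{k,l} W_{k,l}^{2}$. The notes do not say which $R$ is used, so the L2 choice and the function names are assumptions.

```python
def l2_penalty(W, lam):
    """lambda * R(W), with R(W) = sum of squared weights (L2 is an assumption)."""
    return lam * sum(w * w for row in W for w in row)

def l2_penalty_grad(W, lam):
    """Gradient of lambda * R(W) with respect to W: 2 * lambda * W."""
    return [[2.0 * lam * w for w in row] for row in W]
```

Because the gradient is proportional to both $\lambda$ and $W$ itself, a larger $\lambda$ pulls every weight more strongly toward zero at each update.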

Gradient descent

Numerical gradient: approximate, slow, easy to write
Analytic gradient: exact, fast, error-prone

In practice, we always use the analytic gradient, but we check its implementation against the numerical gradient. This is called a gradient check.
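A gradient check can be sketched as follows. The toy function `f`, its hand-derived gradient `analytic_grad`, the step size `h`, and the relative-error formula are all illustrative assumptions rather than something taken from the notes; the numerical gradient here uses a centered difference.

```python
def numerical_gradient(f, w, h=1e-5):
    """Centered-difference estimate of df/dw_i, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        old = w[i]
        w[i] = old + h
        f_plus = f(w)
        w[i] = old - h
        f_minus = f(w)
        w[i] = old  # restore the original value before moving on
        grad.append((f_plus - f_minus) / (2.0 * h))
    return grad

def relative_error(a, b):
    """Largest per-coordinate relative difference between two gradients."""
    return max(abs(x - y) / max(1e-8, abs(x) + abs(y)) for x, y in zip(a, b))

# Hypothetical toy objective: f(w) = w0^2 + w1^2 + w2^2 + w0 * w1
def f(w):
    return sum(x * x for x in w) + w[0] * w[1]

# Its gradient, derived by hand (the "analytic" gradient being checked)
def analytic_grad(w):
    return [2 * w[0] + w[1], 2 * w[1] + w[0], 2 * w[2]]

w = [1.0, 2.0, 0.5]
err = relative_error(numerical_gradient(f, w), analytic_grad(w))
print("relative error:", err)
```

A very small relative error suggests the analytic gradient is implemented correctly. The loop also shows why the numerical gradient is slow: it needs two evaluations of `f` per parameter.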

Mini-batch Gradient Descent

Only use a small portion of the training set (a random subset) to compute the gradient.

This is just like sampling, since the data loss is the average of the per-example losses.

The goal is to estimate the gradient.
There is a trade-off between accuracy and computation.
There is no point in doing more computation if it won't change the updates.

We compute the gradient on one mini-batch, take one step, and update $W$ once. Then we draw another mini-batch of the same fixed size, take a step, and update $W$ again. An individual update may increase the loss on some examples, but on average the loss decreases.
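The loop described above can be sketched on a toy problem. This example fits a one-dimensional linear model $y \approx w x + b$ with a squared-error loss. The model, the loss, the synthetic data, and the hyperparameters (`batch_size`, `lr`, `steps`) are all assumptions chosen for illustration, not the classifier setup from the recap.

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, lr=0.1, steps=1000, seed=0):
    """Mini-batch SGD on the mean squared error of the model w * x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Draw a random subset of fixed size to estimate the gradient
        batch = rng.sample(range(n), batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * err * xs[i] / batch_size
            grad_b += 2.0 * err / batch_size
        # One parameter update per mini-batch
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1 (hypothetical example)
xs = [i / 50.0 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

Each step sees only 8 of the 100 examples, so each gradient is a noisy estimate of the full-data gradient, yet the parameters still move toward the values that generated the data.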
