Normalization: e.g., zero-center the data and scale each feature to unit variance, so that features measured in different units are not on wildly different scales.
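A minimal sketch of this per-feature normalization, assuming NumPy and an illustrative data matrix `X` of shape N x D (the shapes and scales below are made up for demonstration):

```python
import numpy as np

# Illustrative data: 100 examples, 3 features on very different scales/units.
X = np.random.randn(100, 3) * np.array([1.0, 50.0, 0.01])

# Zero-center each feature, then scale to unit variance.
X_centered = X - X.mean(axis=0)
X_normalized = X_centered / (X.std(axis=0) + 1e-8)  # small epsilon avoids division by zero
```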
Weight initialization (initial W)
- Small random numbers (e.g., Gaussian with zero mean and a small standard deviation)
Works okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
- Neither a scale of 1.0 nor 0.01 works well in deeper networks: with 1.0 the activations saturate at the nonlinearity, and with 0.01 they shrink toward zero layer by layer, so the signal "disappears" (see the initialization sketch after the batch norm equation below).
- Batch normalization!
Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:
$$\widehat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right]}}$$
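Returning to weight initialization: a small sketch of why the scale matters, assuming a toy network of 10 tanh layers of 500 units each (the layer sizes, nonlinearity, and batch size are illustrative assumptions, not part of the notes):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)   # a batch of inputs
hidden_sizes = [500] * 10        # 10 hidden layers of 500 units

h = x
for i, size in enumerate(hidden_sizes):
    W = 0.01 * np.random.randn(h.shape[1], size)   # "small random numbers" init
    h = np.tanh(h.dot(W))
    print(f"layer {i + 1}: std of activations = {h.std():.6f}")

# With the 0.01 scale the std of the activations shrinks toward 0 layer by layer
# (the signal "disappears"); with a 1.0 scale the tanh units saturate at -1/+1 instead.
```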
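A sketch of this normalization step in NumPy (the function name and `eps` constant are assumptions; the full batch norm layer also learns a per-dimension scale and shift, which is omitted here because the notes show only the normalization step):

```python
import numpy as np

def batchnorm_normalize(x, eps=1e-5):
    """Normalize each dimension of a batch to zero mean and unit variance.

    x: array of shape (N, D), a batch of N activations with D dimensions.
    """
    mean = x.mean(axis=0)                 # E[x^(k)] over the batch, per dimension
    var = x.var(axis=0)                   # Var[x^(k)] over the batch, per dimension
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat

# Example: a random batch whose dimensions are far from unit Gaussian.
x = np.random.randn(32, 100) * 5.0 + 2.0
x_hat = batchnorm_normalize(x)
print(x_hat.mean(axis=0)[:3], x_hat.std(axis=0)[:3])  # ~0 and ~1 per dimension
```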