Normalization: e.g., zero-center the data and scale each feature, so that different features do not sit on wildly different units or ranges.
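
A minimal sketch of this preprocessing step (assuming a NumPy array `X` of shape `(N, D)` holding N examples with D features; the array and scales here are hypothetical):

```python
import numpy as np

# Hypothetical data: 100 examples, 3 features on very different scales.
X = np.random.randn(100, 3) * np.array([1.0, 100.0, 0.01])

# Zero-center each feature, then scale to unit variance.
X_centered = X - X.mean(axis=0)
X_normalized = X_centered / (X_centered.std(axis=0) + 1e-8)  # epsilon guards against zero variance
```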

Weight initialization

  1. Small random numbers

Works okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.

  1. Neither a scale of 1 nor 0.01 works well in deep networks: too large a scale saturates the activations, too small a scale shrinks the signal toward zero layer by layer, so the data "disappear" (see the sketch after this list).
  2. Batch normalization!
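
As a rough illustration (not from the original notes), the sketch below initializes each layer with small random numbers and tracks the standard deviation of activations through a deep tanh network; the layer sizes, depth, and nonlinearity are assumptions chosen for the demo:

```python
import numpy as np

np.random.seed(0)
hidden_size = 500
num_layers = 10
x = np.random.randn(1000, hidden_size)  # hypothetical input batch

for scale in [0.01, 1.0]:
    h = x
    stds = []
    for _ in range(num_layers):
        W = scale * np.random.randn(hidden_size, hidden_size)  # "small random numbers"
        h = np.tanh(h @ W)
        stds.append(h.std())
    # With scale 0.01 the std shrinks toward 0 (the signal "disappears");
    # with scale 1.0 tanh saturates near +/-1 instead.
    print(f"scale={scale}: layer stds {[round(s, 4) for s in stds]}")
```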

Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:

$$\widehat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right]}}$$
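
A minimal NumPy sketch of this normalization step (assuming `x` is a batch of activations of shape `(N, D)`; the learnable scale-and-shift parameters of full batch normalization are omitted here, and `eps` is a small constant for numerical stability):

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    # x: activations of shape (N, D); normalize each dimension k over the batch.
    mean = x.mean(axis=0)   # E[x^(k)]
    var = x.var(axis=0)     # Var[x^(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat

# Usage: each column of x_hat now has roughly zero mean and unit variance.
x = np.random.randn(64, 100) * 5.0 + 3.0
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0)[:3], x_hat.std(axis=0)[:3])
```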