March 21, 11:15

Normalization techniques other than batch norm:

(pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN arxiv.org/abs/1602.07868):

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAES + DQN);

Instance norm (used in [style transfer](arxiv.org/abs/1607.08022))

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](arxiv.org/abs/1607.06450))

- Designed especially for sequntial networks;

- Computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated separately over the last certain number dimensions;

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;

#deep_learning

#nlp