**Normalization techniques other than batch norm:**

(pics.spark-in.me/upload/aecc2c5f

**Weight normalization** (used in TCN arxiv.org/abs/1602.07868):

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAES + DQN);

**Instance norm** (used in [style transfer](arxiv.org/abs/1607.08022))

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

**Layer norm** (used in Transformers, [paper](arxiv.org/abs/1607.06450))

- Designed especially for sequntial networks;

- Computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated separately over the last certain number dimensions;

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies **per-element scale and bias**;