Daniel R. Kim, MD

Batch Normalization Intuition

11 May 2017

Some comments on LeCun 1998, which gives a nice explanation of why inputs should be zero-centered and, by extension, a good intuition for batch normalization.

Suppose we have an input vector $x$ of length $k$ feeding into a weight layer $W$. Consider one node $w$ in the layer; it is a vector of $k$ weights. In backprop, the partial derivative of the loss with respect to $w_i$ is proportional to $x_i$, so the signs of the input variables affect the direction in which each weight updates.
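
To make the proportionality explicit, write $a = w \cdot x$ for the node's pre-activation; then by the chain rule

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial w_i} = \frac{\partial L}{\partial a}\, x_i.$$

The factor $\partial L / \partial a$ is shared by every component of the gradient; only $x_i$ differs.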

We want the weights of the node to converge (i.e., find a solution). Suppose we added a constant to all the input examples so that every example was positive in every input variable. Each partial derivative $\partial L / \partial w_i$ shares the same upstream factor, the gradient flowing into the node's output (their common parent in the computational graph), and since every $x_i$ is positive, all components of the weight gradient have the same sign: on any given update, every $w_i$ moves in the same direction. Assuming online training, $w$ can therefore only move along directions whose components are all positive or all negative, so reaching a solution that requires some weights to increase and others to decrease forces an inefficient zigzag path.
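
As a sanity check, here is a minimal numpy sketch of the effect, with made-up data and a squared-error loss chosen purely for illustration: because every input variable is positive, every component of the node's gradient shares the sign of the upstream error, so each online update moves $w$ in an all-positive or all-negative direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear node with k weights, trained online on a squared-error loss.
k = 5
w = rng.normal(size=k)
X = rng.uniform(1.0, 2.0, size=(100, k))   # all input variables strictly positive
y = rng.normal(size=100)                   # arbitrary targets

for x, t in zip(X, y):
    a = w @ x                  # pre-activation of the node
    delta = a - t              # dL/da for L = 0.5 * (a - t)**2
    grad = delta * x           # dL/dw_i = delta * x_i
    # Because every x_i > 0, all components of grad share the sign of delta,
    # so each update moves w along an all-positive or all-negative direction.
    assert np.all(np.sign(grad) == np.sign(delta))
    w -= 0.01 * grad
```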

In general, the updates to $w$ will be biased toward the direction of the mean input vector. So we zero-center each input variable, which lets $w$ update freely in any direction.
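
Concretely, zero-centering is just per-variable mean subtraction over the training set; a tiny sketch with made-up numbers:

```python
import numpy as np

X = np.array([[1.0, 3.0, 2.0],
              [2.0, 5.0, 4.0],
              [3.0, 4.0, 6.0]])        # every input variable is positive

# Zero-center each variable (column) using the training-set mean.
X_centered = X - X.mean(axis=0)
print(X_centered)
# [[-1. -1. -2.]
#  [ 0.  1.  0.]
#  [ 1.  0.  2.]]
# Each column now has mixed signs, so the components of delta * x are no
# longer forced to all move in the same direction.
```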

This slowdown in convergence applies to neurons with any activation function. I find this justification for removing “internal covariate shift” more compelling than the example given in the Batch Normalization paper (Ioffe & Szegedy 2015), which talks about how sigmoid neurons can saturate and effectively stop learning when their inputs drift away from zero mean and the linear combinations grow large in magnitude.

Batch normalization tries to zero-center and normalize the outputs of layers that feed into other layers. The problem is that, since we don’t know what the parameters of the network will be prior to training, we can’t shift and scale the layer outputs by hand. What we do know is that at every step of training, the mini-batch outputs at a layer are a sample from that layer’s distribution at that point in time, so we can estimate the layer statistics (mean and variance) from the mini-batch and use them to shift and scale prior to the activation. The BN layers also keep running estimates of these statistics during training, so at test time you don’t need mini-batches.
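
Here is a minimal numpy sketch of that training-time forward pass: normalize the mini-batch pre-activations with the batch mean and variance, apply the learned scale and shift ($\gamma$, $\beta$ from the paper, included so the sketch is complete), and track running statistics for test time. The exponential-moving-average update below is one common implementation choice, not necessarily the paper’s exact procedure, and the names and shapes are made up.

```python
import numpy as np

def batchnorm_forward(a, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Batch-normalize pre-activations a of shape (batch, features).

    gamma, beta: learned scale and shift, shape (features,).
    running_mean, running_var: statistics tracked for test time, shape (features,).
    """
    if training:
        mu = a.mean(axis=0)              # per-feature mini-batch mean
        var = a.var(axis=0)              # per-feature mini-batch variance
        # Update the running statistics used at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # no mini-batch at test time

    a_hat = (a - mu) / np.sqrt(var + eps)     # zero-center and normalize
    out = gamma * a_hat + beta                # learned scale and shift
    return out, running_mean, running_var

# Usage: normalize a batch of 32 pre-activations with 64 features.
a = np.random.randn(32, 64)
gamma, beta = np.ones(64), np.zeros(64)
run_mean, run_var = np.zeros(64), np.ones(64)
out, run_mean, run_var = batchnorm_forward(a, gamma, beta, run_mean, run_var)
```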