Xavier (Glorot) Initialization

Xavier Initialization (also known as Glorot Initialization) is a specific strategy for setting the random starting weights of a neural network. It was proposed by Xavier Glorot and Yoshua Bengio in 2010 to solve the instability problems that plagued deep network training at the time.

Its primary goal is to keep the signal from dying out (vanishing) or blowing up (exploding) as it passes through deep layers.


1. The Intuition: The “Goldilocks” Zone

When a network is initialized, we want the “loudness” (variance) of the signal to stay roughly the same from the first layer to the last.

  • If the weights are too small, the signal shrinks with every layer until it vanishes to 0.
  • If the weights are too large, the signal grows until it explodes or saturates.

Xavier Initialization calculates the exact random distribution needed to ensure that the variance of the outputs is the same as the variance of the inputs for any given layer.
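
A quick way to see this balance is to push a random batch through a deep stack of tanh layers and watch the standard deviation of the activations. Below is a minimal NumPy sketch (the layer size, depth, and the “too small” / “too large” scales are arbitrary choices for illustration); it uses the Xavier standard deviation derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20                      # neurons per layer, number of layers
x = rng.standard_normal((1000, n))      # a random input batch

for scale_name, std in [("too small", 0.01),
                        ("xavier", np.sqrt(2.0 / (n + n))),
                        ("too large", 1.0)]:
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(n, n))   # fresh weights for each layer
        h = np.tanh(h @ W)
    # "too small" collapses toward 0, "too large" saturates near 1,
    # the Xavier scale keeps the signal in a usable range.
    print(f"{scale_name:9s} -> std after {depth} tanh layers: {h.std():.4f}")
```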

2. The Math

To achieve this balance, Xavier initialization sets the weights based on the number of neurons entering the layer ($n_{\text{in}}$, the fan-in) and leaving the layer ($n_{\text{out}}$, the fan-out). Both counts appear because the derivation balances the variance of the forward activations (which depends on $n_{\text{in}}$) with the variance of the backward-flowing gradients (which depends on $n_{\text{out}}$).

The algorithm draws weights from a zero-mean distribution with a variance of:

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

There are two ways to implement this distribution:

A. Normal Distribution (Gaussian)

Draw weights from a centered normal distribution with standard deviation:

$$\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$$

B. Uniform Distribution

Draw weights from a uniform range:

$$W \sim \mathcal{U}\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; +\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$$

(Note: the 6 appears because a uniform distribution on $[-a, a]$ has a variance of only $a^2/3$, so the range must be widened to $a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$ to hit the same target variance of $2 / (n_{\text{in}} + n_{\text{out}})$.)
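
Putting the two draws together, here is a minimal NumPy sketch (the function names `xavier_normal` and `xavier_uniform` are my own, not a library API):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Target variance 2 / (n_in + n_out)  ->  std = sqrt(2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # U[-a, a] has variance a^2 / 3, so a = sqrt(6 / (n_in + n_out))
    # yields the same target variance 2 / (n_in + n_out).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(784, 256)
print(W.std(), np.sqrt(2.0 / (784 + 256)))   # empirical std vs. target std
```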

3. When To Use It?

This is the critical part. Xavier Initialization is not a one-size-fits-all solution. It is mathematically derived under the assumption that the activation function is roughly linear around zero (like the central region of a Tanh or Sigmoid curve).

| Activation Function | Best Initialization | Why? |
| --- | --- | --- |
| Sigmoid / Tanh | Xavier (Glorot) | These functions are symmetric around 0. Xavier keeps the signal in the “linear” middle part of the S-curve, where learning is fastest. |
| ReLU | He Initialization | ReLU kills half the signal (negative values become 0). Xavier is too weak for ReLU and the signal dies out; He initialization doubles the variance to compensate. |
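
In practice you rarely hand-roll these draws; deep learning frameworks ship both schemes. A short PyTorch sketch (layer sizes are arbitrary) matching the table above:

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
relu_layer = nn.Linear(256, 128)

# Xavier/Glorot for a tanh layer; the gain rescales the variance for the activation.
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))

# He/Kaiming for a ReLU layer: roughly doubles the variance to compensate
# for the half of the signal that ReLU zeroes out.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
```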

2022-12-07-exploding-gradient