Xavier (Glorot) Initialization

Xavier Initialization (also known as Glorot Initialization) is a specific strategy for setting the random starting weights of a neural network. It was proposed by Xavier Glorot and Yoshua Bengio in 2010 to solve the instability problems that plagued deep network training at the time.

Its primary goal is to keep the signal from dying out (vanishing) or blowing up (exploding) as it passes through deep layers.


1. The Intuition: The “Goldilocks” Zone

When a network is initialized, we want the “loudness” (variance) of the signal to stay roughly the same from the first layer to the last.

  • If the weights are too small, the signal shrinks with every layer until it vanishes to 0.
  • If the weights are too large, the signal grows until it explodes or saturates.

Xavier Initialization calculates the exact random distribution needed to ensure that the variance of the outputs is the same as the variance of the inputs for any given layer.
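
A quick way to see this balance is to push a random batch through a deep stack of tanh layers and watch the standard deviation of the activations. Below is a minimal NumPy sketch (the layer size, depth, and the “too small” / “too large” scales are arbitrary choices for illustration); it uses the Xavier standard deviation derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20                      # neurons per layer, number of layers
x = rng.standard_normal((1000, n))      # a random input batch

for scale_name, std in [("too small", 0.01),
                        ("xavier", np.sqrt(2.0 / (n + n))),
                        ("too large", 1.0)]:
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(n, n))   # fresh weights for each layer
        h = np.tanh(h @ W)
    # "too small" collapses toward 0, "too large" saturates near 1,
    # the Xavier scale keeps the signal in a usable range.
    print(f"{scale_name:9s} -> std after {depth} tanh layers: {h.std():.4f}")
```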

2. The Math

To achieve this balance, Xavier initialization sets the weights based on the number of neurons entering the layer ($n_{\text{in}}$, the fan-in) and leaving the layer ($n_{\text{out}}$, the fan-out). Both counts appear because the derivation balances the variance of the forward activations (which depends on $n_{\text{in}}$) with the variance of the backward-flowing gradients (which depends on $n_{\text{out}}$).

The algorithm draws weights from a zero-mean distribution with a variance of:

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

There are two ways to implement this distribution:

A. Normal Distribution (Gaussian)

Draw weights from a centered normal distribution with standard deviation:

$$\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$$

B. Uniform Distribution

Draw weights from a uniform range:

$$W \sim \mathcal{U}\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; +\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$$

(Note: the 6 appears because a uniform distribution on $[-a, a]$ has a variance of only $a^2/3$, so the range must be widened to $a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$ to hit the same target variance of $2 / (n_{\text{in}} + n_{\text{out}})$.)
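
Putting the two draws together, here is a minimal NumPy sketch (the function names `xavier_normal` and `xavier_uniform` are my own, not a library API):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Target variance 2 / (n_in + n_out)  ->  std = sqrt(2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # U[-a, a] has variance a^2 / 3, so a = sqrt(6 / (n_in + n_out))
    # yields the same target variance 2 / (n_in + n_out).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(784, 256)
print(W.std(), np.sqrt(2.0 / (784 + 256)))   # empirical std vs. target std
```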

3. When To Use It?

This is the critical part. Xavier Initialization is not a one-size-fits-all solution. It is mathematically derived under the assumption that the activation function is roughly linear around zero (like the central region of a Tanh or Sigmoid curve).

| Activation Function | Best Initialization | Why? |
| --- | --- | --- |
| Sigmoid / Tanh | Xavier (Glorot) | These functions are symmetric around 0. Xavier keeps the signal in the “linear” middle part of the S-curve, where learning is fastest. |
| ReLU | He Initialization | ReLU kills half the signal (negative values become 0). Xavier is too weak for ReLU and the signal dies out; He initialization doubles the variance to compensate. |
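
In practice you rarely hand-roll these draws; deep learning frameworks ship both schemes. A short PyTorch sketch (layer sizes are arbitrary) matching the table above:

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
relu_layer = nn.Linear(256, 128)

# Xavier/Glorot for a tanh layer; the gain rescales the variance for the activation.
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))

# He/Kaiming for a ReLU layer: roughly doubles the variance to compensate
# for the half of the signal that ReLU zeroes out.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
```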

2022-12-07-exploding-gradient