Training a network
Setting the hyper-parameters
When you have defined the architecture and the weights of a DBN, you are ready to train it. You can train it manually, stepping one epoch at a time, or automatically, if you just want a fully trained network. In either case, the first thing to do is to set the training hyper-parameters.
Hyper-parameters are the numbers and coefficients with which you can "customize" the training algorithm. They are called hyper to distinguish them from the weights of the DBN, which are sometimes called "parameters". For instance, Contrastive Divergence depends on the following hyper-parameters (a sketch after this list shows how the main ones enter the weight update):
- Maximum number of epochs. One epoch is one full pass over the training set, i.e. the time in which the network sees every example once; the number of epochs is therefore the number of opportunities that the DBN has to observe each example.
- Size of a mini-batch. The training set can be divided into small subsets, known as mini-batches; the DBN updates its weights only after having seen all the examples in a mini-batch, which allows for faster and more precise learning. The size of a mini-batch must evenly divide the total number of examples. Note that a mini-batch size of 1 is equivalent to online training, while setting the size to the exact number of examples is equivalent to batch (or "one shot") training.
- Learning rate. The learning rate is a coefficient in the range [0, 1] that models the plasticity of the network, i.e. how much the weights of the DBN move at each update.
- Momentum. If a DBN learning from a dataset were a ball rolling down a mountain to reach the valley (the minimum), the momentum would simulate the ball's inertia: each update keeps a fraction of the previous one, so the descent speeds up while the slope keeps pointing in the same direction. The momentum ranges from 0 to 1.
- Weight decay factor. The weight decay factor (in the range [0, 1]) penalizes large weights more than small ones, which prevents the weights from growing indefinitely.
- Sparsity target. The sparsity target (in the range (0, 1]) is how sparse we would like the network to be; a network is sparser when it has fewer active units, i.e. when the hidden representations are more localist.
- Standard deviation of the weights distribution. We have already seen this hyper-parameter while building a DBN; it concerns the DBN itself more than the training algorithm, but it is nonetheless a hyper-parameter: it sets the standard deviation of the probability distribution from which the initial weights are drawn.
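To make the roles of the learning rate, momentum and weight decay concrete, here is a minimal NumPy sketch of a single CD-1 weight update. This is an illustration under simplifying assumptions, not the tool's actual code: the function name cd1_update is made up, bias terms and the sparsity regularizer are omitted for brevity, and the exact update rule used by the application may differ in detail.

```python
import numpy as np

def cd1_update(W, v0, lr=0.1, momentum=0.9, weight_decay=1e-4,
               velocity=None, rng=np.random.default_rng(0)):
    """One CD-1 weight update for an RBM with logistic units.

    W: (n_visible, n_hidden) weight matrix; v0: (batch, n_visible) mini-batch.
    Biases and the sparsity regularizer are omitted for brevity.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h_sample @ W.T)
    h1 = sigmoid(v1 @ W)

    # Gradient estimate: <v h>_data - <v h>_model, averaged over the batch.
    grad = (v0.T @ h0 - v1.T @ h1) / v0.shape[0]

    if velocity is None:
        velocity = np.zeros_like(W)
    # Momentum keeps a fraction of the previous update; weight decay
    # shrinks large weights towards zero; lr scales the whole step.
    velocity = momentum * velocity + lr * (grad - weight_decay * W)
    return W + velocity, velocity
```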
To the right of the "std. dev." field you will find an "init" button. It is not necessary to initialize the weights before training the network, because they will be initialized automatically anyway; this button is meant for analysing the network before training it.
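Conceptually, initialization amounts to drawing each weight from a zero-mean Gaussian with the chosen standard deviation, as in the sketch below; the layer sizes and std. dev. value here are purely illustrative, not the tool's defaults.

```python
import numpy as np

# Illustrative sketch of Gaussian weight initialization.
rng = np.random.default_rng()
n_visible, n_hidden = 784, 500   # example sizes of two adjacent layers
std_dev = 0.01                   # the value of the "std. dev." field
W = rng.normal(loc=0.0, scale=std_dev, size=(n_visible, n_hidden))
```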
Launching the training
When you have chosen the hyper-parameters (or if you want to accept the default ones), you can start training the DBN with one of the two buttons after the "Hyper-parameters" fieldset:
- "1 epoch" trains the network for just one epoch: use this button if you want to train the DBN in a step-by-step fashion.
- "all epochs" trains the network for the number of epochs that you have specified in the "max. epochs" field.
While the network is training, you will see a chart being updated, plotting the reconstruction error against the number of epochs.
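For intuition, here is a hypothetical sketch of how such a curve could be produced outside the GUI, reusing the cd1_update sketch from above; the data, layer sizes and hyper-parameter values are all made up, and the application computes its own error curve internally.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = (rng.random((600, 784)) < 0.5).astype(float)  # stand-in training set
W = rng.normal(0.0, 0.01, size=(784, 500))
velocity, errors = None, []
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for epoch in range(10):                       # "max. epochs"
    for batch in np.split(data, 60):          # mini-batch size 10
        W, velocity = cd1_update(W, batch, velocity=velocity)
    recon = sigmoid(sigmoid(data @ W) @ W.T)  # one-step reconstruction
    errors.append(np.mean((data - recon) ** 2))

plt.plot(range(1, 11), errors)
plt.xlabel("epoch")
plt.ylabel("reconstruction error")
plt.show()
```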