2.9. Neural network models (unsupervised)

2.9.1. Restricted Boltzmann machines

Restricted Boltzmann machines (RBM) are unsupervised nonlinear feature learners based on a probabilistic model. The features extracted by an RBM or a hierarchy of RBMs often give good results when fed into a linear classifier such as a linear SVM or a perceptron.

The model makes assumptions regarding the distribution of inputs. At the moment, scikit-learn only provides BernoulliRBM, which assumes the inputs are either binary values or values between 0 and 1, each encoding the probability that the specific feature would be turned on.

The RBM tries to maximize the likelihood of the data using a particular graphical model. The parameter learning algorithm used (Stochastic Maximum Likelihood) prevents the representations from straying far from the input data, which makes them capture interesting regularities, but makes the model less useful for small datasets, and usually not useful for density estimation.

The method gained popularity for initializing deep neural networks with the weights of independent RBMs. This method is known as unsupervised pre-training.
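A minimal sketch of the feature-extraction pipeline described above, with RBM features fed into a linear classifier. The scaler, hyperparameter values and the use of the digits dataset are illustrative choices, not tuned settings:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

# Scale pixel intensities into [0, 1], as BernoulliRBM expects.
# The RBM extracts features; the logistic regression classifies them.
rbm_features_classifier = Pipeline([
    ("scaler", MinMaxScaler()),
    ("rbm", BernoulliRBM(n_components=100, learning_rate=0.06,
                         n_iter=10, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])

rbm_features_classifier.fit(X, y)
print(rbm_features_classifier.score(X, y))
```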

[Figure: sphx_glr_plot_rbm_logistic_classification_0011.png — RBM features combined with a logistic regression classifier on digit data]

Examples:

- Restricted Boltzmann Machine features for digit classification

2.9.1.1. Graphical model and parametrization

The graphical model of an RBM is a fully-connected bipartite graph.

[Figure: rbm_graph.png — bipartite graph of visible and hidden units]

The nodes are random variables whose states depend on the state of the other nodes they are connected to. The model is therefore parameterized by the weights of the connections, as well as one intercept (bias) term for each visible and hidden unit, omitted from the image for simplicity.

The energy function measures the quality of a joint assignment:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i \sum_j w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j c_j h_j$$

In the formula above, $\mathbf{b}$ and $\mathbf{c}$ are the intercept vectors for the visible and hidden layers, respectively. The joint probability of the model is defined in terms of the energy:

$$P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}$$
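As a toy illustration of this parametrization, the energy and the corresponding joint probability can be computed directly for a very small binary model. The weights `W` and intercepts `b`, `c` below are made-up values, and the brute-force computation of the normalizing constant `Z` is only feasible at this tiny size:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # weights w_ij
b = np.zeros(n_visible)  # intercepts of the visible units
c = np.zeros(n_hidden)   # intercepts of the hidden units

def energy(v, h):
    """E(v, h) = -v^T W h - b^T v - c^T h."""
    return -v @ W @ h - b @ v - c @ h

# Normalizing constant Z: sum over every joint binary configuration.
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in itertools.product([0, 1], repeat=n_visible)
        for h in itertools.product([0, 1], repeat=n_hidden))

v = np.array([1, 0, 1])
h = np.array([0, 1])
print("P(v, h) =", np.exp(-energy(v, h)) / Z)
```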

The word restricted refers to the bipartite structure of the model, which prohibits direct interaction between hidden units, or between visible units. This means that the following conditional independencies are assumed:

$$h_i \perp h_j \mid \mathbf{v}$$
$$v_i \perp v_j \mid \mathbf{h}$$

The bipartite structure allows for the use of efficient block Gibbs sampling for inference.
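In scikit-learn, one such block Gibbs step is exposed by BernoulliRBM.gibbs, which resamples the whole hidden layer given the visible units and then the whole visible layer given the hidden units. The toy data below is arbitrary and only serves to have a fitted model:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Arbitrary binary toy data.
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

rbm = BernoulliRBM(n_components=2, n_iter=20, random_state=0)
rbm.fit(X)

# One block Gibbs step: sample h ~ P(h | v), then v ~ P(v | h).
v = np.array([[0, 1, 1]])
v_new = rbm.gibbs(v)
print(v_new)
```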

2.9.1.2. Bernoulli Restricted Boltzmann machines

In the BernoulliRBM, all units are binary stochastic units. This means that the input data should either be binary, or real-valued between 0 and 1 signifying the probability that the visible unit would turn on or off. This is a good model for character recognition, where the interest is on which pixels are active and which aren't. For images of natural scenes it no longer fits because of background, depth and the tendency of neighbouring pixels to take the same values.

The conditional probability distribution of each unit is given by the logistic sigmoid activation function of the input it receives:

$$P(v_i = 1 \mid \mathbf{h}) = \sigma\left(\sum_j w_{ij} h_j + b_i\right)$$
$$P(h_j = 1 \mid \mathbf{v}) = \sigma\left(\sum_i w_{ij} v_i + c_j\right)$$

where $\sigma$ is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
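A NumPy sketch of these two conditionals; the parameter names `W`, `b`, `c` follow the formulas above, and their values here are arbitrary illustrative ones:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))  # shape (n_visible, n_hidden)
b = np.zeros(6)  # visible intercepts
c = np.zeros(4)  # hidden intercepts

v = rng.integers(0, 2, size=6)  # a binary visible configuration
h = rng.integers(0, 2, size=4)  # a binary hidden configuration

p_h_given_v = sigmoid(v @ W + c)  # P(h_j = 1 | v) for every hidden unit
p_v_given_h = sigmoid(W @ h + b)  # P(v_i = 1 | h) for every visible unit
print(p_h_given_v, p_v_given_h)
```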

2.9.1.3. Stochastic Maximum Likelihood learning

The training algorithm implemented in BernoulliRBM is known as Stochastic Maximum Likelihood (SML) or Persistent Contrastive Divergence (PCD). Optimizing maximum likelihood directly is infeasible because of the form of the data likelihood:

$$\log P(\mathbf{v}) = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \log \sum_{\mathbf{x}, \mathbf{y}} e^{-E(\mathbf{x}, \mathbf{y})}$$

For simplicity the equation above is written for a single training example. The gradient with respect to the weights is formed of two terms corresponding to the ones above. They are usually known as the positive gradient and the negative gradient, because of their respective signs. In this implementation, the gradients are estimated over mini-batches of samples.
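Written out for the weights, these two terms are the familiar data-dependent and model-dependent expectations (a standard result, stated here for completeness):

$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$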

In maximizing the log-likelihood, the positive gradient makes the model prefer hidden states that are compatible with the observed training data. Because of the bipartite structure of RBMs, it can be computed efficiently. The negative gradient, however, is intractable. Its goal is to lower the energy of joint states that the model prefers, therefore making it stay true to the data. It can be approximated by Markov chain Monte Carlo using block Gibbs sampling by iteratively sampling each of $\mathbf{v}$ and $\mathbf{h}$ given the other, until the chain mixes. Samples generated in this way are sometimes referred to as fantasy particles. This is inefficient and it is difficult to determine whether the Markov chain mixes.

The Contrastive Divergence method suggests stopping the chain after a small number of iterations, $k$, usually even 1. This method is fast and has low variance, but the samples are far from the model distribution.

Persistent Contrastive Divergence addresses this. Instead of starting a new chain each time the gradient is needed, and performing only one Gibbs sampling step, in PCD we keep a number of chains (fantasy particles) that are updated $k$ Gibbs steps after each weight update. This allows the particles to explore the space more thoroughly.
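A schematic NumPy sketch of one PCD weight update under these definitions. The learning rate, the number of chains, the single Gibbs step per update and the helper names are all illustrative choices; scikit-learn's actual implementation lives inside BernoulliRBM.fit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden, n_chains = 6, 4, 10
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

# Persistent fantasy particles: kept across updates instead of
# restarting a Gibbs chain at the data every time.
v_fantasy = rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)

def pcd_update(v_data, lr=0.05, k=1):
    global W, b, c, v_fantasy
    # Positive phase: hidden probabilities given the observed mini-batch.
    h_data = sigmoid(v_data @ W + c)
    # Negative phase: advance the persistent chains by k block Gibbs steps.
    for _ in range(k):
        h_fantasy = rng.random((n_chains, n_hidden)) < sigmoid(v_fantasy @ W + c)
        v_fantasy = (rng.random((n_chains, n_visible))
                     < sigmoid(h_fantasy @ W.T + b)).astype(float)
    h_model = sigmoid(v_fantasy @ W + c)
    # Gradient = <v h>_data - <v h>_model, averaged over batch and chains.
    W += lr * (v_data.T @ h_data / len(v_data) - v_fantasy.T @ h_model / n_chains)
    b += lr * (v_data.mean(axis=0) - v_fantasy.mean(axis=0))
    c += lr * (h_data.mean(axis=0) - h_model.mean(axis=0))

v_batch = rng.integers(0, 2, size=(8, n_visible)).astype(float)
pcd_update(v_batch)
```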
