1.17. Neural network models (supervised)

Warning

This implementation is not intended for large-scale applications. In particular, scikit-learn offers no GPU support. For much faster, GPU-based implementations, as well as frameworks offering much more flexibility to build deep learning architectures, see Related Projects.

1.17.1. Multi-layer Perceptron

Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function \(f(\cdot): R^m \rightarrow R^o\) by training on a dataset, where \(m\) is the number of dimensions for input and \(o\) is the number of dimensions for output. Given a set of features \(X = \{x_1, x_2, ..., x_m\}\) and a target \(y\), it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression, in that between the input and the output layer, there can be one or more non-linear layers, called hidden layers. Figure 1 shows a one hidden layer MLP with scalar output.

Figure 1 : One hidden layer MLP. (../_images/multilayerperceptron_network.png)

The leftmost layer, known as the input layer, consists of a set of neurons \(\{x_i | x_1, x_2, ..., x_m\}\) representing the input features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation \(w_1x_1 + w_2x_2 + ... + w_mx_m\), followed by a non-linear activation function \(g(\cdot): R \rightarrow R\) - like the hyperbolic tan function. The output layer receives the values from the last hidden layer and transforms them into output values.

The module contains the public attributes coefs_ and intercepts_. coefs_ is a list of weight matrices, where the weight matrix at index \(i\) represents the weights between layer \(i\) and layer \(i+1\). intercepts_ is a list of bias vectors, where the vector at index \(i\) represents the bias values added to layer \(i+1\).
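For example, the shapes of these attributes mirror the layer sizes. A minimal sketch on the same toy data reused in the classification example below:

from sklearn.neural_network import MLPClassifier

# two hidden layers with 5 and 2 neurons; the input has 2 features, the output is one neuron
clf = MLPClassifier(hidden_layer_sizes=(5, 2), solver='lbfgs', random_state=1)
clf.fit([[0., 0.], [1., 1.]], [0, 1])

print([coef.shape for coef in clf.coefs_])       # [(2, 5), (5, 2), (2, 1)]
print([bias.shape for bias in clf.intercepts_])  # [(5,), (2,), (1,)]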

The advantages of Multi-layer Perceptron are:

  • Capability to learn non-linear models.

  • Capability to learn models in real-time (on-line learning) using partial_fit.

The disadvantages of Multi-layer Perceptron (MLP) include:

  • MLP with hidden layers has a non-convex loss function with more than one local minimum. Different random weight initializations can therefore lead to different validation accuracy.

  • MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.

  • MLP is sensitive to feature scaling.

Please see the Tips on Practical Use section, which addresses some of these disadvantages.

1.17.2. Classification

Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using Backpropagation.

MLP trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as floating point feature vectors; and array y of size (n_samples,), which holds the target values (class labels) for the training samples:

>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(5, 2), random_state=1)
...
>>> clf.fit(X, y)
MLPClassifier(alpha=1e-05, hidden_layer_sizes=(5, 2), random_state=1,
              solver='lbfgs')

After fitting (training), the model can predict labels for new samples:

>>> clf.predict([[2., 2.], [-1., -2.]])
array([1, 0])

MLP can fit a non-linear model to the training data. clf.coefs_ contains the weight matrices that constitute the model parameters:

>>> [coef.shape for coef in clf.coefs_]
[(2, 5), (5, 2), (2, 1)]

Currently, MLPClassifier supports only the Cross-Entropy loss function, which allows probability estimates by running the predict_proba method.

MLP trains using Backpropagation. More precisely, it trains using some form of gradient descent and the gradients are calculated using Backpropagation. For classification, it minimizes the Cross-Entropy loss function, giving a vector of probability estimates \(P(y|x)\) per sample \(x\):

>>> clf.predict_proba([[2., 2.], [1., 2.]])
array([[1.967...e-04, 9.998...e-01],
       [1.967...e-04, 9.998...e-01]])

MLPClassifier supports multi-class classification by applying Softmax as the output function.

Further, the model supports multi-label classification in which a sample can belong to more than one class. For each class, the raw output passes through the logistic function. Values larger than or equal to 0.5 are rounded to 1, otherwise to 0. For a predicted output of a sample, the indices where the value is 1 represent the assigned classes of that sample:

>>> X = [[0., 0.], [1., 1.]]
>>> y = [[0, 1], [1, 1]]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(15,), random_state=1)
...
>>> clf.fit(X, y)
MLPClassifier(alpha=1e-05, hidden_layer_sizes=(15,), random_state=1,
              solver='lbfgs')
>>> clf.predict([[1., 2.]])
array([[1, 1]])
>>> clf.predict([[0., 0.]])
array([[0, 1]])

See the examples below and the docstring of MLPClassifier.fit for further information.

Examples:

1.17.3. Regression

Class MLPRegressor implements a multi-layer perceptron (MLP) that trains using backpropagation with no activation function in the output layer, which can also be seen as using the identity function as activation function. Therefore, it uses the square error as the loss function, and the output is a set of continuous values.

MLPRegressor also supports multi-output regression, in which a sample can have more than one target.
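A minimal usage sketch on toy data (values are for illustration only):

from sklearn.neural_network import MLPRegressor

X = [[0., 0.], [1., 1.], [2., 2.]]
y = [0., 1., 2.]                      # one continuous target per sample
reg = MLPRegressor(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(10,), random_state=1, max_iter=1000)
reg.fit(X, y)
print(reg.predict([[1.5, 1.5]]))      # a continuous value, not a class label

For multi-output regression, y is simply passed as an array of shape (n_samples, n_targets).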

1.17.4. Regularization

Both MLPRegressor and MLPClassifier use the parameter alpha for the regularization (L2 regularization) term, which helps avoid overfitting by penalizing weights with large magnitudes. The following plot displays a varying decision function for different values of alpha.

../_images/sphx_glr_plot_mlp_alpha_0011.png
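As a rough illustration (a sketch, not the code behind the plot above), larger values of alpha shrink the learned weights, which tends to smooth the decision function:

from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
y = [0, 1, 1, 0]                                  # XOR-like toy data
for alpha in (1e-5, 1e-1, 10.0):
    clf = MLPClassifier(solver='lbfgs', alpha=alpha,
                        hidden_layer_sizes=(10,), random_state=1, max_iter=2000)
    clf.fit(X, y)
    largest_weight = max(abs(w).max() for w in clf.coefs_)
    print(alpha, largest_weight)                  # the largest weight typically shrinks as alpha grows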

See the examples below for further information.

Examples:

1.17.5. Algorithms

MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.

\[w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w} + \frac{\partial Loss}{\partial w})\]

where \(\eta\) is the learning rate which controls the step-size in the parameter space search. \(Loss\) is the loss function used for the network.

More details can be found in the documentation of SGD.
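In scikit-learn, the solver and its step-size behaviour are chosen through constructor parameters; for example (a sketch, parameter values are arbitrary):

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='sgd',
                    learning_rate_init=0.01,   # initial learning rate (eta above)
                    learning_rate='adaptive',  # reduce the rate when the loss stops improving
                    momentum=0.9,
                    nesterovs_momentum=True,
                    alpha=1e-4,                # strength of the L2 penalty (alpha above)
                    max_iter=500)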

Adam is similar to SGD in the sense that it is a stochastic optimizer, but it can automatically adjust the amount to update parameters based on adaptive estimates of lower-order moments.

With SGD or Adam, training supports online and mini-batch learning.
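A minimal sketch of mini-batch (incremental) training with partial_fit on synthetic data; the chunking is only illustrative:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)          # toy labels

clf = MLPClassifier(solver='adam', random_state=1)

# all classes must be declared on the first call to partial_fit
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])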

L-BFGS is a solver that approximates the Hessian matrix, which represents the second-order partial derivatives of a function. Further, it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation uses the Scipy version of L-BFGS.

If the selected solver is ‘L-BFGS’, training does not support online nor mini-batch learning.

1.17.6. Complexity

Suppose there are \(n\) training samples, \(m\) features, \(k\) hidden layers, each containing \(h\) neurons - for simplicity, and \(o\) output neurons. The time complexity of backpropagation is \(O(n \cdot m \cdot h^k \cdot o \cdot i)\), where \(i\) is the number of iterations. Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and few hidden layers for training.
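For example (illustrative numbers only), with \(n = 1000\) samples, \(m = 20\) features, \(k = 2\) hidden layers of \(h = 10\) neurons each, a single output neuron \(o = 1\) and \(i = 100\) iterations, the bound above is on the order of \(1000 \cdot 20 \cdot 10^2 \cdot 1 \cdot 100 = 2 \times 10^8\) operations; doubling the width \(h\) multiplies this by 4, while adding another hidden layer multiplies it by a further factor of \(h\).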

1.17.7. Mathematical formulation

Given a set of training examples \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) where \(x_i \in \mathbf{R}^n\) and \(y_i \in \{0, 1\}\), a one hidden layer one hidden neuron MLP learns the function \(f(x) = W_2 g(W_1^T x + b_1) + b_2\) where \(W_1 \in \mathbf{R}^m\) and \(W_2, b_1, b_2 \in \mathbf{R}\) are model parameters. \(W_1, W_2\) represent the weights of the input layer and hidden layer, respectively; and \(b_1, b_2\) represent the bias added to the hidden layer and the output layer, respectively. \(g(\cdot): R \rightarrow R\) is the activation function, set by default as the hyperbolic tan. It is given as,

\[g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]

For binary classification, \(f(x)\) passes through the logistic function \(g(z) = 1/(1 + e^{-z})\) to obtain output values between zero and one. A threshold, set to 0.5, would assign samples with outputs larger than or equal to 0.5 to the positive class, and the rest to the negative class.

If there are more than two classes, \(f(x)\) itself would be a vector of size (n_classes,). Instead of passing through the logistic function, it passes through the softmax function, which is written as,

\[\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^K \exp(z_l)}\]

where \(z_i\) represents the \(i\)-th element of the input to softmax, which corresponds to class \(i\), and \(K\) is the number of classes. The result is a vector containing the probabilities that sample \(x\) belongs to each class. The output is the class with the highest probability.
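The same computation written out in NumPy (an illustrative sketch, not scikit-learn's internal code):

import numpy as np

def softmax(z):
    # subtracting the maximum does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # raw outputs for three classes
probs = softmax(z)                   # probabilities that sum to 1
predicted_class = probs.argmax()     # class with the highest probability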

In regression, the output remains as \(f(x)\); therefore, the output activation function is just the identity function.

MLP uses different loss functions depending on the problem type. The loss function for classification is Cross-Entropy, which in the binary case is given as,

\[Loss(\hat{y}, y, W) = -y \ln {\hat{y}} - (1 - y) \ln{(1 - \hat{y})} + \alpha ||W||_2^2\]

where \(\alpha ||W||_2^2\) is an L2-regularization term (aka penalty) that penalizes complex models; and \(\alpha > 0\) is a non-negative hyperparameter that controls the magnitude of the penalty.

For regression, MLP uses the Square Error loss function; written as,

\[Loss(\hat{y}, y, W) = \frac{1}{2}||\hat{y} - y||_2^2 + \frac{\alpha}{2} ||W||_2^2\]
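A direct NumPy transcription of the two losses above (a sketch; scikit-learn's implementation additionally averages over the samples in a batch):

import numpy as np

def cross_entropy(y, y_hat, weights, alpha):
    # binary cross-entropy for one sample plus the L2 penalty alpha * ||W||^2
    penalty = alpha * sum(np.sum(W ** 2) for W in weights)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat) + penalty

def squared_error(y, y_hat, weights, alpha):
    # 1/2 * ||y_hat - y||^2 plus alpha/2 * ||W||^2
    penalty = 0.5 * alpha * sum(np.sum(W ** 2) for W in weights)
    return 0.5 * np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2) + penalty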

Starting from initial random weights, multi-layer perceptron (MLP) minimizesthe loss function by repeatedly updating these weights. After computing theloss, a backward pass propagates it from the output layer to the previouslayers, providing each weight parameter with an update value meant to decreasethe loss.

In gradient descent, the gradient \(\nabla Loss_{W}\) of the loss with respect to the weights is computed and deducted from \(W\). More formally, this is expressed as,

\[W^{i+1} = W^i - \epsilon \nabla {Loss}_{W}^{i}\]

where \(i\) is the iteration step, and \(\epsilon\) is the learning rate with a value larger than 0.

The algorithm stops when it reaches a preset maximum number of iterations; orwhen the improvement in loss is below a certain, small number.
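As a schematic illustration of this loop (a toy quadratic loss stands in for the network loss; in the real solver the gradient comes from backpropagation):

import numpy as np

def loss(W):                 # toy loss with its minimum at W = 3
    return np.sum((W - 3.0) ** 2)

def grad(W):                 # its gradient
    return 2.0 * (W - 3.0)

W = np.zeros(5)              # initial weights
epsilon = 0.1                # learning rate
tol = 1e-6                   # minimum required improvement in the loss
previous = loss(W)
for step in range(1000):                 # preset maximum number of iterations
    W = W - epsilon * grad(W)            # W^{i+1} = W^i - epsilon * grad(Loss)
    current = loss(W)
    if previous - current < tol:         # improvement too small: stop
        break
    previous = current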

1.17.8. Tips on Practical Use

  • Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that you must apply the same scaling to the test set for meaningful results. You can use StandardScaler for standardization.

    >>> from sklearn.preprocessing import StandardScaler  # doctest: +SKIP
    >>> scaler = StandardScaler()  # doctest: +SKIP
    >>> # Don't cheat - fit only on training data
    >>> scaler.fit(X_train)  # doctest: +SKIP
    >>> X_train = scaler.transform(X_train)  # doctest: +SKIP
    >>> # apply same transformation to test data
    >>> X_test = scaler.transform(X_test)  # doctest: +SKIP

    An alternative and recommended approach is to use StandardScaler in a Pipeline.

  • Finding a reasonable regularization parameter \(\alpha\) is best done using GridSearchCV, usually in the range 10.0 ** -np.arange(1, 7) (a sketch combining this with the Pipeline approach above follows this list).

  • Empirically, we observed that L-BFGS converges faster and with better solutions on small datasets. For relatively large datasets, however, Adam is very robust. It usually converges quickly and gives pretty good performance. SGD with momentum or Nesterov's momentum, on the other hand, can perform better than those two algorithms if the learning rate is correctly tuned.
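A sketch combining two of the recommendations above, standardization in a Pipeline and a grid search over alpha (synthetic data from make_classification; values are illustrative only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(solver='lbfgs', max_iter=1000, random_state=1))

# the scaler is refit on each training split, so the grid search does not leak test data
param_grid = {'mlpclassifier__alpha': 10.0 ** -np.arange(1, 7)}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)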

1.17.9. More control with warm_start

If you want more control over stopping criteria or learning rate in SGD, or want to do additional monitoring, using warm_start=True and max_iter=1 and iterating yourself can be helpful:

>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1, max_iter=1, warm_start=True)
>>> for i in range(10):
...     clf.fit(X, y)
...     # additional monitoring / inspection
MLPClassifier(...
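For example, the current training loss can be inspected after each call (a sketch; ConvergenceWarnings are expected because max_iter=1 deliberately stops every fit after a single iteration):

import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1,
                    max_iter=1, warm_start=True)

with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    for i in range(10):
        clf.fit(X, y)
        print(i, clf.loss_)   # loss_ holds the loss value after the most recent fit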
