1.6. Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).

Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

The classes in sklearn.neighbors can handle either NumPy arrays or scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.

There are many learning routines which rely on nearest neighbors at their core. One example is kernel density estimation, discussed in the density estimation section.

1.6.1. Unsupervised Nearest Neighbors

NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. For a discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.

Warning

Regarding the Nearest Neighbors algorithms, if two neighbors $k+1$ and $k$ have identical distances but different labels, the result will depend on the ordering of the training data.

1.6.1.1. Finding the Nearest Neighbors

For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms within sklearn.neighbors can be used:

    >>> from sklearn.neighbors import NearestNeighbors
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
    >>> distances, indices = nbrs.kneighbors(X)
    >>> indices
    array([[0, 1],
           [1, 0],
           [2, 1],
           [3, 4],
           [4, 3],
           [5, 4]]...)
    >>> distances
    array([[0.        , 1.        ],
           [0.        , 1.        ],
           [0.        , 1.41421356],
           [0.        , 1.        ],
           [0.        , 1.        ],
           [0.        , 1.41421356]])

Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of zero.

It is also possible to efficiently produce a sparse graph showing the connections between neighboring points:

    >>> nbrs.kneighbors_graph(X).toarray()
    array([[1., 1., 0., 0., 0., 0.],
           [1., 1., 0., 0., 0., 0.],
           [0., 1., 1., 0., 0., 0.],
           [0., 0., 0., 1., 1., 0.],
           [0., 0., 0., 1., 1., 0.],
           [0., 0., 0., 0., 1., 1.]])

The dataset is structured such that points nearby in index order are nearby in parameter space, leading to an approximately block-diagonal matrix of K-nearest neighbors. Such a sparse graph is useful in a variety of circumstances which make use of spatial relationships between points for unsupervised learning: in particular, see sklearn.manifold.Isomap, sklearn.manifold.LocallyLinearEmbedding, and sklearn.cluster.SpectralClustering.

1.6.1.2. KDTree and BallTree Classes

Alternatively, one can use the KDTree or BallTree classes directly to find nearest neighbors. This is the functionality wrapped by the NearestNeighbors class used above. The Ball Tree and KD Tree have the same interface; we’ll show an example of using the KD Tree here:

    >>> from sklearn.neighbors import KDTree
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> kdt = KDTree(X, leaf_size=30, metric='euclidean')
    >>> kdt.query(X, k=2, return_distance=False)
    array([[0, 1],
           [1, 0],
           [2, 1],
           [3, 4],
           [4, 3],
           [5, 4]]...)

Refer to the KDTree and BallTree class documentation for more information on the options available for nearest neighbors searches, including specification of query strategies, distance metrics, etc. For a list of available metrics, see the documentation of the DistanceMetric class.
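Since the two tree classes share an interface, the KD Tree example should carry over to the Ball Tree unchanged. The following sketch reuses the toy array X from above and adds a radius query; the radius of 1.5 is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Build both trees with identical settings.
ball = BallTree(X, leaf_size=30, metric='euclidean')
kd = KDTree(X, leaf_size=30, metric='euclidean')

# k-nearest-neighbor query: same call signature on both classes.
ball_dist, ball_ind = ball.query(X, k=2)
kd_dist, kd_ind = kd.query(X, k=2)

# Radius query: indices of all points within r of the first sample.
# r=1.5 is an illustrative value, not a recommended default.
within = ball.query_radius(X[:1], r=1.5)
print(sorted(within[0].tolist()))
```

Both trees perform exact searches, so on this data they return the same neighbors and distances; they differ only in how the search is organised internally.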

1.6.2. Nearest Neighbors Classification

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius $r$ of each training point, where $r$ is a floating-point value specified by the user.

The $k$-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value $k$ is highly data-dependent: in general a larger $k$ suppresses the effects of noise, but makes the classification boundaries less distinct.

In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius $r$, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”.

The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.
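The effect of the weights keyword can be seen on a tiny one-dimensional problem. In the sketch below the data, the query point and the inverse_square helper are all made up for illustration; the only library behavior relied on is that a callable passed as weights receives an array of distances and returns an array of weights of the same shape.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (an assumption for illustration): one class-0 point very close
# to the query, two class-1 points further away.
X = np.array([[0.0], [1.0], [1.2]])
y = np.array([0, 1, 1])
query = np.array([[0.1]])


def inverse_square(distances):
    """Hypothetical user-defined weighting: 1 / d**2 (with a small offset)."""
    return 1.0 / (distances ** 2 + 1e-12)


uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
custom = KNeighborsClassifier(n_neighbors=3, weights=inverse_square).fit(X, y)

# Uniform vote: two class-1 neighbors outvote the single class-0 neighbor.
print(uniform.predict(query))
# Inverse-distance vote: 1/0.1 for class 0 outweighs 1/0.9 + 1/1.1 for class 1.
print(distance.predict(query))
# Custom vote: the same ordering, weighted even more strongly by proximity.
print(custom.predict(query))
```

With uniform weights the majority class wins, whereas both distance-based weightings let the single very close neighbor decide the label.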


1.6.3. Nearest Neighbors Regression

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.

scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius $r$ of the query point, where $r$ is a floating-point value specified by the user.

The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes uniformly to the prediction for a query point. Under some circumstances, it can be advantageous to weight points such that nearby points contribute more to the regression than faraway points. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns equal weights to all points. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied, which will be used to compute the weights.
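As a small worked example, the sketch below compares the two built-in weightings with a hand computation. The data and the query point are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data (an assumption for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
query = np.array([[1.25]])

uni = KNeighborsRegressor(n_neighbors=2, weights='uniform').fit(X, y)
dist = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X, y)

# Hand computation: the two nearest neighbors of 1.25 are x=1 and x=2.
d = np.abs(X[:, 0] - query[0, 0])
nearest = np.argsort(d)[:2]

# Uniform weights: plain mean of the neighbor labels, (1 + 4) / 2.
manual_uniform = y[nearest].mean()

# Distance weights: mean weighted by 1/d, here 1/0.25 and 1/0.75.
w = 1.0 / d[nearest]
manual_weighted = np.sum(w * y[nearest]) / np.sum(w)

print(uni.predict(query)[0], manual_uniform)
print(dist.predict(query)[0], manual_weighted)
```

The distance-weighted prediction is pulled towards the label of the closer neighbor, which is the behavior described above.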


The use of multi-output nearest neighbors for regression is demonstrated in Face completion with multi-output estimators. In this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of the lower half of those faces.


1.6.4. Nearest Neighbor Algorithms

1.6.4.1. Brute Force

Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for $N$ samples in $D$ dimensions, this approach scales as $O[D N^2]$. Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples $N$ grows, the brute-force approach quickly becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.pairwise.
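The quadratic cost is easy to see when the search is written out by hand. The sketch below is a plain NumPy illustration of the idea, not scikit-learn's implementation; the random data, the sizes and the variable names are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(50, 3)   # N = 50 samples in D = 3 dimensions
k = 4

# All pairwise squared distances: N * N pairs, each costing O(D) work.
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
sq_dist = np.sum(diff ** 2, axis=-1)

# For each row, the indices of the k smallest distances
# (each point is its own nearest neighbor at distance zero).
manual_ind = np.argsort(sq_dist, axis=1)[:, :k]

# The same search through the estimator interface.
nbrs = NearestNeighbors(n_neighbors=k, algorithm='brute').fit(X)
_, sk_ind = nbrs.kneighbors(X)

print(manual_ind[:3])
print(sk_ind[:3])
```

The intermediate array diff holds N x N x D values, which is why this direct formulation quickly becomes impractical as N grows.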

1.6.4.2. K-D Tree

To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. The basic idea is that if point $A$ is very distant from point $B$, and point $B$ is very close to point $C$, then we know that points $A$ and $C$ are very distant, without having to explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to $O[D N \log(N)]$ or better. This is a significant improvement over brute-force for large $N$.

An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-dimensional tree), which generalizes two-dimensional Quad-trees and three-dimensional Oct-trees to an arbitrary number of dimensions. The KD tree is a binary tree structure which recursively partitions the parameter space along the data axes, dividing it into nested orthotropic regions into which data points are filed. The construction of a KD tree is very fast: because partitioning is performed only along the data axes, no $D$-dimensional distances need to be computed. Once constructed, the nearest neighbor of a query point can be determined with only $O[\log(N)]$ distance computations. Though the KD tree approach is very fast for low-dimensional ($D < 20$) neighbors searches, it becomes inefficient as $D$ grows very large: this is one manifestation of the so-called “curse of dimensionality”. In scikit-learn, KD tree neighbors searches are specified using the keyword algorithm = 'kd_tree', and are computed using the class KDTree.


1.6.4.3. Ball Tree

To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly structured data, even in very high dimensions.

A ball tree recursively divides the data into nodes defined by a centroid $C$ and radius $r$, such that each point in the node lies within the hyper-sphere defined by $r$ and $C$. The number of candidate points for a neighbor search is reduced through use of the triangle inequality:

$$|x + y| \leq |x| + |y|$$

With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes, it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword algorithm = 'ball_tree', and are computed using the class sklearn.neighbors.BallTree. Alternatively, the user can work with the BallTree class directly.
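The bound can be checked numerically. The sketch below builds a single hypothetical node from random points and verifies that one distance to the centroid brackets the distance to every point in the node; it illustrates the triangle-inequality argument only and does not reproduce the internals of BallTree.

```python
import numpy as np

rng = np.random.RandomState(0)

# One hypothetical node: 20 points in 5 dimensions, with the centroid
# and radius of the hyper-sphere that encloses them.
points = rng.randn(20, 5)
centroid = points.mean(axis=0)
radius = np.max(np.linalg.norm(points - centroid, axis=1))

# A query point placed well outside the node.
query = rng.randn(5) + 10.0

# A single distance computation to the centroid ...
d_centroid = np.linalg.norm(query - centroid)

# ... bounds the distance to every point in the node.
lower = max(d_centroid - radius, 0.0)
upper = d_centroid + radius

# Exact distances, computed here only to check the bounds.
true = np.linalg.norm(points - query, axis=1)
print(lower, true.min(), true.max(), upper)
```

If the lower bound already exceeds the distance to the best candidate found so far, the whole node can be skipped without computing any of the per-point distances.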


1.6.4.4. Choice of Nearest Neighbors Algorithm

The optimal algorithm for a given dataset is a complicated choice, and depends on a number of factors:

  • number of samples $N$ (i.e. n_samples) and dimensionality $D$ (i.e. n_features).

      • Brute force query time grows as $O[D N]$

      • Ball tree query time grows as approximately $O[D \log(N)]$

      • KD tree query time changes with $D$ in a way that is difficult to precisely characterise. For small $D$ (less than 20 or so) the cost is approximately $O[D \log(N)]$, and the KD tree query can be very efficient. For larger $D$, the cost increases to nearly $O[D N]$, and the overhead due to the tree structure can lead to queries which are slower than brute force.

    For small data sets ($N$ less than 30 or so), $\log(N)$ is comparable to $N$, and brute force algorithms can be more efficient than a tree-based approach. Both KDTree and BallTree address this through providing a leaf size parameter: this controls the number of samples at which a query switches to brute-force. This allows both algorithms to approach the efficiency of a brute-force computation for small $N$.

  • data structure: intrinsic dimensionality of the data and/or sparsity of the data. Intrinsic dimensionality refers to the dimension $d \le D$ of a manifold on which the data lies, which can be linearly or non-linearly embedded in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is to be distinguished from the concept as used in “sparse” matrices. The data matrix may have no zero entries, but the structure can still be “sparse” in this sense).

      • Brute force query time is unchanged by data structure.

      • Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a smaller intrinsic dimensionality leads to faster query times. Because the KD tree internal representation is aligned with the parameter axes, it will not generally show as much improvement as ball tree for arbitrarily structured data.

    Datasets used in machine learning tend to be very structured, and are very well-suited for tree-based queries.

  • number of neighbors $k$ requested for a query point.

      • Brute force query time is largely unaffected by the value of $k$

      • Ball tree and KD tree query time will become slower as $k$ increases. This is due to two effects: first, a larger $k$ leads to the necessity to search a larger portion of the parameter space. Second, using $k > 1$ requires internal queueing of results as the tree is traversed.

    As $k$ becomes large compared to $N$, the ability to prune branches in a tree-based query is reduced. In this situation, brute force queries can be more efficient.

  • number of query points. Both the ball tree and the KD Tree require a construction phase. The cost of this construction becomes negligible when amortized over many queries. If only a small number of queries will be performed, however, the construction can make up a significant fraction of the total cost. If very few query points will be required, brute force is better than a tree-based method.

Currently, algorithm = 'auto' selects 'brute' if $k \geq N/2$, the input data is sparse, or effective_metric_ isn’t in the VALID_METRICS list for either 'kd_tree' or 'ball_tree'. Otherwise, it selects the first out of 'kd_tree' and 'ball_tree' that has effective_metric_ in its VALID_METRICS list. This choice is based on the assumption that the number of query points is at least the same order as the number of training points, and that leaf_size is close to its default value of 30.

1.6.4.5. Effect of leaf_size

As noted above, for small sample sizes a brute force search can be more efficient than a tree-based query. This fact is accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes. The level of this switch can be specified with the parameter leaf_size. This parameter choice has many effects:

  • construction time

      • A larger leaf_size leads to a faster tree construction time, because fewer nodes need to be created

  • query time

      • Both a large or small leaf_size can lead to suboptimal query cost. For leaf_size approaching 1, the overhead involved in traversing nodes can significantly slow query times. For leaf_size approaching the size of the training set, queries become essentially brute force. A good compromise between these is leaf_size = 30, the default value of the parameter.

  • memory

      • As leaf_size increases, the memory required to store a tree structure decreases. This is especially important in the case of ball tree, which stores a $D$-dimensional centroid for each node. The required storage space for BallTree is approximately 1 / leaf_size times the size of the training set.

leaf_size is not referenced for brute force queries.
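Because both trees perform exact searches, leaf_size should affect only speed and memory, never the neighbors returned. The sketch below checks this on random data; the three leaf sizes and the dataset shape are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(200, 3)

# Query the same points with a very small, the default, and a very
# large leaf_size (the last one makes the whole tree a single leaf).
results = {}
for leaf_size in (1, 30, 200):
    tree = KDTree(X, leaf_size=leaf_size)
    dist, ind = tree.query(X[:10], k=3)
    results[leaf_size] = (dist, ind)

# The neighbors agree across all three settings.
for leaf_size, (dist, ind) in results.items():
    print(leaf_size, ind[0])
```

Timing the loop body for each setting on a realistic dataset is a reasonable way to pick a value when the default of 30 is not satisfactory.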

1.6.5. Nearest Centroid Classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In effect, this makes it similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It also has no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as when classes have drastically different variances, as equal variance in all dimensions is assumed. See Linear Discriminant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) for more complex methods that do not make this assumption. Usage of the default NearestCentroid is simple:

    >>> from sklearn.neighbors import NearestCentroid
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> y = np.array([1, 1, 1, 2, 2, 2])
    >>> clf = NearestCentroid()
    >>> clf.fit(X, y)
    NearestCentroid()
    >>> print(clf.predict([[-0.8, -1]]))
    [1]
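To make the "centroid of its members" description concrete, the sketch below recomputes the class centroids by hand and compares them with the fitted centroids_ attribute. It assumes the default Euclidean metric and no shrinkage; the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = NearestCentroid().fit(X, y)

# Per-class mean of the training samples, in the order of clf.classes_.
manual_centroids = np.array(
    [X[y == label].mean(axis=0) for label in clf.classes_]
)

# Classify a query by picking the class whose centroid is closest.
query = np.array([[-0.8, -1.0]])
distances = np.linalg.norm(manual_centroids - query, axis=1)
manual_label = clf.classes_[np.argmin(distances)]

print(clf.centroids_)
print(manual_centroids)
print(manual_label, clf.predict(query)[0])
```

The fitted model therefore stores only one vector per class, regardless of the size of the training set.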

1.6.5.1. Nearest Shrunken Centroid

The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken centroid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value crosses zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for example, for removing noisy features.

In the example below, using a small shrink threshold increases the accuracy of the model from 0.81 to 0.82.

[Figures: nearest_centroid_1, nearest_centroid_2]


1.6.6. Nearest Neighbors Transformer

Many scikit-learn estimators rely on nearest neighbors: several classifiers and regressors such as KNeighborsClassifier and KNeighborsRegressor, but also some clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.

All these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph. With mode='connectivity', these functions return a binary adjacency sparse graph as required, for instance, in SpectralClustering. With mode='distance', they return a distance sparse graph as required, for instance, in DBSCAN. To include these functions in a scikit-learn pipeline, one can also use the corresponding classes KNeighborsTransformer and RadiusNeighborsTransformer. The benefits of this sparse graph API are multiple.

First, the precomputed graph can be re-used multiple times, for instance while varying a parameter of the estimator. This can be done manually by the user, or using the caching properties of the scikit-learn pipeline:

    >>> from sklearn.manifold import Isomap
    >>> from sklearn.neighbors import KNeighborsTransformer
    >>> from sklearn.pipeline import make_pipeline
    >>> estimator = make_pipeline(
    ...     KNeighborsTransformer(n_neighbors=5, mode='distance'),
    ...     Isomap(metric='precomputed'),
    ...     memory='/path/to/cache')

Second, precomputing the graph can give finer control on the nearest neighbors estimation, for instance enabling multiprocessing through the parameter n_jobs, which might not be available in all estimators.

Finally, the precomputation can be performed by custom estimators to use different implementations, such as approximate nearest neighbors methods, or implementations with special data types. The precomputed neighbors sparse graph needs to be formatted as in radius_neighbors_graph output:

  • a CSR matrix (although COO, CSC or LIL will be accepted).

  • only explicitly store nearest neighborhoods of each sample with respect to the training data. This should include those at 0 distance from a query point, including the matrix diagonal when computing the nearest neighborhoods between the training data and itself.

  • each row’s data should store the distance in increasing order (optional. Unsorted data will be stable-sorted, adding a computational overhead).

  • all values in data should be non-negative.

  • there should be no duplicate indices in any row (see https://github.com/scipy/scipy/issues/5807).

  • if the algorithm being passed the precomputed matrix uses k nearest neighbors (as opposed to radius neighborhood), at least k neighbors must be stored in each row (or k+1, as explained in the following note).

Note

When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against the scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.
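A precomputed graph can be fed to a downstream estimator as follows. This is a minimal sketch on random data, assuming a scikit-learn version that provides KNeighborsTransformer and accepts sparse graphs with metric='precomputed'; the dataset, sizes and variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X_train = rng.randn(60, 4)
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.randn(15, 4)

# The transformer builds a sparse distance graph; the classifier then
# votes among the 3 nearest stored neighbors of each row.
graph_pipe = make_pipeline(
    KNeighborsTransformer(n_neighbors=5, mode='distance'),
    KNeighborsClassifier(n_neighbors=3, metric='precomputed'),
)
graph_pipe.fit(X_train, y_train)

# Reference: the classifier computing its own neighbors.
direct = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Graph between the test points (rows) and the training points (columns).
graph = graph_pipe[0].transform(X_test)
print(graph.shape, graph.nnz)
print(graph_pipe.predict(X_test))
print(direct.predict(X_test))
```

Each row of the graph stores one neighbor more than the requested n_neighbors, in line with the note above, and the downstream classifier simply ignores the neighbors it does not need.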


1.6.7. Neighborhood Components Analysis

Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm which aims to improve the accuracy of nearest neighbors classification compared to the standard Euclidean distance. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of data that can be used for data visualization and fast classification.

[Figures: nca_illustration_1, nca_illustration_2]

In the illustrating figure above, we consider some points from a randomly generated dataset. We focus on the stochastic KNN classification of point no. 3. The thickness of a link between sample 3 and another point is proportional to their distance, and can be seen as the relative weight (or probability) that a stochastic nearest neighbor prediction rule would assign to this point. In the original space, sample 3 has many stochastic neighbors from various classes, so the right class is not very likely. However, in the projected space learned by NCA, the only stochastic neighbors with non-negligible weight are from the same class as sample 3, guaranteeing that the latter will be well classified. See the mathematical formulation for more details.

1.6.7.1. Classification

Combined with a nearest neighbors classifier (KNeighborsClassifier), NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and does not introduce additional parameters that require fine-tuning by the user.

NCA classification has been shown to work well in practice for data sets of varying size and difficulty. In contrast to related methods such as Linear Discriminant Analysis, NCA does not make any assumptions about the class distributions. The nearest neighbor classification can naturally produce highly irregular decision boundaries.

To use this model for classification, one needs to combine a NeighborhoodComponentsAnalysis instance that learns the optimal transformation with a KNeighborsClassifier instance that performs the classification in the projected space. Here is an example using the two classes:

    >>> from sklearn.neighbors import (NeighborhoodComponentsAnalysis,
    ...     KNeighborsClassifier)
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import train_test_split
    >>> from sklearn.pipeline import Pipeline
    >>> X, y = load_iris(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y,
    ...     stratify=y, test_size=0.7, random_state=42)
    >>> nca = NeighborhoodComponentsAnalysis(random_state=42)
    >>> knn = KNeighborsClassifier(n_neighbors=3)
    >>> nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
    >>> nca_pipe.fit(X_train, y_train)
    Pipeline(...)
    >>> print(nca_pipe.score(X_test, y_test))
    0.96190476...

[Figures: nca_classification_1, nca_classification_2]

The plot shows decision boundaries for Nearest Neighbor Classification and Neighborhood Components Analysis classification on the iris dataset, when training and scoring on only two features, for visualisation purposes.

1.6.7.2. Dimensionality reduction

NCA can be used to perform supervised dimensionality reduction. The input data are projected onto a linear subspace consisting of the directions which minimize the NCA objective. The desired dimensionality can be set using the parameter n_components. For instance, the following figure shows a comparison of dimensionality reduction with Principal Component Analysis (sklearn.decomposition.PCA), Linear Discriminant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Neighborhood Component Analysis (NeighborhoodComponentsAnalysis) on the Digits dataset, a dataset with size $n_{samples} = 1797$ and $n_{features} = 64$. The data set is split into a training and a test set of equal size, then standardized. For evaluation the 3-nearest neighbor classification accuracy is computed on the 2-dimensional projected points found by each method. Each data sample belongs to one of 10 classes.

[Figures: nca_dim_reduction_1, nca_dim_reduction_2, nca_dim_reduction_3]


1.6.7.3. Mathematical formulation

The goal of NCA is to learn an optimal linear transformation matrix $L$ of size (n_components, n_features), which maximises the sum over all samples $i$ of the probability $p_i$ that $i$ is correctly classified, i.e.:

$$\underset{L}{\arg\max} \sum\limits_{i=0}^{N - 1} p_{i}$$

with $N$ = n_samples and $p_i$ the probability of sample $i$ being correctly classified according to a stochastic nearest neighbors rule in the learned embedded space:

$$p_{i} = \sum\limits_{j \in C_i} p_{ij}$$

where $C_i$ is the set of points in the same class as sample $i$, and $p_{ij}$ is the softmax over Euclidean distances in the embedded space:

$$p_{ij} = \frac{\exp(-\|L x_i - L x_j\|^2)}{\sum\limits_{k \ne i} \exp(-\|L x_i - L x_k\|^2)}, \quad p_{ii} = 0$$
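The quantities in this formulation can be evaluated directly for a given transformation. The sketch below does so in NumPy for a random, untrained matrix L; it illustrates the objective only and is not the optimisation routine used by NeighborhoodComponentsAnalysis. The data, labels and variable names are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 3)                       # 8 samples, 3 features
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # class labels
L = rng.randn(2, 3)                       # (n_components, n_features)

# Embedded samples L x_i and their pairwise squared distances.
Z = X @ L.T
sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)

# Softmax over negative squared distances, with p_ii = 0.
logits = -sq_dist
np.fill_diagonal(logits, -np.inf)
logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
p_ij = np.exp(logits)
p_ij /= p_ij.sum(axis=1, keepdims=True)

# p_i sums p_ij over the points j in the same class as i.
same_class = y[:, None] == y[None, :]
p_i = np.sum(p_ij * same_class, axis=1)

# The NCA objective is the sum of the p_i.
objective = p_i.sum()
print(p_i)
print(objective)
```

Each row of p_ij is a probability distribution over the other samples, so every p_i lies between 0 and 1 and the objective is bounded above by the number of samples.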

1.6.7.3.1. Mahalanobis distance

NCA can be seen as learning a (squared) Mahalanobis distance metric:

$$\|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j),$$

where $M = L^T L$ is a symmetric positive semi-definite matrix of size (n_features, n_features).
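The identity is straightforward to verify numerically. The sketch below uses a random matrix in place of a learned transformation, so the values themselves carry no meaning; all names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
L = rng.randn(2, 4)            # stand-in for a learned (n_components, n_features) matrix
M = L.T @ L                    # (n_features, n_features)

x_i, x_j = rng.randn(4), rng.randn(4)
diff = x_i - x_j

# Squared Euclidean distance in the embedded space ...
lhs = np.sum((L @ diff) ** 2)
# ... equals the squared Mahalanobis distance in the original space.
rhs = diff @ M @ diff

# M is symmetric with non-negative eigenvalues (up to rounding).
eigvals = np.linalg.eigvalsh(M)
print(lhs, rhs)
print(eigvals)
```

When n_components is smaller than n_features, M has rank at most n_components, which is why it is only semi-definite rather than strictly positive definite.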

1.6.7.4. Implementation

This implementation follows what is explained in the original paper [1]. For the optimisation method, it currently uses scipy’s L-BFGS-B with a full gradient computation at each iteration, to avoid tuning the learning rate and to provide stable learning.

See the examples below and the docstring of NeighborhoodComponentsAnalysis.fit for further information.

1.6.7.5. Complexity

1.6.7.5.1. Training

NCA stores a matrix of pairwise distances, taking n_samples ** 2 memory. Time complexity depends on the number of iterations done by the optimisation algorithm. However, one can set the maximum number of iterations with the argument max_iter. For each iteration, time complexity is O(n_components x n_samples x min(n_samples, n_features)).

1.6.7.5.2. Transform

Here the transform operation returns $L X^T$, therefore its time complexity equals n_components * n_features * n_samples_test. There is no added space complexity in the operation.
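The transform is a single matrix product with the learned transformation, which the estimator exposes as components_. The sketch below checks this on the iris dataset; n_components=2, max_iter=30 and the random seed are arbitrary choices for illustration, and the fit may emit a convergence warning with so few iterations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)

nca = NeighborhoodComponentsAnalysis(n_components=2, max_iter=30,
                                     random_state=0)
nca.fit(X, y)

# Output of the estimator: one embedded row per input sample.
embedded = nca.transform(X)

# The same result as an explicit product with the learned matrix L;
# with samples stored as rows this is X L^T, the transpose of L X^T.
manual = X @ nca.components_.T

print(nca.components_.shape)
print(embedded.shape)
```

The product touches every entry of an (n_samples_test, n_features) array once per component, which matches the complexity stated above.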

References:

Wikipedia entry on Neighborhood Components Analysis