8. Computing with scikit-learn

8.1. Strategies to scale computationally: bigger data

For some applications the amount of examples, features (or both) and/or thespeed at which they need to be processed are challenging for traditionalapproaches. In these cases scikit-learn has a number of options you canconsider to make your system scale.

8.1.1. Scaling with instances using out-of-core learning

Out-of-core (or “external memory”) learning is a technique used to learn fromdata that cannot fit in a computer’s main memory (RAM).

Here is a sketch of a system designed to achieve this goal:

  1. a way to stream instances

  2. a way to extract features from instances

  3. an incremental algorithm

8.1.1.1. Streaming instances

Basically, 1. may be a reader that yields instances from files on ahard drive, a database, from a network stream etc. However,details on how to achieve this are beyond the scope of this documentation.

8.1.1.2. Extracting features

  1. could be any relevant way to extract features among thedifferent feature extraction methods supported byscikit-learn. However, when working with data that needs vectorization andwhere the set of features or values is not known in advance one should takeexplicit care. A good example is text classification where unknown terms arelikely to be found during training. It is possible to use a statefulvectorizer if making multiple passes over the data is reasonable from anapplication point of view. Otherwise, one can turn up the difficulty by usinga stateless feature extractor. Currently the preferred way to do this is touse the so-called hashing trick as implemented bysklearn.feature_extraction.FeatureHasher for datasets with categoricalvariables represented as list of Python dicts orsklearn.feature_extraction.text.HashingVectorizer for text documents.

8.1.1.3. Incremental learning

Finally, for 3. we have a number of options inside scikit-learn. Although notall algorithms can learn incrementally (i.e. without seeing all the instancesat once), all estimators implementing the partial_fit API are candidates.Actually, the ability to learn incrementally from a mini-batch of instances(sometimes called “online learning”) is key to out-of-core learning as itguarantees that at any given time there will be only a small amount ofinstances in the main memory. Choosing a good size for the mini-batch thatbalances relevancy and memory footprint could involve some tuning 1.

Here is a list of incremental estimators for different tasks:

For classification, a somewhat important thing to note is that although astateless feature extraction routine may be able to cope with new/unseenattributes, the incremental learner itself may be unable to cope withnew/unseen targets classes. In this case you have to pass all the possibleclasses to the first partial_fit call using the classes= parameter.

Another aspect to consider when choosing a proper algorithm is that not all ofthem put the same importance on each example over time. Namely, thePerceptron is still sensitive to badly labeled examples even after manyexamples whereas the SGD and PassiveAggressive families are morerobust to this kind of artifacts. Conversely, the latter also tend to give lessimportance to remarkably different, yet properly labeled examples when theycome late in the stream as their learning rate decreases over time.

8.1.1.4. Examples

Finally, we have a full-fledged example ofOut-of-core classification of text documents. It is aimed atproviding a starting point for people wanting to build out-of-core learningsystems and demonstrates most of the notions discussed above.

Furthermore, it also shows the evolution of the performance of differentalgorithms with the number of processed examples.

accuracy_over_time

Now looking at the computation time of the different parts, we see that thevectorization is much more expensive than learning itself. From the differentalgorithms, MultinomialNB is the most expensive, but its overhead can bemitigated by increasing the size of the mini-batches (exercise: changeminibatch_size to 100 and 10000 in the program and compare).

computation_time

8.1.1.5. Notes

  • 1
  • Depending on the algorithm the mini-batch size can influence results ornot. SGD, PassiveAggressive, and discrete NaiveBayes are truly onlineand are not affected by batch size. Conversely, MiniBatchKMeansconvergence rate is affected by the batch size. Also, its memoryfootprint can vary dramatically with batch size.

8.2. Computational Performance

For some applications the performance (mainly latency and throughput atprediction time) of estimators is crucial. It may also be of interest toconsider the training throughput but this is often less important in aproduction setup (where it often takes place offline).

We will review here the orders of magnitude you can expect from a number ofscikit-learn estimators in different contexts and provide some tips andtricks for overcoming performance bottlenecks.

Prediction latency is measured as the elapsed time necessary to make aprediction (e.g. in micro-seconds). Latency is often viewed as a distributionand operations engineers often focus on the latency at a given percentile ofthis distribution (e.g. the 90 percentile).

Prediction throughput is defined as the number of predictions the software candeliver in a given amount of time (e.g. in predictions per second).

An important aspect of performance optimization is also that it can hurtprediction accuracy. Indeed, simpler models (e.g. linear instead ofnon-linear, or with fewer parameters) often run faster but are not always ableto take into account the same exact properties of the data as more complex ones.

8.2.1. Prediction Latency

One of the most straight-forward concerns one may have when using/choosing amachine learning toolkit is the latency at which predictions can be made in aproduction environment.

  • The main factors that influence the prediction latency are
    • Number of features

    • Input data representation and sparsity

    • Model complexity

    • Feature extraction

A last major parameter is also the possibility to do predictions in bulk orone-at-a-time mode.

8.2.1.1. Bulk versus Atomic mode

In general doing predictions in bulk (many instances at the same time) ismore efficient for a number of reasons (branching predictability, CPU cache,linear algebra libraries optimizations etc.). Here we see on a settingwith few features that independently of estimator choice the bulk mode isalways faster, and for some of them by 1 to 2 orders of magnitude:

atomic_prediction_latency

bulk_prediction_latency

To benchmark different estimators for your case you can simply change then_features parameter in this example:Prediction Latency. This should giveyou an estimate of the order of magnitude of the prediction latency.

8.2.1.2. Configuring Scikit-learn for reduced validation overhead

Scikit-learn does some validation on data that increases the overhead percall to predict and similar functions. In particular, checking thatfeatures are finite (not NaN or infinite) involves a full pass over thedata. If you ensure that your data is acceptable, you may suppresschecking for finiteness by setting the environment variableSKLEARN_ASSUME_FINITE to a non-empty string before importingscikit-learn, or configure it in Python with sklearn.set_config.For more control than these global settings, a config_contextallows you to set this configuration within a specified context:

>>>

  1. >>> import sklearn
  2. >>> with sklearn.config_context(assume_finite=True):
  3. ... pass # do learning/prediction here with reduced validation

Note that this will affect all uses ofsklearn.utils.assert_all_finite within the context.

8.2.1.3. Influence of the Number of Features

Obviously when the number of features increases so does the memoryconsumption of each example. Indeed, for a matrix of

8. Computing with scikit-learn - 图5 instanceswith8. Computing with scikit-learn - 图6 features, the space complexity is in8. Computing with scikit-learn - 图7.From a computing perspective it also means that the number of basic operations(e.g., multiplications for vector-matrix products in linear models) increasestoo. Here is a graph of the evolution of the prediction latency with thenumber of features:

influence_of_n_features_on_latency

Overall you can expect the prediction time to increase at least linearly withthe number of features (non-linear cases can happen depending on the globalmemory footprint and estimator).

8.2.1.4. Influence of the Input Data Representation

Scipy provides sparse matrix data structures which are optimized for storingsparse data. The main feature of sparse formats is that you don’t store zerosso if your data is sparse then you use much less memory. A non-zero value ina sparse (CSR or CSC)representation will only take on average one 32bit integer position + the 64bit floating point value + an additional 32bit per row or column in the matrix.Using sparse input on a dense (or sparse) linear model can speedup predictionby quite a bit as only the non zero valued features impact the dot productand thus the model predictions. Hence if you have 100 non zeros in 1e6dimensional space, you only need 100 multiply and add operation instead of 1e6.

Calculation over a dense representation, however, may leverage highly optimisedvector operations and multithreading in BLAS, and tends to result in fewer CPUcache misses. So the sparsity should typically be quite high (10% non-zerosmax, to be checked depending on the hardware) for the sparse inputrepresentation to be faster than the dense input representation on a machinewith many CPUs and an optimized BLAS implementation.

Here is sample code to test the sparsity of your input:

  1. def sparsity_ratio(X):
  2. return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
  3. print("input sparsity ratio:", sparsity_ratio(X))

As a rule of thumb you can consider that if the sparsity ratio is greaterthan 90% you can probably benefit from sparse formats. Check Scipy’s sparsematrix formats documentationfor more information on how to build (or convert your data to) sparse matrixformats. Most of the time the CSR and CSC formats work best.

8.2.1.5. Influence of the Model Complexity

Generally speaking, when model complexity increases, predictive power andlatency are supposed to increase. Increasing predictive power is usuallyinteresting, but for many applications we would better not increaseprediction latency too much. We will now review this idea for differentfamilies of supervised models.

For sklearn.linear_model (e.g. Lasso, ElasticNet,SGDClassifier/Regressor, Ridge & RidgeClassifier,PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression…) thedecision function that is applied at prediction time is the same (a dot product), so latency should be equivalent.

Here is an example usingsklearn.linear_model.SGDClassifier with theelasticnet penalty. The regularization strength is globally controlled bythe alpha parameter. With a sufficiently high alpha,one can then increase the l1_ratio parameter of elasticnet toenforce various levels of sparsity in the model coefficients. Higher sparsityhere is interpreted as less model complexity as we need fewer coefficients todescribe it fully. Of course sparsity influences in turn the prediction timeas the sparse dot-product takes time roughly proportional to the number ofnon-zero coefficients.

en_model_complexity

For the sklearn.svm family of algorithms with a non-linear kernel,the latency is tied to the number of support vectors (the fewer the faster).Latency and throughput should (asymptotically) grow linearly with the numberof support vectors in a SVC or SVR model. The kernel will also influence thelatency as it is used to compute the projection of the input vector once persupport vector. In the following graph the nu parameter ofsklearn.svm.NuSVR was used to influence the number ofsupport vectors.

nusvr_model_complexity

For sklearn.ensemble of trees (e.g. RandomForest, GBT,ExtraTrees etc) the number of trees and their depth play the mostimportant role. Latency and throughput should scale linearly with the numberof trees. In this case we used directly the n_estimators parameter ofsklearn.ensemble.gradient_boosting.GradientBoostingRegressor.

gbt_model_complexity

In any case be warned that decreasing model complexity can hurt accuracy asmentioned above. For instance a non-linearly separable problem can be handledwith a speedy linear model but prediction power will very likely suffer inthe process.

8.2.1.6. Feature Extraction Latency

Most scikit-learn models are usually pretty fast as they are implementedeither with compiled Cython extensions or optimized computing libraries.On the other hand, in many real world applications the feature extractionprocess (i.e. turning raw data like database rows or network packets intonumpy arrays) governs the overall prediction time. For example on the Reuterstext classification task the whole preparation (reading and parsing SGMLfiles, tokenizing the text and hashing it into a common vector space) istaking 100 to 500 times more time than the actual prediction code, depending onthe chosen model.

prediction_time

In many cases it is thus recommended to carefully time and profile yourfeature extraction code as it may be a good place to start optimizing whenyour overall latency is too slow for your application.

8.2.2. Prediction Throughput

Another important metric to care about when sizing production systems is thethroughput i.e. the number of predictions you can make in a given amount oftime. Here is a benchmark from thePrediction Latency example that measuresthis quantity for a number of estimators on synthetic data:

throughput_benchmark

These throughputs are achieved on a single process. An obvious way toincrease the throughput of your application is to spawn additional instances(usually processes in Python because of theGIL) that share thesame model. One might also add machines to spread the load. A detailedexplanation on how to achieve this is beyond the scope of this documentationthough.

8.2.3. Tips and Tricks

8.2.3.1. Linear algebra libraries

As scikit-learn relies heavily on Numpy/Scipy and linear algebra in general itmakes sense to take explicit care of the versions of these libraries.Basically, you ought to make sure that Numpy is built using an optimized BLAS /LAPACK library.

Not all models benefit from optimized BLAS and Lapack implementations. Forinstance models based on (randomized) decision trees typically do not rely onBLAS calls in their inner loops, nor do kernel SVMs (SVC, SVR,NuSVC, NuSVR). On the other hand a linear model implemented with aBLAS DGEMM call (via numpy.dot) will typically benefit hugely from a tunedBLAS implementation and lead to orders of magnitude speedup over anon-optimized BLAS.

You can display the BLAS / LAPACK implementation used by your NumPy / SciPy /scikit-learn install with the following commands:

  1. from numpy.distutils.system_info import get_info
  2. print(get_info('blas_opt'))
  3. print(get_info('lapack_opt'))
  • Optimized BLAS / LAPACK implementations include:
    • Atlas (need hardware specific tuning by rebuilding on the target machine)

    • OpenBLAS

    • MKL

    • Apple Accelerate and vecLib frameworks (OSX only)

More information can be found on the Scipy install pageand in thisblog postfrom Daniel Nouri which has some nice step by step install instructions forDebian / Ubuntu.

8.2.3.2. Limiting Working Memory

Some calculations when implemented using standard numpy vectorized operationsinvolve using a large amount of temporary memory. This may potentially exhaustsystem memory. Where computations can be performed in fixed-memory chunks, weattempt to do so, and allow the user to hint at the maximum size of thisworking memory (defaulting to 1GB) using sklearn.set_config orconfig_context. The following suggests to limit temporary workingmemory to 128 MiB:

>>>

  1. >>> import sklearn
  2. >>> with sklearn.config_context(working_memory=128):
  3. ... pass # do chunked work here

An example of a chunked operation adhering to this setting ismetric.pairwise_distances_chunked, which facilitates computingrow-wise reductions of a pairwise distance matrix.

8.2.3.3. Model Compression

Model compression in scikit-learn only concerns linear models for the moment.In this context it means that we want to control the model sparsity (i.e. thenumber of non-zero coordinates in the model vectors). It is generally a goodidea to combine model sparsity with sparse input data representation.

Here is sample code that illustrates the use of the sparsify() method:

  1. clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)
  2. clf.fit(X_train, y_train).sparsify()
  3. clf.predict(X_test)

In this example we prefer the elasticnet penalty as it is often a goodcompromise between model compactness and prediction power. One can alsofurther tune the l1_ratio parameter (in combination with theregularization strength alpha) to control this tradeoff.

A typical benchmarkon synthetic data yields a >30% decrease in latency when both the model andinput are sparse (with 0.000024 and 0.027400 non-zero coefficients ratiorespectively). Your mileage may vary depending on the sparsity and size ofyour data and model.Furthermore, sparsifying can be very useful to reduce the memory usage ofpredictive models deployed on production servers.

8.2.3.4. Model Reshaping

Model reshaping consists in selecting only a portion of the available featuresto fit a model. In other words, if a model discards features during thelearning phase we can then strip those from the input. This has severalbenefits. Firstly it reduces memory (and therefore time) overhead of themodel itself. It also allows to discard explicitfeature selection components in a pipeline once we know which features tokeep from a previous run. Finally, it can help reduce processing time and I/Ousage upstream in the data access and feature extraction layers by notcollecting and building features that are discarded by the model. For instanceif the raw data come from a database, it can make it possible to write simplerand faster queries or reduce I/O usage by making the queries return lighterrecords.At the moment, reshaping needs to be performed manually in scikit-learn.In the case of sparse input (particularly in CSR format), it is generallysufficient to not generate the relevant features, leaving their columns empty.

8.3. Parallelism, resource management, and configuration

8.3.1. Parallelism

Some scikit-learn estimators and utilities can parallelize costly operationsusing multiple CPU cores, thanks to the following components:

  • via the joblib library. Inthis case the number of threads or processes can be controlled with then_jobs parameter.

  • via OpenMP, used in C or Cython code.

In addition, some of the numpy routines that are used internally byscikit-learn may also be parallelized if numpy is installed with specificnumerical libraries such as MKL, OpenBLAS, or BLIS.

We describe these 3 scenarios in the following subsections.

8.3.1.1. Joblib-based parallelism

When the underlying implementation uses joblib, the number of workers(threads or processes) that are spawned in parallel can be controlled via then_jobs parameter.

Note

Where (and how) parallelization happens in the estimators is currentlypoorly documented. Please help us by improving our docs and tackle issue14228!

Joblib is able to support both multi-processing and multi-threading. Whetherjoblib chooses to spawn a thread or a process depends on the backendthat it’s using.

Scikit-learn generally relies on the loky backend, which is joblib’sdefault backend. Loky is a multi-processing backend. When doingmulti-processing, in order to avoid duplicating the memory in each process(which isn’t reasonable with big datasets), joblib will create a memmapthat all processes can share, when the data is bigger than 1MB.

In some specific cases (when the code that is run in parallel releases theGIL), scikit-learn will indicate to joblib that a multi-threadingbackend is preferable.

As a user, you may control the backend that joblib will use (regardless ofwhat scikit-learn recommends) by using a context manager:

  1. from joblib import parallel_backend
  2.  
  3. with parallel_backend('threading', n_jobs=2):
  4. # Your scikit-learn code here

Please refer to the joblib’s docsfor more details.

In practice, whether parallelism is helpful at improving runtime depends onmany factors. It is usually a good idea to experiment rather than assumingthat increasing the number of workers is always a good thing. In some casesit can be highly detrimental to performance to run multiple copies of someestimators or functions in parallel (see oversubscription below).

8.3.1.2. OpenMP-based parallelism

OpenMP is used to parallelize code written in Cython or C, relying onmulti-threading exclusively. By default (and unless joblib is trying toavoid oversubscription), the implementation will use as many threads aspossible.

You can control the exact number of threads that are used via theOMP_NUM_THREADS environment variable:

  1. OMP_NUM_THREADS=4 python my_script.py

8.3.1.3. Parallel Numpy routines from numerical libraries

Scikit-learn relies heavily on NumPy and SciPy, which internally callmulti-threaded linear algebra routines implemented in libraries such as MKL,OpenBLAS or BLIS.

The number of threads used by the OpenBLAS, MKL or BLIS libraries can be setvia the MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, andBLIS_NUM_THREADS environment variables.

Please note that scikit-learn has no direct control over theseimplementations. Scikit-learn solely relies on Numpy and Scipy.

Note

At the time of writing (2019), NumPy and SciPy packages distributed onpypi.org (used by pip) and on the conda-forge channel are linkedwith OpenBLAS, while conda packages shipped on the “defaults” channelfrom anaconda.org are linked by default with MKL.

8.3.1.4. Oversubscription: spawning too many threads

It is generally recommended to avoid using significantly more processes orthreads than the number of CPUs on a machine. Over-subscription happens whena program is running too many threads at the same time.

Suppose you have a machine with 8 CPUs. Consider a case where you’re runninga GridSearchCV (parallelized with joblib) with n_jobs=8 overa HistGradientBoostingClassifier (parallelized with OpenMP). Eachinstance of HistGradientBoostingClassifier will spawn 8 threads(since you have 8 CPUs). That’s a total of 8 * 8 = 64 threads, whichleads to oversubscription of physical CPU resources and to schedulingoverhead.

Oversubscription can arise in the exact same fashion with parallelizedroutines from MKL, OpenBLAS or BLIS that are nested in joblib calls.

Starting from joblib >= 0.14, when the loky backend is used (whichis the default), joblib will tell its child processes to limit thenumber of threads they can use, so as to avoid oversubscription. In practicethe heuristic that joblib uses is to tell the processes to use max_threads= n_cpus // n_jobs, via their corresponding environment variable. Back toour example from above, since the joblib backend of GridSearchCVis loky, each process will only be able to use 1 thread instead of 8,thus mitigating the oversubscription issue.

Note that:

  • Manually setting one of the environment variables (OMP_NUM_THREADS,MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, or BLIS_NUM_THREADS)will take precedence over what joblib tries to do. The total number ofthreads will be n_jobs * <LIB>_NUM_THREADS. Note that setting thislimit will also impact your computations in the main process, which willonly use <LIB>_NUM_THREADS. Joblib exposes a context manager forfiner control over the number of threads in its workers (see joblib docslinked below).

  • Joblib is currently unable to avoid oversubscription in amulti-threading context. It can only do so with the loky backend(which spawns processes).

You will find additional details about joblib mitigation of oversubscriptionin joblib documentation.

8.3.2. Configuration switches

8.3.2.1. Python runtime

sklearn.set_config controls the following behaviors:

  • assume_finite
  • used to skip validation, which enables faster computations but maylead to segmentation faults if the data contains NaNs.

  • working_memory

  • the optimal size of temporary arrays used by some algorithms.

8.3.2.2. Environment variables

These environment variables should be set before importing scikit-learn.

  • SKLEARN_SITE_JOBLIB
  • When this environment variable is set to a non zero value,scikit-learn uses the site joblib rather than its vendored version.Consequently, joblib must be installed for scikit-learn to run.Note that using the site joblib is at your own risks: the versions ofscikit-learn and joblib need to be compatible. Currently, joblib 0.11+is supported. In addition, dumps from joblib.Memory might be incompatible,and you might loose some caches and have to redownload some datasets.

Deprecated since version 0.21: As of version 0.21 this parameter has no effect, vendored joblib wasremoved and site joblib is always used.

  • SKLEARN_ASSUME_FINITE
  • Sets the default value for the assume_finite argument ofsklearn.set_config.

  • SKLEARN_WORKING_MEMORY

  • Sets the default value for the working_memory argument ofsklearn.set_config.

  • SKLEARN_SEED

  • Sets the seed of the global random generator when running the tests,for reproducibility.

  • SKLEARN_SKIP_NETWORK_TESTS

  • When this environment variable is set to a non zero value, the teststhat need network access are skipped.