@@ -10,38 +10,46 @@ Parallelism, resource management, and configuration
1010Parallelism
1111-----------
1212
13- Some scikit-learn estimators and utilities can parallelize costly operations
14- using multiple CPU cores, thanks to the following components:
13+ Some scikit-learn estimators and utilities parallelize costly operations
14+ using multiple CPU cores.
1515
16- - via the `joblib <https://joblib.readthedocs.io/en/latest/ >`_ library. In
17- this case the number of threads or processes can be controlled with the
18- ``n_jobs `` parameter.
19- - via OpenMP, used in C or Cython code.
16+ Depending on the type of estimator and sometimes the values of the
17+ constructor parameters, this is either done:
2018
21- In addition, some of the numpy routines that are used internally by
22- scikit-learn may also be parallelized if numpy is installed with specific
23- numerical libraries such as MKL, OpenBLAS, or BLIS.
19+ - with higher-level parallelism via `joblib <https://joblib.readthedocs.io/en/latest/>`_.
20+ - with lower-level parallelism via OpenMP, used in C or Cython code.
21+ - with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations
22+   on arrays.
2423
25- We describe these 3 scenarios in the following subsections.
24+ The `n_jobs` parameter of estimators always controls the amount of parallelism
25+ managed by joblib (processes or threads depending on the joblib backend).
26+ The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code,
27+ or by the BLAS & LAPACK libraries that NumPy and SciPy use internally,
28+ is always controlled by environment variables or `threadpoolctl` as explained below.
29+ Note that some estimators can leverage all three kinds of parallelism at different
30+ points of their training and prediction methods.
2631
27- Joblib-based parallelism
28- ........................
32+ We describe these 3 types of parallelism in more detail in the following subsections.
33+
34+ Higher-level parallelism with joblib
35+ ....................................
2936
3037When the underlying implementation uses joblib, the number of workers
3138(threads or processes) that are spawned in parallel can be controlled via the
3239``n_jobs`` parameter.
3340
3441.. note ::
3542
36- Where (and how) parallelization happens in the estimators is currently
37- poorly documented. Please help us by improving our docs and tackle `issue
38- 14228 <https://github.com/scikit-learn/scikit-learn/issues/14228> `_!
43+ Where (and how) parallelization happens in the estimators using joblib by
44+ specifying `n_jobs` is currently poorly documented.
45+ Please help us by improving our docs and tackling `issue 14228
46+ <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
3947
4048Joblib is able to support both multi-processing and multi-threading. Whether
4149joblib chooses to spawn a thread or a process depends on the **backend **
4250that it's using.
4351
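As a concrete illustration (this snippet is not from the scikit-learn docs; joblib is a hard dependency of scikit-learn, so it is always available), the ``n_jobs`` knob maps directly onto joblib's own ``Parallel`` API:

```python
from math import sqrt

from joblib import Parallel, delayed

# Dispatch 10 independent square-root computations to 2 joblib workers.
# With the default loky backend these run in separate processes.
results = Parallel(n_jobs=2)(delayed(sqrt)(i * i) for i in range(10))
print(results)  # [0.0, 1.0, 2.0, ..., 9.0]
```

Passing ``n_jobs=-1`` instead requests as many workers as there are CPUs.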
44- Scikit -learn generally relies on the ``loky `` backend, which is joblib's
52+ scikit-learn generally relies on the ``loky`` backend, which is joblib's
4553default backend. Loky is a multi-processing backend. When doing
4654multi-processing, in order to avoid duplicating the memory in each process
4755(which isn't reasonable with big datasets), joblib will create a `memmap
@@ -70,40 +78,57 @@ that increasing the number of workers is always a good thing. In some cases
7078it can be highly detrimental to performance to run multiple copies of some
7179estimators or functions in parallel (see oversubscription below).
7280
73- OpenMP-based parallelism
74- ........................
81+ Lower-level parallelism with OpenMP
82+ ...................................
7583
7684OpenMP is used to parallelize code written in Cython or C, relying on
77- multi-threading exclusively. By default (and unless joblib is trying to
78- avoid oversubscription), the implementation will use as many threads as
79- possible.
85+ multi-threading exclusively. By default, the implementations using OpenMP
86+ will use as many threads as possible, i.e. as many threads as logical cores.
8087
81- You can control the exact number of threads that are used via the
82- ``OMP_NUM_THREADS `` environment variable:
88+ You can control the exact number of threads that are used either:
8389
84- .. prompt :: bash $
90+ - via the ``OMP_NUM_THREADS`` environment variable, for instance when
91+   running a python script:
92+
93+   .. prompt:: bash $
94+
95+     OMP_NUM_THREADS=4 python my_script.py
8596
86- OMP_NUM_THREADS=4 python my_script.py
97+ - or via `threadpoolctl` as explained by `this piece of documentation
98+   <https://github.com/joblib/threadpoolctl/#setting-the-maximum-size-of-thread-pools>`_.
8799
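For instance, a minimal `threadpoolctl` sketch (not taken from the scikit-learn codebase; ``threadpool_limits`` and ``threadpool_info`` are the library's public API) that temporarily caps OpenMP thread pools:

```python
from threadpoolctl import threadpool_info, threadpool_limits

# Temporarily cap every OpenMP thread pool loaded in this process at 2
# threads; the previous limits are restored when the block exits.
with threadpool_limits(limits=2, user_api="openmp"):
    pools = threadpool_info()  # list of dicts describing the active pools

for pool in pools:
    print(pool["user_api"], pool["num_threads"])
```

If no OpenMP runtime has been loaded yet (e.g. NumPy not imported), the reported list may simply be empty.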
88- Parallel Numpy routines from numerical libraries
89- ................................................
100+ Parallel NumPy and SciPy routines from numerical libraries
101+ ..........................................................
90102
91- Scikit -learn relies heavily on NumPy and SciPy, which internally call
92- multi-threaded linear algebra routines implemented in libraries such as MKL,
93- OpenBLAS or BLIS.
103+ scikit-learn relies heavily on NumPy and SciPy, which internally call
104+ multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries
105+ such as MKL, OpenBLAS or BLIS.
94106
95- The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
96- via the ``MKL_NUM_THREADS ``, ``OPENBLAS_NUM_THREADS ``, and
97- ``BLIS_NUM_THREADS `` environment variables.
107+ You can control the exact number of threads used by each BLAS implementation
108+ with dedicated environment variables, namely:
109+
110+ - ``MKL_NUM_THREADS`` sets the number of threads MKL uses,
111+ - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses,
112+ - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses.
113+
114+ Note that BLAS & LAPACK implementations can also be impacted by
115+ `OMP_NUM_THREADS`. To check whether this is the case in your environment,
116+ you can inspect how the number of threads effectively used by those libraries
117+ is affected when running the following command in a bash or zsh terminal
118+ for different values of `OMP_NUM_THREADS`:
119+
120+ .. prompt:: bash $
98121
99- Please note that scikit-learn has no direct control over these
100- implementations. Scikit-learn solely relies on Numpy and Scipy.
122+ OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy
101123
102124.. note ::
103- At the time of writing (2019), NumPy and SciPy packages distributed on
104- pypi.org (used by ``pip ``) and on the conda-forge channel are linked
105- with OpenBLAS, while conda packages shipped on the "defaults" channel
106- from anaconda.org are linked by default with MKL.
125+ At the time of writing (2022), NumPy and SciPy packages which are
126+ distributed on pypi.org (i.e. the ones installed via ``pip install``)
127+ and on the conda-forge channel (i.e. the ones installed via
128+ ``conda install --channel conda-forge``) are linked with OpenBLAS, while
129+ NumPy and SciPy packages shipped on the ``defaults`` conda
130+ channel from Anaconda.org (i.e. the ones installed via ``conda install``)
131+ are linked by default with MKL.
107132
108133
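The same check can be done programmatically; this hedged sketch (using `threadpoolctl`'s public ``threadpool_info`` function) reports which BLAS implementation NumPy is linked against in the current environment:

```python
import numpy as np  # importing NumPy loads its BLAS library
from threadpoolctl import threadpool_info

# Keep only the thread pools exposed through the BLAS user API.
blas_pools = [p for p in threadpool_info() if p["user_api"] == "blas"]
for pool in blas_pools:
    # e.g. "openblas" or "mkl", plus the thread count currently in use
    print(pool["internal_api"], pool["num_threads"])
```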
109134Oversubscription: spawning too many threads
@@ -120,8 +145,8 @@ with ``n_jobs=8`` over a
120145OpenMP). Each instance of
121146:class: `~sklearn.ensemble.HistGradientBoostingClassifier ` will spawn 8 threads
122147(since you have 8 CPUs). That's a total of ``8 * 8 = 64 `` threads, which
123- leads to oversubscription of physical CPU resources and to scheduling
124- overhead.
148+ leads to oversubscription of threads for physical CPU resources and thus
149+ to scheduling overhead.
125150
126151Oversubscription can arise in the exact same fashion with parallelized
127152routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
@@ -146,38 +171,34 @@ Note that:
146171 only use ``<LIB>_NUM_THREADS ``. Joblib exposes a context manager for
147172 finer control over the number of threads in its workers (see joblib docs
148173 linked below).
149- - Joblib is currently unable to avoid oversubscription in a
150- multi-threading context. It can only do so with the ``loky `` backend
151- (which spawns processes).
174+ - When joblib is configured to use the ``threading`` backend, there is no
175+   mechanism to avoid oversubscription when calling into parallel native
176+   libraries from the joblib-managed threads.
177+ - All scikit-learn estimators that explicitly rely on OpenMP in their Cython code
178+   always use `threadpoolctl` internally to automatically adapt the number of
179+   threads used by OpenMP and potentially nested BLAS calls so as to avoid
180+   oversubscription.
152181
153182You will find additional details about joblib mitigation of oversubscription
154183in `joblib documentation
155184<https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources> `_.
156185
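One of the joblib knobs documented there can be sketched as follows (``inner_max_num_threads`` is real joblib API; the worker and thread counts here are only illustrative):

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    return x * x


# Request 2 loky worker processes while capping the native (OpenMP/BLAS)
# thread pools inside each worker to a single thread, avoiding the
# "n_jobs * n_threads" oversubscription pattern described above.
with parallel_backend("loky", inner_max_num_threads=1):
    out = Parallel(n_jobs=2)(delayed(square)(i) for i in range(6))
print(out)  # [0, 1, 4, 9, 16, 25]
```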
186+ You will find additional details about parallelism in numerical python libraries
187+ in `this document from Thomas J. Fan <https://thomasjpfan.github.io/parallelism-python-libraries-design/ >`_.
157188
158189Configuration switches
159190-----------------------
160191
161- Python runtime
162- ..............
192+ Python API
193+ ..........
163194
164- :func: `sklearn.set_config ` controls the following behaviors:
165-
166- `assume_finite `
167- ~~~~~~~~~~~~~~~
168-
169- Used to skip validation, which enables faster computations but may lead to
170- segmentation faults if the data contains NaNs.
171-
172- `working_memory `
173- ~~~~~~~~~~~~~~~~
174-
175- The optimal size of temporary arrays used by some algorithms.
195+ :func:`sklearn.set_config` and :func:`sklearn.config_context` can be used to change
196+ parameters of the configuration which control aspects of parallelism.
176197
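For example, a small sketch of the public config API (``assume_finite`` and ``working_memory`` are documented parameters; the values chosen here are arbitrary):

```python
import sklearn

# Temporarily change the global configuration: skip finiteness validation
# and cap chunked temporary arrays at 64 MiB; settings revert on exit.
with sklearn.config_context(assume_finite=True, working_memory=64):
    config = sklearn.get_config()

print(config["assume_finite"], config["working_memory"])  # True 64
```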
177198.. _environment_variable :
178199
179200Environment variables
180- ......................
201+ .....................
181202
182203These environment variables should be set before importing scikit-learn.
183204
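A hedged sketch of setting one of these variables from Python itself (the value ``4`` is arbitrary; exporting the variable in the shell before launching Python works equally well):

```python
import os

# Thread-pool sizes are read when the native libraries initialize, so the
# variable must be set before scikit-learn (and NumPy/SciPy) are first
# imported; changing it later has no effect on already-created pools.
os.environ["OMP_NUM_THREADS"] = "4"

import sklearn  # noqa: E402  (deliberate import after environment setup)
```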
@@ -277,3 +298,14 @@ float64 data.
277298When this environment variable is set to a non zero value, the `Cython `
278299derivative, `boundscheck ` is set to `True `. This is useful for finding
279300segfaults.
301+
302+ `SKLEARN_PAIRWISE_DIST_CHUNK_SIZE`
303+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
304+
305+ This sets the chunk size to be used by the underlying `PairwiseDistancesReductions`
306+ implementations. The default value is `256`, which has been shown to be adequate on
307+ most machines.
308+
309+ Users looking for the best performance might want to tune this variable using
310+ powers of 2 so as to get the best parallelism behavior for their hardware,
311+ especially with respect to their caches' sizes.