.. role:: python(code)
   :language: python

How to deploy tsfresh at scale
==============================

Time series data can reach volumes that demand analysis at scale,
so the time series need to be processed on a group of computational units instead of a single machine.

Accordingly, it may be necessary to distribute the extraction of time series features to a cluster.
Indeed, it is possible to extract features with *tsfresh* in a distributed fashion.
This page explains how to set up a distributed *tsfresh* calculation.

The distributor class
'''''''''''''''''''''

To distribute the calculation of features, we use an instance of a Distributor class (contained in the
:mod:`tsfresh.utilities.distribution` module).

Essentially, a Distributor organizes the application of feature calculators to data chunks.
It maps the feature calculators to the data chunks and then reduces them, meaning that it combines the results of the
individual mappings into one object, the feature matrix.

So, a Distributor will, in the following order,

    1. calculate an optimal :python:`chunk_size`, based on the characteristics of the time series data at hand
       (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.calculate_best_chunk_size`)

    2. split the time series data into chunks
       (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.partition`)

    3. distribute the application of the feature calculators to the data chunks
       (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.distribute`)

    4. combine the results into the feature matrix
       (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.map_reduce`)

    5. close all connections, shut down all resources and clean everything up
       (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.close`)

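The numbered steps can be pictured as a small, framework-free sketch of the map-reduce flow. The helper ``simple_map_reduce`` below is purely illustrative and not part of the *tsfresh* API; a real Distributor hands the chunks to its workers instead of looping over them:

```python
def simple_map_reduce(map_function, data, n_chunks=4):
    # 1. choose a chunk size based on the data at hand
    chunk_size = max(1, len(data) // n_chunks)

    # 2. split the data into chunks
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)]

    # 3. apply the map function to every item of every chunk
    #    (sequentially here; a real Distributor would distribute
    #    this work over its workers)
    mapped = [[map_function(item) for item in chunk] for chunk in chunks]

    # 4. combine the partial results into one flat result
    return [result for chunk in mapped for result in chunk]


print(simple_map_reduce(lambda x: x * x, list(range(6))))
# [0, 1, 4, 9, 16, 25]
```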
So, how can you use such a Distributor to extract features with *tsfresh*?
You will have to pass it as the :python:`distributor` argument to the :func:`~tsfresh.feature_extraction.extract_features`
function.


The following example shows how to define the MultiprocessingDistributor, which will distribute the calculations to a
local pool of processes:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import MultiprocessingDistributor

    # download and load some time series data
    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    # We construct a Distributor that will spawn the calculations
    # over four workers on the local machine
    Distributor = MultiprocessingDistributor(n_workers=4,
                                             disable_progressbar=False,
                                             progressbar_title="Feature Extraction")

    # pass the Distributor object to the feature extraction,
    # along with the other parameters
    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)

This example actually corresponds to the existing multiprocessing *tsfresh* API, where you just specify the number of
jobs, without the need to construct a Distributor:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         n_jobs=4)

Using dask to distribute the calculations
'''''''''''''''''''''''''''''''''''''''''

We provide a Distributor for the `dask framework <https://dask.pydata.org/en/latest/>`_, where
*"Dask is a flexible parallel computing library for analytic computing."*

Dask is a great framework to distribute analytic calculations to a cluster.
It scales up and down, meaning that you can even use it on a single machine.
The only thing that you will need to run *tsfresh* on a Dask cluster is the IP address and port number of the
`dask-scheduler <http://distributed.readthedocs.io/en/latest/setup.html>`_.

Let's say that your dask scheduler is running at ``192.168.0.1:8786``. Then we can easily construct a
:class:`~tsfresh.utilities.distribution.ClusterDaskDistributor` that connects to the scheduler and distributes the
time series data and the calculation to a cluster:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import ClusterDaskDistributor

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    Distributor = ClusterDaskDistributor(address="192.168.0.1:8786")

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)

Compared to the :class:`~tsfresh.utilities.distribution.MultiprocessingDistributor` example from above, we only had to
change one line to switch from one machine to a whole cluster.
It is as easy as that.
By changing the Distributor you can easily deploy your application to run on a cluster instead of your workstation.

You can also run a Dask cluster on your local machine to emulate a Dask network.
The following example shows how to set up a :class:`~tsfresh.utilities.distribution.LocalDaskDistributor` on a local cluster
of 3 workers:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import LocalDaskDistributor

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    Distributor = LocalDaskDistributor(n_workers=3)

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)

Writing your own distributor
''''''''''''''''''''''''''''

If you want to use a framework other than Dask, you will have to write your own Distributor.
To construct your custom Distributor, you will have to define an object that inherits from the abstract base class
:class:`tsfresh.utilities.distribution.DistributorBaseClass`.
The :mod:`tsfresh.utilities.distribution` module contains more information about what you will need to implement.
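
As a rough illustration, the sketch below shows the shape such a custom Distributor might take: a class whose ``distribute`` method applies the map function to the partitioned chunks, plus a ``close`` method for cleanup. It deliberately does not inherit from the real base class so that it stays self-contained; the exact interface depends on your *tsfresh* version, so treat this as a sketch rather than a drop-in implementation:

```python
# Hypothetical sketch of a custom Distributor that runs the map step
# sequentially. In real code it would inherit from
# tsfresh.utilities.distribution.DistributorBaseClass; the base class
# is omitted here so the sketch stays self-contained and runnable.

class SequentialDistributor:
    """Applies the map function to every chunk, one after the other."""

    def distribute(self, map_function, partitioned_chunks, kwargs):
        # the core hook: apply map_function to each chunk of data,
        # forwarding any extra keyword arguments
        return [map_function(chunk, **kwargs) for chunk in partitioned_chunks]

    def close(self):
        # no worker pools or connections to shut down
        pass


distributor = SequentialDistributor()
result = distributor.distribute(
    lambda chunk, scale: [scale * x for x in chunk],
    [[1, 2], [3, 4]],
    {"scale": 10},
)
print(result)
# [[10, 20], [30, 40]]
distributor.close()
```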