Commit ba8e3db

MaxBenChrist authored and jneuff committed
Extract distribution module and add support for dask.distributed
* Extract the multiprocessing and distributed computing into its own module to make it more extendable
* Added more comments
* Use an instance instead of the class itself
* Add the decode function to the partial evaluation
* First (unfinished) local testing version of big_fresh using lambdas
* big_fresh is not its own project
* Temporarily removed the decode
* Fixed bugs (e.g. after merging)
* Draft for is_valid_ip_and_port method
* Use the ipaddress package to check if it is an IP
* Unit testing for is_valid_ip_and_port
* Add ipaddress to requirements
* Credit where it belongs!
* Added dask-requirements.txt
* Added cluster dask distributor
* Address is sufficient to connect to dask cluster
* We support cluster calculations now
* Need custom chunksize for dask on cluster
* Close cluster dask client
* Added unit test for partition
* Add unit tests for calculate best chunksize
* Added string manipulation and distribution to docs
* Refactored distribution module
* Added warning if no Distributor object is given
* py3 does not have .next method
* Corrected extract_features docstring
* Removed automatic construction of ClusterDaskDistributor
* Check IP address before passing it to ClusterDaskDistributor
* Made the Distributor class abstract
* Add __init__ for MapDistributor
* MapDistributor does not need n_workers argument
* Only use tqdm if distributor supports progressbar
* It is chunk_size, not chunksize
* Refactoring of distribution module
* Cleaned imports in extraction
* Do not test the abstract Distributor base class
* Still working on docstrings for distribution
* Test MultiprocessingDistributor
* MapDistributor should use chunk_size 1
* Set defaults for progress bar
* Fix MultiprocessingDistributorTestCase
* New page about distributed freshness
* Removed empty returns for close methods
* Correct references in docs
* Renamed Distributor to DistributorBaseClass
* Add minimal example for distributor
* Now the example is working
* Refactoring
* Polish doc page
* Pass over cluster page
* No need to make calculate_best_chunk_size private
* Another pass over cluster doc page
* Correct warning
* Add DistributorUsageTestCase
* Most todos done
* Move dask-requirements into requirements.txt
* Save the number of workers
* Dask distributors need to set partial as well
* Close distributors earlier
* Need to flatten dask results
* Add unit test for local dask distributor
* Use threads for local dask cluster, not processes
* Removed is_valid_ip_and_port
* Remove unnecessary imports
1 parent abc3021 commit ba8e3db

File tree

11 files changed: +686 −54 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ The algorithm is described in the following paper
 4. it has a comprehensive documentation
 5. it is compatible with sklearn, pandas and numpy
 6. it allows anyone to easily add their favorite features
+7. it both runs on your local machine or even on a cluster

 ## Next steps

docs/api/tsfresh.utilities.rst

Lines changed: 15 additions & 0 deletions
@@ -22,3 +22,18 @@ profiling
     :undoc-members:
     :show-inheritance:
+
+string_manipulation
+-------------------
+
+.. automodule:: tsfresh.utilities.string_manipulation
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+distribution
+------------
+
+.. automodule:: tsfresh.utilities.distribution
+    :members:
+    :undoc-members:
+    :show-inheritance:

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ The following chapters will explain the tsfresh package in detail:
     Feature Filtering <text/feature_filtering>
     How to write custom Feature Calculators <text/how_to_add_custom_feature>
     Parallelization <text/parallelization>
+    tsfresh on a cluster <text/tsfresh_on_a_cluster>
     Time Series Forecasting <text/forecasting>
     FAQ <text/faq>
     Authors <authors>

docs/text/tsfresh_on_a_cluster.rst

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
.. role:: python(code)
    :language: python

How to deploy tsfresh at scale
==============================

The high volume of time series data can demand an analysis at scale, so that the time series need to be processed on a group of computational units instead of a single machine.

Accordingly, it may be necessary to distribute the extraction of time series features to a cluster. Indeed, it is possible to extract features with *tsfresh* in a distributed fashion. This page will explain how to set up a distributed *tsfresh*.

The distributor class
'''''''''''''''''''''

To distribute the calculation of features, we use a certain object, the Distributor class (contained in the :mod:`tsfresh.utilities.distribution` module).

Essentially, a Distributor organizes the application of the feature calculators to data chunks. It maps the feature calculators to the data chunks and then reduces them, meaning that it combines the results of the individual mappings into one object, the feature matrix.

So, a Distributor will, in the following order,

1. calculate an optimal :python:`chunk_size`, based on the characteristics of the time series data at hand
   (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.calculate_best_chunk_size`)

2. split the time series data into chunks
   (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.partition`)

3. distribute the application of the feature calculators to the data chunks
   (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.distribute`)

4. combine the results into the feature matrix
   (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.map_reduce`)

5. close all connections, shut down all resources and clean up everything
   (by :func:`~tsfresh.utilities.distribution.DistributorBaseClass.close`)
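The five steps above can be sketched in plain Python. This is a conceptual illustration only, not the tsfresh implementation: the function name, signatures and the "five chunks per worker" heuristic are assumptions for the sketch.

```python
import math
from itertools import islice


def map_reduce_sketch(map_function, data, n_workers=4):
    """Conceptual sketch of the Distributor pipeline described above:
    chunk-size heuristic -> partition -> distribute (here: a plain,
    sequential map) -> reduce -> close (a no-op in this toy version)."""
    # 1. pick a chunk size (assumed heuristic: roughly 5 chunks per worker)
    chunk_size = math.ceil(len(data) / (5 * n_workers)) or 1
    # 2. split the data into chunks of that size
    it = iter(data)
    chunks = iter(lambda: list(islice(it, chunk_size)), [])
    # 3. apply the map function to every chunk (a real distributor
    #    would farm this out to worker processes or cluster nodes)
    mapped = (map_function(chunk) for chunk in chunks)
    # 4. combine (flatten) the per-chunk results into one result list
    result = [item for partial in mapped for item in partial]
    # 5. nothing to close in this sequential version
    return result


squares = map_reduce_sketch(lambda chunk: [x * x for x in chunk], list(range(10)))
# squares == [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```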
So, how can you use such a Distributor to extract features with *tsfresh*? You will have to pass it as the :python:`distributor` argument to the :func:`~tsfresh.feature_extraction.extract_features` method.

The following example shows how to define the MultiprocessingDistributor, which will distribute the calculations to a local pool of worker processes:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import MultiprocessingDistributor

    # download and load some time series data
    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    # We construct a Distributor that will spawn the calculations
    # over four workers on the local machine
    Distributor = MultiprocessingDistributor(n_workers=4,
                                             disable_progressbar=False,
                                             progressbar_title="Feature Extraction")

    # Just pass the Distributor object to the feature extraction,
    # along with the other parameters
    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)

This example actually corresponds to the existing multiprocessing *tsfresh* API, where you just specify the number of jobs without the need to construct the Distributor:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         n_jobs=4)
Using dask to distribute the calculations
'''''''''''''''''''''''''''''''''''''''''

We provide a Distributor for the `dask framework <https://dask.pydata.org/en/latest/>`_, where *"Dask is a flexible parallel computing library for analytic computing."*

Dask is a great framework to distribute analytic calculations to a cluster. It scales up and down, meaning that you can even use it on a single machine. The only thing that you will need to run *tsfresh* on a Dask cluster is the IP address and port number of the `dask-scheduler <http://distributed.readthedocs.io/en/latest/setup.html>`_.

Let's say that your dask scheduler is running at ``192.168.0.1:8786``; then we can easily construct a :class:`~tsfresh.utilities.distribution.ClusterDaskDistributor` that connects to the scheduler and distributes the time series data and the calculations to the cluster:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import ClusterDaskDistributor

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    Distributor = ClusterDaskDistributor(address="192.168.0.1:8786")

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)

Compared to the :class:`~tsfresh.utilities.distribution.MultiprocessingDistributor` example from above, we only had to change one line to switch from one machine to a whole cluster. It is as easy as that. By changing the Distributor you can easily deploy your application to run on a cluster instead of your workstation.

You can also use a local DaskCluster on your local machine to emulate a Dask network. The following example shows how to set up a :class:`~tsfresh.utilities.distribution.LocalDaskDistributor` on a local cluster of 3 workers:

.. code:: python

    from tsfresh.examples.robot_execution_failures import \
        download_robot_execution_failures, \
        load_robot_execution_failures
    from tsfresh.feature_extraction import extract_features
    from tsfresh.utilities.distribution import LocalDaskDistributor

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()

    Distributor = LocalDaskDistributor(n_workers=3)

    X = extract_features(timeseries_container=df,
                         column_id='id', column_sort='time',
                         distributor=Distributor)
Writing your own distributor
''''''''''''''''''''''''''''

If you want to use a framework other than Dask, you will have to write your own Distributor. To construct your custom Distributor, you will have to define an object that inherits from the abstract base class :class:`tsfresh.utilities.distribution.DistributorBaseClass`. The :mod:`tsfresh.utilities.distribution` module contains more information about what you will need to implement.
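A custom distributor typically only needs to override the method that fans the work out to its backend. The following standalone sketch illustrates the idea; the base class here is a stand-in written for this example, not the real ``DistributorBaseClass``, and the ``distribute`` signature is an assumption mirroring the description above.

```python
class DistributorSketch:
    """Stand-in for an abstract distributor base class:
    subclasses override distribute() to plug in their backend."""

    def distribute(self, func, partitioned_chunks, kwargs):
        raise NotImplementedError

    def close(self):
        # Release resources (connections, worker pools) if any.
        pass


class SequentialDistributor(DistributorSketch):
    """A trivial 'cluster' of one: applies func to each chunk in turn."""

    def distribute(self, func, partitioned_chunks, kwargs):
        return [func(chunk, **kwargs) for chunk in partitioned_chunks]


dist = SequentialDistributor()
results = dist.distribute(lambda chunk, factor: [x * factor for x in chunk],
                          [[1, 2], [3, 4]], {"factor": 10})
dist.close()
# results == [[10, 20], [30, 40]]
```

A framework-backed distributor would replace the list comprehension in ``distribute`` with submissions to its scheduler and gather the results.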

requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -8,3 +8,6 @@ scikit-learn>=0.17.1
 future>=0.16.0
 six>=1.10.0
 tqdm>=4.10.0
+ipaddress
+dask==0.15.2
+distributed==1.18.3

tests/units/feature_extraction/test_extraction.py

Lines changed: 37 additions & 0 deletions
@@ -9,13 +9,17 @@
 import numpy as np
 import pandas as pd
 import six
+from mock import Mock

 from tests.fixtures import DataTestCase
 from tsfresh.feature_extraction.extraction import extract_features
 from tsfresh.feature_extraction.settings import ComprehensiveFCParameters

 import tempfile

+from tsfresh.utilities.distribution import DistributorBaseClass
+
 class ExtractionTestCase(DataTestCase):
     """The unit tests in this module make sure if the time series features are created properly"""

@@ -170,3 +174,36 @@ def test_extract_features(self):
         self.assertTrue(np.all(extracted_features.b__abs_energy == np.array([36619, 35483])))
         self.assertTrue(np.all(extracted_features.b__mean == np.array([37.85, 34.75])))
         self.assertTrue(np.all(extracted_features.b__median == np.array([39.5, 28.0])))
+
+
+class DistributorUsageTestCase(DataTestCase):
+    def setUp(self):
+        # only calculate some features to reduce load on travis ci
+        self.name_to_param = {"maximum": None}
+
+    def test_assert_is_distributor(self):
+        df = self.create_test_data_sample()
+
+        self.assertRaises(ValueError, extract_features,
+                          timeseries_container=df, column_id="id", column_sort="sort", column_kind="kind",
+                          column_value="val", default_fc_parameters=self.name_to_param,
+                          distributor=object())
+
+        self.assertRaises(ValueError, extract_features,
+                          timeseries_container=df, column_id="id", column_sort="sort", column_kind="kind",
+                          column_value="val", default_fc_parameters=self.name_to_param,
+                          distributor=13)
+
+    def test_distributor_map_reduce_and_close_are_called(self):
+        df = self.create_test_data_sample()
+
+        mock = Mock(spec=DistributorBaseClass)
+        mock.close.return_value = None
+        mock.map_reduce.return_value = []
+
+        X = extract_features(timeseries_container=df, column_id="id", column_sort="sort", column_kind="kind",
+                             column_value="val", default_fc_parameters=self.name_to_param,
+                             distributor=mock)
+
+        self.assertTrue(mock.close.called)
+        self.assertTrue(mock.map_reduce.called)
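The mock-based test above works because ``Mock(spec=SomeClass)`` both passes ``isinstance`` checks against that class and records every call made on it. A minimal, self-contained sketch of the pattern, using the standard library ``unittest.mock`` (the diff imports the older ``mock`` backport) and a toy ``Base`` class standing in for ``DistributorBaseClass``:

```python
from unittest.mock import Mock


class Base:
    """Stand-in for an abstract base class such as DistributorBaseClass."""

    def map_reduce(self, *args, **kwargs):
        raise NotImplementedError

    def close(self):
        raise NotImplementedError


# A Mock built with spec=Base passes isinstance(mock, Base) checks
# and rejects attribute access that Base does not define.
mock = Mock(spec=Base)
mock.map_reduce.return_value = []

assert isinstance(mock, Base)          # spec makes the isinstance check succeed
assert mock.map_reduce("data") == []   # stubbed return value
mock.close()
assert mock.close.called               # the call was recorded
```

This lets a test verify that ``extract_features`` calls ``map_reduce`` and ``close`` without running any real distribution.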
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-
# This file as well as the whole tsfresh package are licenced under the MIT licence (see the LICENCE.txt)
# Maximilian Christ (maximilianchrist.com), Blue Yonder Gmbh, 2016

from unittest import TestCase
import numpy as np
import pandas as pd

from tsfresh import extract_features
from tsfresh.utilities.distribution import MultiprocessingDistributor, LocalDaskDistributor
from tests.fixtures import DataTestCase


class MultiprocessingDistributorTestCase(TestCase):

    def test_partion(self):

        distributor = MultiprocessingDistributor(n_workers=1)

        data = [1, 3, 10, -10, 343.0]
        distro = distributor.partition(data, 3)
        self.assertEqual(next(distro), [1, 3, 10])
        self.assertEqual(next(distro), [-10, 343.0])

        data = np.arange(10)
        distro = distributor.partition(data, 2)
        self.assertEqual(next(distro), [0, 1])
        self.assertEqual(next(distro), [2, 3])

    def test__calculate_best_chunk_size(self):

        distributor = MultiprocessingDistributor(n_workers=2)
        self.assertEqual(distributor.calculate_best_chunk_size(10), 1)
        self.assertEqual(distributor.calculate_best_chunk_size(11), 2)
        self.assertEqual(distributor.calculate_best_chunk_size(100), 10)
        self.assertEqual(distributor.calculate_best_chunk_size(101), 11)

        distributor = MultiprocessingDistributor(n_workers=3)
        self.assertEqual(distributor.calculate_best_chunk_size(10), 1)
        self.assertEqual(distributor.calculate_best_chunk_size(30), 2)
        self.assertEqual(distributor.calculate_best_chunk_size(31), 3)


class LocalDaskDistributorTestCase(DataTestCase):

    def test_local_dask_cluster_extraction(self):

        Distributor = LocalDaskDistributor(n_workers=1)

        df = self.create_test_data_sample()
        extracted_features = extract_features(df, column_id="id", column_sort="sort", column_kind="kind",
                                              column_value="val",
                                              distributor=Distributor)

        self.assertIsInstance(extracted_features, pd.DataFrame)
        self.assertTrue(np.all(extracted_features.a__maximum == np.array([71, 77])))
        self.assertTrue(np.all(extracted_features.a__sum_values == np.array([691, 1017])))
        self.assertTrue(np.all(extracted_features.a__abs_energy == np.array([32211, 63167])))
        self.assertTrue(np.all(extracted_features.b__sum_values == np.array([757, 695])))
        self.assertTrue(np.all(extracted_features.b__minimum == np.array([3, 1])))
        self.assertTrue(np.all(extracted_features.b__abs_energy == np.array([36619, 35483])))
        self.assertTrue(np.all(extracted_features.b__mean == np.array([37.85, 34.75])))
        self.assertTrue(np.all(extracted_features.b__median == np.array([39.5, 28.0])))
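The assertions in the tests above pin down plausible behavior for ``partition`` and ``calculate_best_chunk_size``: chunks are lists of at most ``chunk_size`` items, and the chunk size works out to ``ceil(data_length / (5 * n_workers))``, i.e. roughly five chunks per worker. A standalone sketch inferred from those test values (not the actual tsfresh code; names are mirrored for illustration):

```python
import math
from itertools import islice


def partition(data, chunk_size):
    """Yield successive chunks of `data` as lists of length `chunk_size`
    (the last chunk may be shorter)."""
    iterable = iter(data)
    while True:
        chunk = list(islice(iterable, chunk_size))
        if not chunk:
            return
        yield chunk


def calculate_best_chunk_size(data_length, n_workers):
    """Heuristic chunk size inferred from the unit tests above:
    spread the data over roughly 5 chunks per worker."""
    return math.ceil(data_length / (5 * n_workers))


chunks = list(partition([1, 3, 10, -10, 343.0], 3))
# chunks == [[1, 3, 10], [-10, 343.0]], matching test_partion
```

Both functions reproduce every expected value in the test cases, e.g. ``calculate_best_chunk_size(101, 2) == 11`` and ``calculate_best_chunk_size(31, 3) == 3``.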

tests/units/utilities/test_string_manipilations.py

Lines changed: 1 addition & 1 deletion
@@ -22,4 +22,4 @@ def test_convert_to_output_format(self):
 
         out = convert_to_output_format({"list": ["a", "b", "c"]})
         expected_out = "list_['a', 'b', 'c']"
-        self.assertEqual(out, expected_out)
+        self.assertEqual(out, expected_out)
