
Conversation

@weixuanfu commented May 2, 2017

What does this PR do?

  1. Change the backend of the max_eval_time_mins parameter, which controls how many minutes TPOT has to optimize a single pipeline. The new backend is based on the stopit module (see the first sketch after this list).

  2. Change the multiprocessing backend from joblib to dask (see the second sketch below).

  3. Fix an old issue on Windows: TPOT now allows Control+C to interrupt the optimization process when n_jobs != 1 on Windows.

  4. Add a new unit test for the timeout function.

  5. Add check_X_y to validate the input dataset format (see the third sketch below).

  6. Update documentation
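
For reviewers, here is a minimal sketch of how the stopit-based timeout works. evaluate_pipeline is a hypothetical stand-in for TPOT's internal evaluation routine, not its real name:

import stopit

def evaluate_with_timeout(pipeline, X, y, max_eval_time_mins):
    # Run the evaluation inside a ThreadingTimeout block; stopit raises an
    # asynchronous exception in the thread once the time limit elapses.
    with stopit.ThreadingTimeout(max_eval_time_mins * 60, swallow_exc=True) as ctx:
        score = evaluate_pipeline(pipeline, X, y)  # hypothetical helper
    if ctx.state != ctx.EXECUTED:
        return -float('inf')  # a timed-out pipeline counts as a failed evaluation
    return score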
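
And a sketch of the dask-based parallel evaluation that replaces joblib, under the same assumption about evaluate_pipeline. Recent dask releases take the scheduler= keyword shown here; older releases used get=dask.multiprocessing.get instead:

import dask

def evaluate_population(pipelines, X, y, n_jobs):
    # One delayed task per candidate pipeline; the process-based scheduler
    # fans the tasks out across n_jobs worker processes.
    tasks = [dask.delayed(evaluate_pipeline)(p, X, y) for p in pipelines]
    return dask.compute(*tasks, scheduler='processes', num_workers=n_jobs)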
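
The check_X_y validation itself is a one-liner from sklearn.utils:

from sklearn.utils import check_X_y

# Raises a clear ValueError for malformed input up front, instead of
# failing deep inside pipeline evaluation; also flattens y to 1d.
X, y = check_X_y(X, y, accept_sparse=True)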

Where should the reviewer start?

base.py

How should this PR be tested?

# coding: utf-8
from sklearn.datasets import make_classification
from tpot import TPOTClassifier
# Make a large dataset to reproduce the freezing behavior
X, y = make_classification(n_samples=50000, n_features=200,
                           n_informative=20, n_redundant=20,
                           n_classes=5, random_state=42)

# max_eval_time_mins=0.1 sets a 6-second limit for evaluating a single pipeline
tpot = TPOTClassifier(generations=5, population_size=50, offspring_size=100,
                      random_state=42, n_jobs=2, max_eval_time_mins=0.1,
                      verbosity=3)
tpot.fit(X, y)

Any background context you want to provide?

  • parallel processes freezing when matrices are too big
  • Joblib hangs without crashing

What are the relevant issues?

#436 #422



Questions:

  • Do the docs need to be updated? Yes; the docs are already updated in this PR.
  • Does this PR add new (Python) dependencies? Yes: stopit and dask.

@weixuanfu closed this May 2, 2017
@rhiever commented May 2, 2017

I think the best workaround (for now) is to make TPOT not use multiprocessing when n_jobs=1, and put warnings in the documentation that enabling multiprocessing (n_jobs!=1) may be slow and prone to freezing with very large datasets.

sklearn has this same problem, right? e.g. if you use cross_val_score with n_jobs!=1 for a very large dataset, it will also be slow and/or freeze.
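
For reference, the sklearn pattern in question is just (a minimal sketch; RandomForestClassifier is only an example estimator):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# With a very large X this can also be slow or hang outright, since the
# data must be shipped to each joblib worker process.
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5, n_jobs=2)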

@weixuanfu changed the title from "Disable memmaping of large arrays for freezing issue in multiprocessing" to "Issue about large arrays for freezing in multiprocessing" May 2, 2017
@coveralls

Coverage Status

Coverage decreased (-0.9%) to 85.983% when pulling a4956d4 on weixuanfu2016:joblib_timeout into 7bea1ee on rhiever:development.

@rhiever commented May 23, 2017

Looks like something broke with the rest of the merges. :-(

@weixuanfu (Author)

Conflicts fixed

@coveralls

Coverage Status

Coverage decreased (-0.9%) to 86.186% when pulling 211eed9 on weixuanfu2016:joblib_timeout into 305701c on rhiever:development.

@weixuanfu (Author) commented May 23, 2017

I also added the patches from the master branch (0.7.5) to this PR.

@coveralls

Coverage Status

Coverage decreased (-1.03%) to 86.1% when pulling d8e1904 on weixuanfu2016:joblib_timeout into 305701c on rhiever:development.

@coveralls

Coverage Status

Coverage decreased (-0.3%) to 86.8% when pulling af01d55 on weixuanfu2016:joblib_timeout into 305701c on rhiever:development.

@rhiever closed this Jun 14, 2017
@weixuanfu deleted the joblib_timeout branch June 15, 2017 17:08
@weixuanfu restored the joblib_timeout branch June 26, 2017 14:08
