@pronojitsaha

Addresses #99

@pronojitsaha pronojitsaha changed the title from "Addresses #99" to "Change StratifiedShuffleSplit to train_test_split" on Mar 12, 2016
@rhiever

rhiever commented Mar 12, 2016

Hmmm... it's saying that no changes were made. Did you update from master and overwrite your changes?

@pronojitsaha
Author

I was in a hurry and forgot to commit! It should be good now.

@rhiever

rhiever commented Mar 12, 2016

Thanks! Did you verify that it returns the same splits? If not, we'll have to verify that before merging.

@rhiever

rhiever commented Mar 12, 2016

Just checked and, by default, train_test_split doesn't stratify the data by class. You have to pass the stratify option and a list of the class labels, e.g.,

X_train, X_test, y_train, y_test = train_test_split(input_data.drop('class', axis=1).values, 
                                                    input_data['class'].values,
                                                    train_size=0.75, test_size=0.25,
                                                    random_state=RANDOM_STATE,
                                                    stratify=input_data['class'].values)

Please make that change and let's see if that passes on Travis-CI. If it does, we'll merge away!
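
To illustrate the point about stratification, here is a minimal sketch on made-up data. It assumes NumPy and a scikit-learn version that provides `sklearn.model_selection`; the toy arrays and the seed value are hypothetical and are not taken from this PR.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 80 + [1] * 20)

# Default call: rows are shuffled, but class proportions are not enforced.
_, _, _, y_test_plain = train_test_split(
    features, labels, train_size=0.75, test_size=0.25, random_state=42)

# Stratified call: both splits keep the 80/20 class ratio of `labels`.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.75, test_size=0.25,
    random_state=42, stratify=labels)

plain_counts = np.bincount(y_test_plain, minlength=2)
train_counts = np.bincount(y_train, minlength=2)
test_counts = np.bincount(y_test, minlength=2)

print("unstratified test counts:", plain_counts)
print("stratified train counts: ", train_counts)
print("stratified test counts:  ", test_counts)
```

With `stratify` set, the per-class counts in each split follow the overall class ratio; without it, they depend on the shuffle alone.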

@pronojitsaha
Author

OK, sure.

@pronojitsaha
Author

I have checked the splits. The split ratio is the same for both train_test_split and StratifiedShuffleSplit, but the split indices are somewhat different, which is expected.

@rhiever

rhiever commented Mar 13, 2016

If you set the random_state to the same thing in your tests, they should come out the same. That's what I verified on my end yesterday.
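
One way to run that check is sketched below. It is an assumption-laden sketch, not the code from this PR: it uses the newer `sklearn.model_selection` API (where `StratifiedShuffleSplit` takes `n_splits` and a `.split()` call, rather than the `n_iter` constructor form quoted in this thread), and the label array and `RANDOM_STATE` value are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

RANDOM_STATE = 42  # hypothetical seed; any fixed integer serves the purpose

# Hypothetical class labels and the row positions they belong to.
labels = np.array([0] * 80 + [1] * 20)
row_ids = np.arange(len(labels))

# Split 1: a single stratified shuffle split, requested directly.
splitter = StratifiedShuffleSplit(
    n_splits=1, train_size=0.75, test_size=0.25, random_state=RANDOM_STATE)
sss_train, sss_test = next(splitter.split(row_ids, labels))

# Split 2: train_test_split with the same sizes, seed, and stratify labels.
tts_train, tts_test = train_test_split(
    row_ids, train_size=0.75, test_size=0.25,
    random_state=RANDOM_STATE, stratify=labels)

# Compare membership of each split, ignoring the order of the rows.
same_split = (np.array_equal(np.sort(sss_train), np.sort(tts_train))
              and np.array_equal(np.sort(sss_test), np.sort(tts_test)))
print("identical splits:", same_split)
```

If the seeds differ, or one call omits `stratify`, the two sets of indices should not be expected to agree.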


- training_indices, testing_indices = next(iter(StratifiedShuffleSplit(training_testing_data['class'].values,
- n_iter=1,
+ training_indices, testing_indices = train_test_split(training_testing_data.index,

This doesn't look right. The call needs to look something like:

(training_features, testing_features,
training_labels, testing_labels) = train_test_split(input_data.drop('class', axis=1).values, 
                                                    input_data['class'].values,
                                                    train_size=0.75, test_size=0.25,
                                                    random_state=RANDOM_STATE,
                                                    stratify=input_data['class'].values)

Have you tested this?

Author

Yes, I have tested it on the Iris and MNIST datasets, and it works the same. We could also do it the way you pointed out, but using training_testing_data.index to get training_indices and testing_indices is in line with the rest of our code.
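
The index-based approach described above can be sketched as follows. This is an illustration only, not the exact call from this PR: `row_index`, `class_labels`, and `feature_matrix` are hypothetical stand-ins for `training_testing_data.index`, the `class` column, and the feature columns, and the sizes and seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the DataFrame index, labels, and features.
row_index = np.arange(100)
class_labels = np.array([0] * 80 + [1] * 20)
feature_matrix = np.arange(300).reshape(100, 3)

# Split the row index itself; stratify still uses the class labels.
training_indices, testing_indices = train_test_split(
    row_index, stratify=class_labels,
    train_size=0.75, test_size=0.25, random_state=42)

# The indices can then select rows from any array aligned with the index.
training_features = feature_matrix[training_indices]
testing_features = feature_matrix[testing_indices]
training_labels = class_labels[training_indices]
testing_labels = class_labels[testing_indices]

print(training_features.shape, testing_features.shape)
```

Splitting the index keeps one pair of index arrays that downstream code can reuse, whereas splitting the features and labels directly returns four arrays in one call.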

@pronojitsaha
Author

Ok, got that. Thanks.

@rhiever

rhiever commented Mar 13, 2016

Understood. Alright, looks good to merge. Thanks again! :-)

rhiever pushed a commit that referenced this pull request Mar 13, 2016
Change StratifiedShuffleSplit to train_test_split
@rhiever rhiever merged commit 73f5b38 into EpistasisLab:master Mar 13, 2016