[SPARK-21972][ML] Add param handlePersistence #19186
Conversation
Test build #81623 has finished for PR 19186 at commit

Jenkins, retest this please

Test build #81626 has finished for PR 19186 at commit
smurching left a comment
Thanks for the PR!
I mainly had one comment that appeared across multiple files (we should still check that input data is unpersisted before attempting to cache it even when handlePersistence is true).
```diff
  }

- if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+ if ($(handlePersistence)) instances.persist(StorageLevel.MEMORY_AND_DISK)
```
If $(handlePersistence) is true, we should still check that dataset is uncached (i.e. check that dataset.storageLevel == StorageLevel.NONE) before caching instances, or else we'll run into the issues described in SPARK-21799
+1. I assumed it was up to users to check the storageLevel themselves to avoid double caching, but I now agree we should check it in the algorithms. It may also be better to log a warning if the dataset is already cached and handlePersistence is set to true.
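The guard the two reviewers converge on could look like the following. This is a hedged sketch, not the PR's actual diff: `instances`, `dataset`, and the warning wording are illustrative, and `$(handlePersistence)` assumes the param is defined on the estimator.

```scala
// Sketch of the agreed-upon guard (illustrative, not the PR's diff):
// cache only when the param asks for it AND the input is not already cached,
// warning on the redundant combination instead of double-caching (SPARK-21799).
if ($(handlePersistence)) {
  if (dataset.storageLevel == StorageLevel.NONE) {
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  } else {
    logWarning("Input dataset is already cached; skipping persist even though " +
      "handlePersistence is true.")
  }
}
```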
```diff
- // persist if underlying dataset is not persistent.
- val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
- if (handlePersistence) {
+ if ($(handlePersistence)) {
```
See comment above, we should also check that dataset.storageLevel == StorageLevel.NONE before caching newDataset
```diff
  ParamDesc[Int]("aggregationDepth", "suggested depth for treeAggregate (>= 2)", Some("2"),
-   isValid = "ParamValidators.gtEq(2)", isExpertParam = true))
+   isValid = "ParamValidators.gtEq(2)", isExpertParam = true),
+ ParamDesc[Boolean]("handlePersistence", "whether to handle data persistence", Some("true")))
```
This description could be a bit clearer, how about "if true, will cache unpersisted input data before fitting estimator on it"?
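With that wording, the shared trait that SharedParamsCodeGen would generate could look roughly like this. The exact generated doc comment and method layout are assumptions; only the param name, type, default, and suggested description come from the discussion above.

```scala
// Approximate shape of the generated shared param (assumed, not the PR's output):
private[ml] trait HasHandlePersistence extends Params {

  /**
   * Param for whether to cache unpersisted input data before fitting.
   * @group expertParam
   */
  final val handlePersistence: BooleanParam = new BooleanParam(this, "handlePersistence",
    "if true, will cache unpersisted input data before fitting estimator on it")

  setDefault(handlePersistence, true)

  /** @group getParam */
  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```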
```diff
  lr.setMaxIter(optimizer.getNumIterations())
  lr.setTol(optimizer.getConvergenceTol())
+ // Determine if we should cache the DF
+ lr.setHandlePersistence(input.getStorageLevel == StorageLevel.NONE)
```
handlePersistence should be specified by the user rather than inferred by the algorithm.
mllib.LoR does not expose handlePersistence to users now, and I think it may be better to keep it that way.
```diff
  }

- if (handlePersistence) {
+ if ($(handlePersistence)) {
```
See comment above, we should also check that dataset.storageLevel == StorageLevel.NONE
```diff
  val instances = extractWeightedLabeledPoints(dataset)
- val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
- if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+ if ($(handlePersistence)) instances.persist(StorageLevel.MEMORY_AND_DISK)
```
See comment above, we should also check that dataset.storageLevel == StorageLevel.NONE
Note: This PR follows up on the work/discussions in #17014

Test build #81650 has finished for PR 19186 at commit

Test build #81651 has finished for PR 19186 at commit

Test build #81704 has finished for PR 19186 at commit
@smurching @zhengruifeng
For estimators that inherit Predictor (such as LiR/LoR), checking if (dataset.storageLevel == StorageLevel.NONE) does not help.
In Predictor.fit, a casted Dataset is generated from the input Dataset, which clears the storageLevel seen downstream (i.e., dataset.storageLevel will always be NONE in train, even when the input dataset is cached).
I remember this issue was discussed in the JIRA & old PRs, but this PR does not resolve it.
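For context, here is a simplified sketch of why the check is ineffective for Predictor subclasses. This approximates the era's Predictor.fit rather than quoting the exact Spark source:

```scala
// Approximate shape of Predictor.fit (not the exact Spark source):
override def fit(dataset: Dataset[_]): M = {
  transformSchema(dataset.schema, logging = true)
  // Casting the label column builds a brand-new Dataset with a fresh logical
  // plan. The new Dataset's storageLevel is StorageLevel.NONE even when
  // `dataset` itself is cached, so a storageLevel check inside train() always
  // sees NONE and would re-cache data the user has already cached.
  val casted = dataset.withColumn($(labelCol), col($(labelCol)).cast(DoubleType))
  copyValues(train(casted).setParent(this))
}
```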
Oops, yeah I had forgotten about that (thanks for the catch).
One solution could be to extend HasHandlePersistence in Predictor and check handlePersistence / cache uncached data in Predictor.fit() instead of Predictor.train(). This has the drawback of limiting individual algorithms' (e.g. LiR/LoR) ability to customize their caching behavior, which might be a deal-breaker.
@jkbradley I'd be curious to hear your thoughts on this.
@WeichenXu123 Thanks a lot for pointing it out! I also forgot about this. I think the issue of handling persistence may be somewhat complex, and it is up to the author of the algorithms to decide whether, when, and where to persist. I prefer this solution: add methods
Force-pushed e40d3a1 to aa04d4b
Test build #81751 has finished for PR 19186 at commit

Test build #81752 has finished for PR 19186 at commit

Test build #81762 has finished for PR 19186 at commit

Test build #81763 has finished for PR 19186 at commit

retest this please

Test build #81771 has finished for PR 19186 at commit
@zhengruifeng Can you please update the PR description so it describes the actual functionality being added?

This has ended up being more complex than we envisioned. It would be valuable to describe the design succinctly so that people can debate it on JIRA. Could you please describe your solution on the JIRA and ping people who have been discussing this? (I also made a request on the JIRA which I'd like to see addressed.) Thanks!

@zhengruifeng You should write a succinct design doc and link it on the JIRA first. After we reach agreement on the design, this PR can move forward. Thanks!

Test build #82179 has finished for PR 19186 at commit

retest this please

Test build #82186 has finished for PR 19186 at commit

Test build #82200 has finished for PR 19186 at commit
What changes were proposed in this pull request?

1. Add param handlePersistence
2. Add methods preprocess and postprocess in Predictor

How was this patch tested?

Existing tests
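The preprocess/postprocess hooks named in the description could plausibly take the following shape. The method bodies here are illustrative guesses pieced together from the review discussion, not the PR's actual diff:

```scala
// Hypothetical shape of the Predictor hooks; bodies are guesses based on the
// discussion above, not the PR's code.
protected def preprocess(dataset: Dataset[_]): Dataset[_] = {
  // Cache only when asked to AND the input is not already cached.
  if ($(handlePersistence) && dataset.storageLevel == StorageLevel.NONE) {
    dataset.persist(StorageLevel.MEMORY_AND_DISK)
  }
  dataset
}

protected def postprocess(dataset: Dataset[_]): Unit = {
  // Release what preprocess() cached; a real implementation would need to
  // track whether *it* did the caching, to avoid unpersisting user-cached data.
  if ($(handlePersistence) && dataset.storageLevel != StorageLevel.NONE) {
    dataset.unpersist()
  }
}
```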