[SPARK-3159][MLlib] Check for reducible DecisionTree #17503
Conversation
@jkbradley @hhbyyh Could you review the PR? Thanks.
@srowen Hi, could you review the PR? The change itself is simple, though much of the diff is added unit-test code. Thanks.
It looks reasonable, though I don't feel qualified to review it. I thought the nodes held more than just the majority class, like the empirical class distribution at the node? That would make them impossible to combine in general, but I don't see that in the code. However, they do carry impurity info. Is that going to be equal in enough cases to make the merge effective?
Test build #3675 has finished for PR 17503 at commit
* @param subsamplingRate Fraction of the training data used for learning decision tree.
* @param useNodeIdCache If this is true, instead of passing trees to executors, the algorithm will
*                       maintain a separate RDD of node Id cache for each row.
* @param canMergeChildren Merge pairs of leaf nodes of the same parent which output the same prediction.
A new parameter is added to the Strategy class, which fails the MiMa binary-compatibility check. How should we deal with it?
java.lang.RuntimeException: spark-mllib: Binary compatibility check failed!
[error] * synthetic method <init>$default$13()Int in object org.apache.spark.mllib.tree.configuration.Strategy has a different result type in current version, where it is Boolean rather than Int
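The usual way to acknowledge an intentional break like this is an exclusion entry in project/MimaExcludes.scala. A minimal sketch, assuming the filter targets the synthetic default-argument accessor named in the error (the exact placement in Spark's excludes list is not shown in this PR):

```scala
import com.typesafe.tools.mima.core._

// Hypothetical MimaExcludes entry: the new canMergeChildren parameter was
// inserted into Strategy's constructor, so the synthetic accessor for the
// 13th default argument now returns Boolean instead of Int.
ProblemFilters.exclude[IncompatibleResultTypeProblem](
  "org.apache.spark.mllib.tree.configuration.Strategy.<init>$default$13")
```

An alternative that avoids the exclusion entirely is to append the new parameter after all existing ones, so the earlier <init>$default$N accessors keep their positions and result types.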
@srowen I am not sure I understand your question correctly. RandomForest uses LearningNode to construct the tree model during training, and converts the nodes to Leaf or InternalNode at the end. Hence, all nodes are of the same type and can be merged during training. However, if two children of a node output the same prediction, does the parent's prediction necessarily match its children's? I don't know.
I was saying that I thought the nodes carried more info than just the majority-class prediction. If they did, they would be combinable much more rarely, because they would vary in more than just their prediction. They don't have a distribution over class labels or anything like that, but they do carry impurity info. Can you merge two nodes with different impurity but the same prediction? This could be a dumb question; I'm actually not sure whether impurity info is even used after training.
I have the same question. I guess the impurity info is useful for debugging and analyzing the tree model. However, since the tree is grown from root to leaves during training, it seems needless to merge a node's children while training.
I think the benefit of this would be speed at predict time, or model storage. @srowen the nodes don't have to be equal to be merged; they just have to output the same prediction. Since this is a param that can be turned on or off, I don't see a problem. That said, I'd be interested to know how much of an impact this makes. This is a semi-large change and probably not at the top of the list right now. Maybe @jkbradley can comment.
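To make the proposed merge concrete, here is a minimal sketch on a toy tree type. The ToyLeaf/ToyInternal names are simplified stand-ins for illustration, not Spark's actual ml.tree classes, and the sketch deliberately ignores impurity and split info since, per the discussion above, the merge compares only predictions:

```scala
// Toy tree ADT standing in for Spark's internal node classes; the real
// LeafNode/InternalNode carry extra fields (impurity, stats, split) that
// this prediction-only merge ignores.
sealed trait ToyNode
case class ToyLeaf(prediction: Double) extends ToyNode
case class ToyInternal(left: ToyNode, right: ToyNode) extends ToyNode

// Bottom-up pass: after both subtrees have been merged, collapse an internal
// node whose children are leaves with the same prediction into a single leaf.
def mergeChildren(node: ToyNode): ToyNode = node match {
  case leaf: ToyLeaf => leaf
  case ToyInternal(l, r) =>
    (mergeChildren(l), mergeChildren(r)) match {
      case (ToyLeaf(p1), ToyLeaf(p2)) if p1 == p2 => ToyLeaf(p1)
      case (ml, mr) => ToyInternal(ml, mr)
    }
}
```

Because the pass is bottom-up, a whole subtree whose leaves all predict the same class collapses into a single leaf, not just one sibling pair: mergeChildren(ToyInternal(ToyLeaf(1.0), ToyInternal(ToyLeaf(1.0), ToyLeaf(1.0)))) returns ToyLeaf(1.0).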
(gentle ping @jkbradley)
@jkbradley Do you have time to review the PR? I believe it will be a small improvement for prediction. Thanks.
Hi, @yanboliang. Do you have time to take a look first? Thanks very much.
Can you do some benchmarking to show how much improvement this change will bring?
Hi, @WeichenXu123. As @srowen said, the benefit of this would be speed at predict time or model storage. Hence I'm not sure whether a benchmark is really needed for the PR.
Can one of the admins verify this patch?
Closed since its duplicate PR #20632 has been merged.
Add canMergeChildren param: find the pairs of leaves of the same parent which output the same prediction, and merge them.
How was this patch tested?