Conversation

facaiy (Contributor) commented Apr 1, 2017

Add a canMergeChildren param: find pairs of leaves of the same parent which output the same prediction, and merge them.

How was this patch tested?

  1. Add a unit test to verify that the implementation is correct.
  2. Add a unit test to verify that setCanMergeChildren works.
  3. Perhaps we also need to create a sample that produces a reducible tree, and test it.
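The merge described above can be sketched as follows (illustrative Python, not the Spark implementation; the `Leaf`/`Internal` names are assumptions for the sketch):

```python
# Prune a binary decision tree bottom-up by merging sibling leaves
# that output the same prediction. Illustrative only, not Spark code.
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    prediction: float

@dataclass
class Internal:
    left: "Node"
    right: "Node"

Node = Union[Leaf, Internal]

def merge_children(node: Node) -> Node:
    """Return a tree where any internal node whose two children are
    leaves with equal predictions is replaced by a single leaf."""
    if isinstance(node, Leaf):
        return node
    left = merge_children(node.left)
    right = merge_children(node.right)
    if (isinstance(left, Leaf) and isinstance(right, Leaf)
            and left.prediction == right.prediction):
        return Leaf(left.prediction)  # the split is redundant: collapse it
    return Internal(left, right)
```

Because the recursion works bottom-up, a parent whose children both collapse into equal leaves is collapsed as well.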

facaiy (Contributor, Author) commented Apr 5, 2017

@jkbradley @hhbyyh Could you review the PR? Thanks.

facaiy (Contributor, Author) commented Apr 24, 2017

@srowen Hi, could you review the PR? The change itself is simple, though it adds a fair amount of unit-test code. Thanks.

srowen (Member) commented Apr 24, 2017

It looks reasonable, though I don't feel qualified to review it. I thought the nodes carried more than just the majority class, like the empirical distribution at the node, which would make them impossible to combine in general; but I don't see that. However, they do carry impurity info. Is that going to be equal in enough cases to make the merge effective?

SparkQA commented Apr 24, 2017

Test build #3675 has finished for PR 17503 at commit a8351f8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param subsamplingRate Fraction of the training data used for learning decision tree.
* @param useNodeIdCache If this is true, instead of passing trees to executors, the algorithm will
* maintain a separate RDD of node Id cache for each row.
* @param canMergeChildren Merge pairs of leaf nodes of the same parent which output the same prediction.
facaiy (Contributor, Author) commented on the diff, Apr 26, 2017

A new parameter was added to the Strategy class, which makes the MiMa binary-compatibility check fail. How should this be handled?

java.lang.RuntimeException: spark-mllib: Binary compatibility check failed!

[error]  * synthetic method <init>$default$13()Int in object org.apache.spark.mllib.tree.configuration.Strategy has a different result type in current version, where it is Boolean rather than Int

see failed logs

facaiy (Contributor, Author) commented Apr 26, 2017

@srowen I am not sure I understand your question correctly. RandomForest builds the tree model from LearningNode instances during training and converts them to LeafNode or InternalNode at the end. Hence all nodes are of the same type during training and can be merged then.

However, if both children of a node output the same prediction, does the parent keep step with its children, i.e., collapse into a leaf as well? I don't know.
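A toy example of the question above (illustrative Python, not Spark code): under one natural design, assumed here, a bottom-up merge does propagate upward, because a parent whose merged children end up as equal leaves collapses too.

```python
def collapse(tree):
    """Bottom-up sibling-leaf merge on a toy tree encoding:
    a float is a leaf (its prediction); a 2-tuple is an internal
    node (left subtree, right subtree). Illustrative only."""
    if not isinstance(tree, tuple):
        return tree  # already a leaf
    left, right = collapse(tree[0]), collapse(tree[1])
    if (not isinstance(left, tuple) and not isinstance(right, tuple)
            and left == right):
        return left  # both children are equal leaves: parent becomes a leaf
    return (left, right)
```

Here `collapse(((1.0, 1.0), 1.0))` first merges the two inner leaves, which makes the left child a leaf equal to the right child, so the root collapses as well.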

srowen (Member) commented Apr 26, 2017

I was saying that I thought the nodes carried more information than just the majority-class prediction. If they did, they would be combinable much more rarely, because they would differ in more than just their prediction. They don't hold a distribution over class labels or anything like that, but they do carry impurity info. Can you merge two nodes with different impurity but the same prediction? This could be a dumb question; I'm actually not sure whether impurity info is even used after training.

facaiy (Contributor, Author) commented Apr 27, 2017

I have the same question. I guess impurity info is useful for debugging and analyzing a tree model. However, since a tree is grown from root to leaf during training, it seems unnecessary to merge children until training has finished.

sethah (Contributor) commented Apr 28, 2017

I think the benefit of this would be speed at predict time or smaller model storage. @srowen the nodes don't have to be equal to be merged; they just have to output the same prediction. Since this is a param that can be turned on or off, I don't see a problem.

That said, I'd be interested to know how much of an impact this makes. This is a semi-large change and probably not at the top of the list right now. Maybe @jkbradley can comment.

HyukjinKwon (Member) commented

(gentle ping @jkbradley)

facaiy (Contributor, Author) commented Jul 4, 2017

@jkbradley Do you have time to review the PR? I believe it will be a small improvement for prediction. Thanks.

facaiy (Contributor, Author) commented Aug 26, 2017

Hi, @yanboliang. Do you have time to take a look first? Thanks very much.

WeichenXu123 (Contributor) commented

Can you run some benchmarks to show how much improvement this change brings?

facaiy (Contributor, Author) commented Sep 27, 2017

Hi, @WeichenXu123.

As @srowen said, the benefit of this would be speed at predict time or smaller model storage, so I'm not sure a benchmark is really needed for this PR.

AmplabJenkins commented

Can one of the admins verify this patch?

facaiy (Contributor, Author) commented Mar 3, 2018

Closed since its duplicate PR #20632 has been merged.

@facaiy facaiy closed this Mar 3, 2018
@facaiy facaiy deleted the CLN/check_for_reducible_decision_tree branch March 3, 2018 23:42