[SPARK-3159][MLlib] Check for reducible DecisionTree #17503
Conversation
@jkbradley @hhbyyh Could you review the PR? Thanks.
@srowen Hi, could you review the PR? The change itself is simple, though much of the diff is added unit-test code. Thanks.
It looks reasonable, though I don't feel qualified to review it. I thought the nodes held more than just the majority class, like the empirical class distribution at the node? That would make them impossible to combine in general, but I don't see that in the code. However, they do carry impurity info. Is that going to be equal in enough cases to make the merge effective?
Test build #3675 has finished for PR 17503 at commit
* @param subsamplingRate Fraction of the training data used for learning decision tree.
* @param useNodeIdCache If this is true, instead of passing trees to executors, the algorithm will
*                       maintain a separate RDD of node Id cache for each row.
* @param canMergeChildren Merge pairs of leaf nodes of the same parent which output the same prediction.
A new parameter is added to the Strategy class, which fails the MiMa binary-compatibility check. How should we deal with it?
java.lang.RuntimeException: spark-mllib: Binary compatibility check failed!
[error] * synthetic method <init>$default$13()Int in object org.apache.spark.mllib.tree.configuration.Strategy has a different result type in current version, where it is Boolean rather than Int
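The usual way to acknowledge an intentional break like this is an exclusion entry in project/MimaExcludes.scala. A minimal sketch, assuming the filter targets the synthetic default-argument accessor named in the error (the exact placement in Spark's excludes list is not shown in this PR):

```scala
import com.typesafe.tools.mima.core._

// Hypothetical MimaExcludes entry: the new canMergeChildren parameter was
// inserted into Strategy's constructor, so the synthetic accessor for the
// 13th default argument now returns Boolean instead of Int.
ProblemFilters.exclude[IncompatibleResultTypeProblem](
  "org.apache.spark.mllib.tree.configuration.Strategy.<init>$default$13")
```

An alternative that avoids the exclusion entirely is to append the new parameter after all existing ones, so the earlier <init>$default$N accessors keep their positions and result types.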
@srowen I am not sure I understand your question correctly. RandomForest uses LearningNode to construct the tree model during training, and converts the nodes to Leaf or InternalNode at the end. Hence, all nodes are of the same type and can be merged during training. However, if two children of a node output the same prediction, does the parent's prediction necessarily match its children's? I don't know.
I was saying that I thought the nodes carried more info than just the majority-class prediction. If they did, they would be combinable much more rarely, because they would vary in more than just their prediction. They don't have a distribution over class labels or anything like that, but they do carry impurity info. Can you merge two nodes with different impurity but the same prediction? This could be a dumb question; I'm actually not sure whether impurity info is even used after training.
I have the same question. I guess the impurity info is useful for debugging and analyzing the tree model. However, since the tree is grown from root to leaves during training, it seems needless to merge a node's children while training.
I think the benefit of this would be speed at predict time, or model storage. @srowen the nodes don't have to be equal to be merged; they just have to output the same prediction. Since this is a param that can be turned on or off, I don't see a problem. That said, I'd be interested to know how much of an impact this makes. This is a semi-large change and probably not at the top of the list right now. Maybe @jkbradley can comment.
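To make the proposed merge concrete, here is a minimal sketch on a toy tree type. The ToyLeaf/ToyInternal names are simplified stand-ins for illustration, not Spark's actual ml.tree classes, and the sketch deliberately ignores impurity and split info since, per the discussion above, the merge compares only predictions:

```scala
// Toy tree ADT standing in for Spark's internal node classes; the real
// LeafNode/InternalNode carry extra fields (impurity, stats, split) that
// this prediction-only merge ignores.
sealed trait ToyNode
case class ToyLeaf(prediction: Double) extends ToyNode
case class ToyInternal(left: ToyNode, right: ToyNode) extends ToyNode

// Bottom-up pass: after both subtrees have been merged, collapse an internal
// node whose children are leaves with the same prediction into a single leaf.
def mergeChildren(node: ToyNode): ToyNode = node match {
  case leaf: ToyLeaf => leaf
  case ToyInternal(l, r) =>
    (mergeChildren(l), mergeChildren(r)) match {
      case (ToyLeaf(p1), ToyLeaf(p2)) if p1 == p2 => ToyLeaf(p1)
      case (ml, mr) => ToyInternal(ml, mr)
    }
}
```

Because the pass is bottom-up, a whole subtree whose leaves all predict the same class collapses into a single leaf, not just one sibling pair: mergeChildren(ToyInternal(ToyLeaf(1.0), ToyInternal(ToyLeaf(1.0), ToyLeaf(1.0)))) returns ToyLeaf(1.0).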
(gentle ping @jkbradley)
@jkbradley Do you have time to review the PR? I believe it will be a small improvement for prediction. Thanks.
Hi, @yanboliang. Do you have time to take a look first? Thanks very much.
Can you do some benchmarking to show how much improvement this change will bring?
Hi, @WeichenXu123. As @srowen said, the benefit of this would be speed at predict time or model storage. Hence I'm not sure whether a benchmark is really needed for the PR.
Can one of the admins verify this patch?
Closed since its duplicate PR #20632 has been merged.
Add canMergeChildren param: find the pairs of leaves of the same parent which output the same prediction, and merge them.
How was this patch tested?