[SPARK-2851] [mllib] DecisionTree Python consistency update #1798
Closed
Commits: 8 (the diff below shows changes from 6 commits)
eaf84c0 Added DecisionTree static train() methods API to match Python, but wi… (jkbradley)
c699850 a few doc comments (jkbradley)
e358661 DecisionTree API change: (jkbradley)
fe6dbfa removed unnecessary imports (jkbradley)
00f820e Merge remote-tracking branch 'upstream/master' into dt-python-consist… (jkbradley)
ee1d236 DecisionTree API updates: (jkbradley)
a0d7dbe DecisionTree: In Java-friendly train* methods, changed to use JavaRDD… (jkbradley)
6f7edf8 Merge remote-tracking branch 'upstream/master' into dt-python-consist… (jkbradley)
@@ -17,14 +17,16 @@

 package org.apache.spark.mllib.tree

+import scala.collection.JavaConverters._
+
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.Logging
 import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
 import org.apache.spark.mllib.tree.configuration.FeatureType._
 import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
-import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.impurity.{Impurities, Gini, Entropy, Impurity}
 import org.apache.spark.mllib.tree.model._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.util.random.XORShiftRandom
@@ -200,6 +202,10 @@ object DecisionTree extends Serializable with Logging {
    * Method to train a decision tree model.
    * The method supports binary and multiclass classification and regression.
    *
+   * Note: Using [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
+   *       and [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
+   *       is recommended to clearly separate classification and regression.
+   *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    *              For regression, labels are real numbers.
@@ -213,10 +219,12 @@ object DecisionTree extends Serializable with Logging {
   }

   /**
-   * Method to train a decision tree model where the instances are represented as an RDD of
-   * (label, features) pairs. The method supports binary classification and regression. For the
-   * binary classification, the label for each instance should either be 0 or 1 to denote the two
-   * classes.
+   * Method to train a decision tree model.
+   * The method supports binary and multiclass classification and regression.
+   *
+   * Note: Using [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
+   *       and [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
+   *       is recommended to clearly separate classification and regression.
    *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
@@ -237,10 +245,12 @@ object DecisionTree extends Serializable with Logging {
   }

   /**
-   * Method to train a decision tree model where the instances are represented as an RDD of
-   * (label, features) pairs. The method supports binary classification and regression. For the
-   * binary classification, the label for each instance should either be 0 or 1 to denote the two
-   * classes.
+   * Method to train a decision tree model.
+   * The method supports binary and multiclass classification and regression.
+   *
+   * Note: Using [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
+   *       and [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
+   *       is recommended to clearly separate classification and regression.
    *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
@@ -263,11 +273,12 @@ object DecisionTree extends Serializable with Logging {
   }

   /**
-   * Method to train a decision tree model where the instances are represented as an RDD of
-   * (label, features) pairs. The decision tree method supports binary classification and
-   * regression. For the binary classification, the label for each instance should either be 0 or
-   * 1 to denote the two classes. The method also supports categorical features inputs where the
-   * number of categories can specified using the categoricalFeaturesInfo option.
+   * Method to train a decision tree model.
+   * The method supports binary and multiclass classification and regression.
+   *
+   * Note: Using [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
+   *       and [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
+   *       is recommended to clearly separate classification and regression.
    *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
@@ -279,11 +290,9 @@ object DecisionTree extends Serializable with Logging {
    * @param numClassesForClassification number of classes for classification. Default value of 2.
    * @param maxBins maximum number of bins used for splitting features
    * @param quantileCalculationStrategy algorithm for calculating quantiles
-   * @param categoricalFeaturesInfo A map storing information about the categorical variables and
-   *                                the number of discrete values they take. For example,
-   *                                an entry (n -> k) implies the feature n is categorical with k
-   *                                categories 0, 1, 2, ... , k-1. It's important to note that
-   *                                features are zero-indexed.
+   * @param categoricalFeaturesInfo Map storing arity of categorical features.
+   *                                E.g., an entry (n -> k) indicates that feature n is categorical
+   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    * @return DecisionTreeModel that can be used for prediction
    */
   def train(
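To make the `categoricalFeaturesInfo` convention above concrete, here is a minimal, Spark-free sketch. The object `CategoricalInfoSketch` and the helper `invalidFeatures` are hypothetical names introduced only for illustration and are not part of MLlib. It assumes a feature vector is a plain `Array[Double]` and reports which categorical features fall outside {0, 1, ..., k-1}.

```scala
// Hypothetical helper, not part of MLlib: checks a feature vector against a
// categoricalFeaturesInfo-style map (feature index n -> number of categories k).
object CategoricalInfoSketch {

  /** Returns the indices of categorical features whose value is not in {0, ..., k-1}. */
  def invalidFeatures(
      features: Array[Double],
      categoricalFeaturesInfo: Map[Int, Int]): Seq[Int] = {
    categoricalFeaturesInfo.toSeq.sortBy(_._1).collect {
      // The map refers to a feature index that the vector does not have.
      case (n, _) if n < 0 || n >= features.length => n
      // The value is not a valid zero-based category index.
      case (n, k) if !isCategory(features(n), k) => n
    }
  }

  // A valid category is a whole number in the range [0, k).
  private def isCategory(value: Double, k: Int): Boolean =
    value >= 0 && value < k && value == math.floor(value)
}
```

For example, with `Map(1 -> 3)` feature 1 may take the values 0.0, 1.0, or 2.0; any feature index absent from the map is treated as continuous and is not checked.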
@@ -300,6 +309,93 @@ object DecisionTree extends Serializable with Logging {
     new DecisionTree(strategy).train(input)
   }

+  /**
+   * Method to train a decision tree model for binary or multiclass classification.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *              Labels should take values {0, 1, ..., numClasses-1}.
+   * @param numClassesForClassification number of classes for classification.
+   * @param categoricalFeaturesInfo Map storing arity of categorical features.
+   *                                E.g., an entry (n -> k) indicates that feature n is categorical
+   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
+   * @param impurity Criterion used for information gain calculation.
+   *                 Supported values: "gini" (recommended) or "entropy".
+   * @param maxDepth Maximum depth of the tree.
+   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
+   *                 (suggested value: 4)
+   * @param maxBins maximum number of bins used for splitting features
+   *                (suggested value: 100)
+   * @return DecisionTreeModel that can be used for prediction
+   */
+  def trainClassifier(
+      input: RDD[LabeledPoint],
+      numClassesForClassification: Int,
+      categoricalFeaturesInfo: Map[Int, Int],
+      impurity: String,
+      maxDepth: Int,
+      maxBins: Int): DecisionTreeModel = {
+    val impurityType = Impurities.fromString(impurity)
+    train(input, Classification, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
+      categoricalFeaturesInfo)
+  }
+
+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
+   */
+  def trainClassifier(
+      input: RDD[LabeledPoint],
+      numClassesForClassification: Int,
+      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
+      impurity: String,
+      maxDepth: Int,
+      maxBins: Int): DecisionTreeModel = {
+    trainClassifier(input, numClassesForClassification,
+      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
+      impurity, maxDepth, maxBins)
+  }
+
+  /**
+   * Method to train a decision tree model for regression.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *              Labels are real numbers.
+   * @param categoricalFeaturesInfo Map storing arity of categorical features.
+   *                                E.g., an entry (n -> k) indicates that feature n is categorical
+   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
+   * @param impurity Criterion used for information gain calculation.
+   *                 Supported values: "variance".
+   * @param maxDepth Maximum depth of the tree.
+   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
+   *                 (suggested value: 4)
+   * @param maxBins maximum number of bins used for splitting features
+   *                (suggested value: 100)
+   * @return DecisionTreeModel that can be used for prediction
+   */
+  def trainRegressor(
+      input: RDD[LabeledPoint],
+      categoricalFeaturesInfo: Map[Int, Int],
+      impurity: String,
+      maxDepth: Int,
+      maxBins: Int): DecisionTreeModel = {
+    val impurityType = Impurities.fromString(impurity)
+    train(input, Regression, impurityType, maxDepth, 0, maxBins, Sort, categoricalFeaturesInfo)
+  }
+
+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
+   */
+  def trainRegressor(
+      input: RDD[LabeledPoint],
Contributor: ditto:
+      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
+      impurity: String,
+      maxDepth: Int,
+      maxBins: Int): DecisionTreeModel = {
+    trainRegressor(input,
+      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
+      impurity, maxDepth, maxBins)
+  }
+
+
   private val InvalidBinIndex = -1

   /**
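The Java-friendly overloads above turn a `java.util.Map[java.lang.Integer, java.lang.Integer]` into a Scala `Map[Int, Int]` with a cast followed by `asScala.toMap`. The sketch below isolates that conversion without Spark. `JavaMapConversionSketch`, `viaCast`, and `viaUnboxing` are hypothetical names used only for illustration. The cast relies on type erasure and on Scala boxing `Int` as `java.lang.Integer` inside generic containers; the second method shows an explicit unboxing alternative for comparison. It uses `scala.collection.JavaConverters` to match the diff; on Scala 2.13 and later that import is deprecated in favor of `scala.jdk.CollectionConverters`.

```scala
import scala.collection.JavaConverters._

// Hypothetical sketch, not part of MLlib: two ways to convert a Java map of
// boxed integers into an immutable Scala Map[Int, Int].
object JavaMapConversionSketch {

  // Cast-based conversion, mirroring the expression used in the diff.
  // The cast is unchecked; it works because generics are erased at runtime
  // and Scala stores Int values in generic containers as java.lang.Integer.
  def viaCast(m: java.util.Map[java.lang.Integer, java.lang.Integer]): Map[Int, Int] =
    m.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap

  // Explicit alternative: unbox every key and value individually.
  def viaUnboxing(m: java.util.Map[java.lang.Integer, java.lang.Integer]): Map[Int, Int] =
    m.asScala.map { case (k, v) => (k.intValue, v.intValue) }.toMap
}
```

Both methods copy the entries into an immutable map, so later mutation of the Java map does not affect the result.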
@@ -1331,16 +1427,15 @@ object DecisionTree extends Serializable with Logging {
    *   Categorical features:
    *     For each feature, there is 1 bin per split.
    *     Splits and bins are handled in 2 ways:
-   *     (a) For multiclass classification with a low-arity feature
+   *     (a) "unordered features"
+   *         For multiclass classification with a low-arity feature
    *         (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
    *         the feature is split based on subsets of categories.
-   *         There are 2^(maxFeatureValue - 1) - 1 splits.
-   *     (b) For regression and binary classification,
+   *         There are math.pow(2, maxFeatureValue - 1) - 1 splits.
+   *     (b) "ordered features"
+   *         For regression and binary classification,
    *         and for multiclass classification with a high-arity feature,
-   *         there is one split per category.
-
-   * Categorical case (a) features are called unordered features.
-   * Other cases are called ordered features.
+   *         there is one bin per category.
    *
    * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    * @param strategy [[org.apache.spark.mllib.tree.configuration.Strategy]] instance containing
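The split count quoted in the doc comment for unordered features, `math.pow(2, maxFeatureValue - 1) - 1`, can be checked by brute force. The sketch below assumes `maxFeatureValue` denotes the number of categories (the arity) of the feature. `UnorderedSplitsSketch` and its methods are hypothetical names, not MLlib code. Each split is a partition of the categories into two non-empty groups; the group containing category 0 serves as the canonical representative so that a partition and its mirror image are counted once.

```scala
// Hypothetical sketch, not part of MLlib: counts and enumerates the binary
// splits of an unordered categorical feature with `arity` categories.
object UnorderedSplitsSketch {

  // Closed form from the doc comment: 2^(arity - 1) - 1.
  def numUnorderedSplits(arity: Int): Int = (1 << (arity - 1)) - 1

  // Brute-force enumeration: every proper subset that contains category 0.
  // Requiring category 0 avoids counting a subset and its complement twice;
  // excluding the full set avoids the trivial split with an empty side.
  def enumerateSplits(arity: Int): Seq[Set[Int]] = {
    val all = (0 until arity).toSet
    all.subsets()
      .filter(s => s.contains(0) && s != all)
      .toSeq
  }
}
```

The count doubles with each added category, which is consistent with the doc comment restricting this treatment to low-arity features.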
32 changes: 32 additions & 0 deletions
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurities.scala
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+/**
+ * Factory for Impurity.
+ */
+private[mllib] object Impurities {
+
+  def fromString(name: String): Impurity = name match {
+    case "gini" => Gini
+    case "entropy" => Entropy
+    case "variance" => Variance
+    case _ => throw new IllegalArgumentException(s"Did not recognize Impurity name: $name")
+  }
+
+}
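The behavior of the `Impurities.fromString` factory above can be tried in isolation. The sketch below is a stand-alone approximation: `ImpuritiesSketch` and its `Impurity`, `Gini`, `Entropy`, and `Variance` members are stand-ins declared here for illustration, not the real MLlib types. It shows two consequences of the pattern match as written: names are matched exactly (so `"Gini"` with a capital letter is rejected), and any unrecognized name raises an `IllegalArgumentException`.

```scala
import scala.util.Try

// Stand-alone approximation of the factory in the diff. The trait and case
// objects below are stand-ins, not the real MLlib impurity implementations.
object ImpuritiesSketch {
  sealed trait Impurity
  case object Gini extends Impurity
  case object Entropy extends Impurity
  case object Variance extends Impurity

  // Same shape as Impurities.fromString: exact, case-sensitive string match.
  def fromString(name: String): Impurity = name match {
    case "gini" => Gini
    case "entropy" => Entropy
    case "variance" => Variance
    case _ => throw new IllegalArgumentException(s"Did not recognize Impurity name: $name")
  }

  // Convenience wrapper for callers that prefer a value over an exception.
  def tryFromString(name: String): Try[Impurity] = Try(fromString(name))
}
```

A caller that accepts user input could normalize it first (for example with `name.toLowerCase`) before passing it to the factory; whether the real API should do so is a design decision this sketch does not settle.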
RDD->JavaRDD. Sorry I missed this in the first pass.