Closed
Changes from 1 commit
25 commits
1dc4579
SPARK-9654 Add string indexer inverse in PySpark
holdenk Aug 5, 2015
0445fcc
doc fix
holdenk Aug 5, 2015
af2f869
Don't changge the base class init, fill out the doctest for the invert.
holdenk Aug 6, 2015
510bce5
remove extra blank line
holdenk Aug 6, 2015
c6da160
get rid of unicude specificers in doctest
holdenk Aug 6, 2015
9f5af3a
Deal with the difference between 2.X and 3.X with the output by just …
holdenk Aug 6, 2015
7b3b5ca
Use the standard constructor method for the StringIndexInverse
holdenk Aug 12, 2015
244e083
Update for index to string changeover
holdenk Aug 14, 2015
e95b61b
Move the property on to the model, remove references to old class name
holdenk Aug 14, 2015
b1795aa
CR feedback
holdenk Aug 18, 2015
ab90dcd
switch link to pydoc style
holdenk Aug 18, 2015
43ae197
Merge in master
holdenk Aug 18, 2015
c400e16
remove getLabels function (CR feedback) now that labels is public.
holdenk Aug 18, 2015
64de5c9
Some CR feedback
holdenk Aug 28, 2015
2316a90
Use None instead of empty array
holdenk Aug 28, 2015
15390bb
merge in master
holdenk Sep 1, 2015
28afcfd
Some CR feedback (note: still sorting our one of the params)
holdenk Sep 1, 2015
f19445d
Change description text
holdenk Sep 1, 2015
51ae7ee
merge in master
holdenk Sep 1, 2015
ed0ca91
moar merge
holdenk Sep 1, 2015
8fca8b3
punctuation
holdenk Sep 1, 2015
3ef852f
remove unrelated change
holdenk Sep 1, 2015
41d0d27
long line fix
holdenk Sep 1, 2015
cd5d418
Add missing period
holdenk Sep 9, 2015
4f56b17
Fix link to transformer class, copy scala doc for labels
holdenk Sep 9, 2015
Use the standard constructor method for the StringIndexInverse
holdenk committed Aug 14, 2015
commit 7b3b5ca2c5c4acfa61d444bf2d6c3867e8dfef95
@@ -122,6 +122,11 @@ class StringIndexerModel (
     map
   }

+  /**
+   * The labels used for applying this transformation
+   */
+  private[spark] def getLabels() = labels
Review comment (Member): no longer needed since "labels" is a public val

   /** @group setParam */
   def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
   setDefault(handleInvalid, "error")
49 changes: 44 additions & 5 deletions python/pyspark/ml/feature.py
@@ -731,7 +731,8 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
     >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
     ...     key=lambda x: x[0])
     [(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
-    >>> itd = model.invert("indexed", "label2").transform(td)
+    >>> inverter = model.invert("indexed", "label2")
+    >>> itd = inverter.transform(td)
     >>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]),
     ...     key=lambda x: x[0])
     [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
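For readers following the doctest, the index-then-invert round trip can be sketched in plain Python without Spark. This is a minimal sketch, not the PySpark implementation: `fit_labels`, `index_values`, and `invert_indices` are hypothetical helpers, the input list is read off the doctest output above, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct labels by descending frequency (most frequent gets index 0)."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def index_values(values, labels):
    """Map each label to its position in `labels`, as a float like the doctest output."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def invert_indices(indices, labels):
    """Map each index back to the label stored at that position."""
    return [labels[int(i)] for i in indices]


# Labels for ids 0..5, taken from the doctest output.
raw = ["a", "b", "c", "a", "a", "c"]
labels = fit_labels(raw)
indexed = index_values(raw, labels)
restored = invert_indices(indexed, labels)
```

Because the inverse only needs the ordered label list, the Python wrapper in this commit can build the inverse transformer from `labels` alone rather than wrapping an existing Java object.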
@@ -771,22 +772,60 @@ def invert(self, inputCol, outputCol):
         Note: By default we keep the original columns during this transformation, so the inverse
         should only be used on new columns such as predicted labels.
         """
-        return StringIndexerInverse(self._java_obj.invert(inputCol, outputCol))
+        labels = self._java_obj.getLabels()
+        return StringIndexerInverse(inputCol=inputCol, outputCol=outputCol,
+                                    labels=labels)


-class StringIndexerInverse(JavaTransformer):
+class StringIndexerInverse(JavaTransformer, HasInputCol, HasOutputCol):
     """
     Transform a provided column back to the original input types using the metadata on
     the input column.
     Note: By default we keep the original columns during StringIndexerModel's transformation,
     so the inverse should only be used on new columns such as predicted labels.
     """
+    # a placeholder to make the labels show up in generated doc
Review comment (Member): insert newline above
+    labels = Param(Params._dummy(), "lables",
Review comment (Member): typo: "lables"
+                   "Optional labels to be provided by the user, if not supplied column " +
Review comment (Contributor): nit: "if not supplied" -> "if equal to the empty array then"
Reply (Contributor Author): That makes less sense; if it isn't supplied, it uses the column metadata.
+                   "metadata is read for labels. The default value is an empty array, " +
+                   "but the empty array is ignored and column metadata used instead.")
Review comment (Contributor): After the above nit, this becomes redundant IMO. Since this is a matter of taste, feel free to keep or cut.
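The behaviour the Param description is trying to convey (user-supplied labels win, otherwise the column metadata is used) can be sketched as below. This is only an illustration of the described fallback, not code from the PR: `resolve_labels` and the `"labels"` metadata key are hypothetical names.

```python
def resolve_labels(user_labels, column_metadata):
    """Return the user's labels if any were given, else the labels in column metadata."""
    # Both None and an empty list are treated as "not supplied".
    if user_labels:
        return list(user_labels)
    # Assumption: the metadata is a dict that stores the ordered labels under "labels".
    return list(column_metadata["labels"])


meta = {"labels": ["a", "c", "b"]}
from_metadata = resolve_labels(None, meta)
from_empty = resolve_labels([], meta)
from_user = resolve_labels(["x", "y"], meta)
```

Treating `None` and `[]` the same way would make the two wordings debated in this thread equivalent from the caller's point of view.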


-    def __init__(self, java_obj):
+    @keyword_only
+    def __init__(self, inputCol=None, outputCol=None, labels=[]):
Review comment (Contributor): We should avoid using mutable values such as [] as defaults in Python; let's use None.
Reply (Contributor Author): My concern is that the underlying Scala code uses an empty array as the default.
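The concern about `labels=[]` is the general Python pitfall that a default value is evaluated once, at function definition time. A minimal standalone sketch (the function names are made up for illustration and are unrelated to the PySpark code):

```python
def append_shared(item, bucket=[]):
    # The default list is created once and reused by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # A None sentinel gives each call its own new list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared("a")
second = append_shared("b")  # mutates and returns the same list object as `first`

fresh_a = append_fresh("a")
fresh_b = append_fresh("b")
```

Using `None` on the Python side does not have to conflict with the Scala default: the wrapper could simply skip setting the param when the value is `None`, which is what the later "Use None instead of empty array" commit title suggests.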

"""
Initialize this instace of the StringIndexerInverse using the provided java_obj.
"""
-        self._java_obj = java_obj
+        super(StringIndexerInverse, self).__init__()
+        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StringIndexerInverse",
+                                            self.uid)
+        self.labels = Param(self, "labels",
+                            "Optional labels to be provided by the user, if not supplied column " +
Review comment (Contributor): Same comment as L957
+                            "metadata is read for labels. The default value is an empty array, " +
+                            "but the empty array is ignored and column metadata used instead.")
+        kwargs = self.__init__._input_kwargs
+        self.setParams(**kwargs)
+
+    @keyword_only
+    def setParams(self, inputCol=None, outputCol=None, labels=[]):
Review comment (Contributor): Same here: use None rather than [].
Reply (Contributor Author): My concern is that the underlying Scala code uses an empty array as the default.
+        """
+        setParams(self, inputCol="input", outputCol="output", labels=[])
Review comment (Member): correct col defaults: None
+        Sets params for this StringIndexerInverse
+        """
+        kwargs = self.setParams._input_kwargs
+        return self._set(**kwargs)
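The `_input_kwargs` attribute read in `__init__` and `setParams` comes from the `@keyword_only` decorator. The sketch below shows one way such a decorator can work; it is a simplified stand-in written for this note, and the `Inverse` class and its `params` dict are hypothetical, not PySpark classes.

```python
import functools


def keyword_only(func):
    """Reject positional arguments and remember the keyword arguments that were passed."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        # Only the arguments the caller actually supplied are recorded.
        wrapper._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper


class Inverse(object):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, labels=None):
        self.params = {}
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, labels=None):
        kwargs = self.setParams._input_kwargs
        self.params.update(kwargs)
        return self


inv = Inverse(inputCol="indexed", outputCol="label2")
```

Since only explicitly passed arguments end up in `_input_kwargs`, a param such as `labels` stays unset unless the caller provides it. Note that this sketch stores the kwargs on the shared wrapper function, so it is not safe for concurrent use.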

+    def setLabels(self, value):
+        """
+        Specify the labels to be used.
Review comment (Contributor): Sets the value of :py:attr:`labels`. Sphinx will produce a link for this param.
+        """
+        self._paramMap[self.labels] = value
+        return self
+
+    def getLabels(self):
+        """
+        Get the labels.
Review comment (Contributor): Gets the value of labels or its default value.
+        """
+        return self.getOrDefault(self.labels)


 @inherit_doc
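The `setLabels` / `getLabels` pair above relies on the usual "explicit value, else default" lookup. A minimal sketch of that lookup, assuming a hypothetical `ParamStore` class with two plain dicts (this is not the PySpark `Params` class, only an illustration of the idea):

```python
class ParamStore(object):
    def __init__(self):
        self._paramMap = {}         # values set explicitly by the user
        self._defaultParamMap = {}  # fallback values

    def setDefault(self, name, value):
        self._defaultParamMap[name] = value
        return self

    def set(self, name, value):
        self._paramMap[name] = value
        return self

    def getOrDefault(self, name):
        # An explicitly set value takes precedence over the default.
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


store = ParamStore().setDefault("labels", None)
before = store.getOrDefault("labels")
after = store.set("labels", ["a", "c", "b"]).getOrDefault("labels")
```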