Closed
Changes from 1 commit
12 changes: 11 additions & 1 deletion python/pyspark/ml/feature.py
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
It is possible that the number of buckets used will be less than this value, for example, if
there are too few distinct values of the input to create enough distinct quantiles.

NaN handling: Note also that
QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
can also choose to either keep or remove NaN values within the dataset by setting
`handleInvalid`. If the user chooses to keep NaN values, they will be handled specially and
Contributor

Could we link this with a :py:attr: role, like we did with numBuckets?

Author

yeah, sure. Thanks for pointing that out... ;)

placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be
put into buckets[0-3], but NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
:py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
The precision of the approximation can be controlled with the
:py:attr:`relativeError` parameter.
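
The bucket-count and NaN behaviour described in the docstring can be sketched in plain Python. This is only an illustration, not Spark's implementation: it uses exact quantiles in place of the approximate algorithm, and the helper names `quantile_splits` and `assign_bucket` are hypothetical. The `handle_invalid` values "error", "skip" and "keep" are assumed to correspond to the raise, remove and keep choices the docstring mentions for `handleInvalid`.

```python
import bisect
import math


def quantile_splits(values, num_buckets):
    """Return sorted split points bounding at most `num_buckets` buckets.

    Exact quantiles stand in for the approximate algorithm here.
    """
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        raise ValueError("no non-NaN values to compute quantiles from")
    inner = []
    for i in range(1, num_buckets):
        # Pick the value sitting at the i/num_buckets position.
        idx = min(len(clean) - 1, (i * len(clean)) // num_buckets)
        inner.append(clean[idx])
    # Repeated quantile values collapse, which is why the result can
    # describe fewer buckets than requested.
    distinct = sorted(set(inner))
    return [-math.inf] + distinct + [math.inf]


def assign_bucket(value, splits, handle_invalid="error"):
    """Map a value to a bucket index; NaN is treated per `handle_invalid`."""
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "error":
            raise ValueError("NaN value found")
        if handle_invalid == "skip":
            return None  # the caller drops this row
        if handle_invalid == "keep":
            return n_buckets  # extra bucket after the regular ones
        raise ValueError("unknown handle_invalid: %r" % (handle_invalid,))
    # Bucket i covers [splits[i], splits[i + 1]); clamp so that +inf
    # still lands in the last regular bucket.
    return min(bisect.bisect_right(splits, value) - 1, n_buckets - 1)


splits = quantile_splits([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4)
regular = [assign_bucket(v, splits) for v in (1.0, 4.0, 8.0)]
nan_bucket = assign_bucket(float("nan"), splits, handle_invalid="keep")

# Too few distinct values: fewer buckets than the 4 requested.
few_splits = quantile_splits([1.0, 1.0, 1.0, 2.0], 4)
```

With four buckets requested, the non-NaN values fall into buckets 0 to 3 and a kept NaN goes to bucket 4, matching the example in the docstring. The second call shows the case where duplicate quantiles leave only three buckets.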