[SPARK-16431] [ML] Add a unified method that accepts single instances to feature transformers and predictors #14101
Summary
Add a new transformation method `transformInstance` that operates on single instances. This method can reduce the latency of predictions by 200x for typical ML tasks, which facilitates serving models in production. See the JIRA ticket for details.
What changes were proposed in this pull request?
Current feature transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer has a user-defined function (udf) in its `transform` method that performs a set of operations on the features of a single instance.
Adding a new method called `transformInstance` that operates directly on single instances, and calling it from the udf instead, can be useful.
Predictors also don't have a public method that does predictions on single instances. `transformInstance` can be easily added to predictors by acting as a wrapper for the internal method `predict` (which takes features as input).
The proposed method in this change is added to all predictors and feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might require bigger changes due to dependencies on the dataset's schema (they can be fixed using simple hacks, but this needs to be discussed).
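The pattern described above could look roughly like the following. This is a minimal sketch, not the code from this patch: `AbsValueScaler` and `ThresholdModel` are hypothetical classes invented for illustration, and they do not extend the real spark.ml `Transformer` or `PredictionModel` base classes. The sketch only shows per-instance logic living in `transformInstance`, with the DataFrame `transform` path calling it from a udf, and a predictor exposing its internal `predict` through the same method name.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical transformer for illustration; not a class from spark.ml.
class AbsValueScaler(inputCol: String, outputCol: String) extends Serializable {

  // Per-instance logic, callable without building a DataFrame.
  def transformInstance(features: Vector): Vector =
    Vectors.dense(features.toArray.map(x => math.abs(x)))

  // The DataFrame path reuses the same logic through a udf.
  def transform(dataset: DataFrame): DataFrame = {
    val perRow = udf { v: Vector => transformInstance(v) }
    dataset.withColumn(outputCol, perRow(col(inputCol)))
  }
}

// Hypothetical predictor: a public wrapper around a protected predict.
class ThresholdModel(threshold: Double) extends Serializable {
  protected def predict(features: Vector): Double =
    if (features.toArray.sum > threshold) 1.0 else 0.0

  def transformInstance(features: Vector): Double = predict(features)
}

object TransformInstanceSketch {
  def main(args: Array[String]): Unit = {
    val scaler = new AbsValueScaler("features", "scaled")
    val model = new ThresholdModel(threshold = 2.5)

    // Single-instance path: no SparkSession or DataFrame involved.
    val scaled = scaler.transformInstance(Vectors.dense(-1.0, 2.0))
    println(model.transformInstance(scaled))

    // Batch path over a DataFrame, using the same per-instance method.
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    val df = spark.createDataFrame(Seq((0, Vectors.dense(-1.0, 2.0)))).toDF("id", "features")
    scaler.transform(df).show(false)
    spark.stop()
  }
}
```

In this sketch the single-instance call never touches Spark SQL, which is where the latency saving for serving would come from; the batch path stays unchanged from the caller's point of view.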
How was this patch tested?
The current tests for transformers and predictors, which invoke `transformInstance` internally, passed.