
@hazimehh hazimehh commented Jul 8, 2016

Summary

Add a new transformation method, transformInstance, that operates on single instances. This method can reduce prediction latency by 200x for typical ML tasks, which makes it easier to serve models in production. See the JIRA ticket for details.

What changes were proposed in this pull request?

Current feature transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer has a User-Defined Function (udf) in its transform method which includes a set of operations on the features of a single instance:

val column_operation = udf {operations on single instance}

It would be useful to add a new method, transformInstance, that operates directly on a single instance, and to call it from the udf instead:

def transformInstance(features: FeaturesType): OutputType = {operations on single instance}

val column_operation = udf {transformInstance}
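
To make the pattern above concrete, here is a minimal sketch, assuming Spark 2.0+ with spark-mllib on the classpath. ScaleBy, its factor parameter, and the column-name arguments are hypothetical names used only for illustration; they are not classes or parameters from spark.ml or from this patch.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical transformer (not part of spark.ml): multiplies every
// feature by a constant factor.
class ScaleBy(factor: Double) extends Serializable {

  // Single-instance path: no DataFrame and no SparkSession involved.
  def transformInstance(features: Vector): Vector =
    Vectors.dense(features.toArray.map(_ * factor))

  // DataFrame path: the udf simply delegates to transformInstance,
  // so both paths share one implementation.
  def transform(df: DataFrame, inputCol: String, outputCol: String): DataFrame = {
    val columnOperation = udf(transformInstance _)
    df.withColumn(outputCol, columnOperation(col(inputCol)))
  }
}
```

A serving path could then call new ScaleBy(2.0).transformInstance(Vectors.dense(1.0, 2.5)) directly, while batch jobs keep using transform on a DataFrame.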

Predictors also lack a public method that makes predictions on single instances. transformInstance can easily be added to predictors as a wrapper around the internal predict method (which takes features as input).
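
The wrapper idea for predictors might look roughly like the sketch below. TinyLinearModel is a hypothetical stand-in for a fitted model, not a spark.ml class, and the assumption that predict is non-public mirrors the description above rather than any specific Spark source file.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical stand-in for a fitted linear model (not a spark.ml class).
class TinyLinearModel(weights: Vector, intercept: Double) extends Serializable {

  // Internal prediction logic, assumed here to be non-public.
  protected def predict(features: Vector): Double = {
    require(features.size == weights.size, "feature and weight sizes must match")
    val dot = weights.toArray.zip(features.toArray).map { case (w, x) => w * x }.sum
    dot + intercept
  }

  // Public single-instance entry point: a thin wrapper around predict.
  def transformInstance(features: Vector): Double = predict(features)
}

object TinyLinearModelExample {
  def main(args: Array[String]): Unit = {
    val model = new TinyLinearModel(Vectors.dense(0.5, -1.0), 2.0)
    // 0.5 * 4.0 + (-1.0) * 1.0 + 2.0
    println(model.transformInstance(Vectors.dense(4.0, 1.0)))
  }
}
```

Keeping the wrapper this thin means the DataFrame-based transform and the single-instance call cannot drift apart, since both end up in the same predict implementation.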

The proposed method is added to all predictors and feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might require bigger changes because they depend on the dataset's schema (they can be fixed using simple hacks, but this needs to be discussed).

How was this patch tested?

The current tests for transformers and predictors, which invoke transformInstance internally, passed.

@AmplabJenkins

Can one of the admins verify this patch?

@hazimehh
Author

hazimehh commented Jul 11, 2016

@rxin @jkbradley @mengxr can you review this?

@rxin
Contributor

rxin commented Jul 14, 2016

I don't know ML that well.

cc @jkbradley @thunterdb @dbtsai @yanboliang

@jkbradley
Member

I just responded on the main JIRA. Can you please check that out and close this issue for now? Thanks!

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
@asfgit asfgit closed this in 53e766c Aug 4, 2016