[SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch #12057
    @@ -71,7 +71,6 @@ case class BatchPythonEvaluation(udfs: Seq[PythonUDF], output: Seq[Attribute], c

       val (pyFuncs, children) = udfs.map(collectFunctions).unzip
       val numArgs = children.map(_.length)
    -  val resultType = StructType(udfs.map(u => StructField("", u.dataType, u.nullable)))

       val pickle = new Pickler
       // flatten all the arguments
    @@ -97,15 +96,26 @@ case class BatchPythonEvaluation(udfs: Seq[PythonUDF], output: Seq[Attribute], c
             .compute(inputIterator, context.partitionId(), context)

           val unpickle = new Unpickler
    -      val row = new GenericMutableRow(1)
    +      val mutableRow = new GenericMutableRow(1)
           val joined = new JoinedRow
    +      val resultType = if (udfs.length == 1) {
    +        udfs.head.dataType
    +      } else {
    +        StructType(udfs.map(u => StructField("", u.dataType, u.nullable)))
    +      }
           val resultProj = UnsafeProjection.create(output, output)

           outputIterator.flatMap { pickedResult =>
             val unpickledBatch = unpickle.loads(pickedResult)
             unpickledBatch.asInstanceOf[java.util.ArrayList[Any]].asScala
           }.map { result =>
    -        val row = EvaluatePython.fromJava(result, resultType).asInstanceOf[InternalRow]
    +        val row = if (udfs.length == 1) {

Contributor
Rather than evaluating this

Contributor
If you do this, you could reduce the scope of the

Contributor (Author)
Compared with evaluating the Python UDF, I think this does not matter; the JIT compiler could predict this branch pretty easily.

Contributor
Fair enough.

    +          // fast path for single UDF
    +          mutableRow(0) = EvaluatePython.fromJava(result, resultType)
    +          mutableRow
    +        } else {
    +          EvaluatePython.fromJava(result, resultType).asInstanceOf[InternalRow]
    +        }
             resultProj(joined(queue.poll(), row))
           }
         }
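For readers less familiar with the Scala side, the branching in the hunk above can be sketched in plain Python. This is only an illustration under stated assumptions: `attach_results`, its parameters, and the use of tuples as rows are hypothetical stand-ins, not Spark APIs. With a single UDF the result arrives as a bare value and is wrapped in a one-slot row; with several UDFs the result already holds one value per UDF.

```python
from collections import deque


def attach_results(input_rows, results, num_udfs):
    """Pair each buffered input row with its UDF result(s), in order.

    Hypothetical helper: tuples stand in for rows, and a deque stands in
    for the queue of pending input rows used in the diff.
    """
    queue = deque(input_rows)
    out = []
    for result in results:
        if num_udfs == 1:
            # Fast path: the result is a bare value, so wrap it in a one-slot row.
            udf_row = (result,)
        else:
            # General path: the result already carries one value per UDF.
            udf_row = tuple(result)
        # Roughly analogous to joined(queue.poll(), row) in the Scala code.
        out.append(queue.popleft() + udf_row)
    return out


single = attach_results([(1,), (2,)], [10, 20], num_udfs=1)
multi = attach_results([(1,), (2,)], [(10, "a"), (20, "b")], num_udfs=2)
print(single)  # [(1, 10), (2, 20)]
print(multi)   # [(1, 10, 'a'), (2, 20, 'b')]
```

The sketch leaves out pickling, type conversion, and the final projection, which the real code handles through `EvaluatePython.fromJava` and `resultProj`.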
I bet you could even do `mapper = udf` if you wanted to.

We can't: the input of mapper is a tuple, but the input of udf is not.

Ah, got it. Makes sense.
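The point in this thread can be shown with a minimal sketch. The `udf` and `mapper` functions below are hypothetical examples, not the actual worker code: the mapper receives all arguments for one row as a single tuple, while the user function expects them as separate positional arguments, so the two are not interchangeable.

```python
def udf(a, b):
    # Hypothetical user function that takes positional arguments.
    return a + b


def mapper(args):
    # The mapper gets one tuple per row and unpacks it for the UDF.
    return udf(*args)


print(mapper((1, 2)))  # 3

try:
    # Roughly what `mapper = udf` would amount to: passing the tuple as-is.
    udf((1, 2))
except TypeError as exc:
    print("TypeError:", exc)
```

Even for a one-argument function the shortcut would change behavior, because the function would receive a one-element tuple rather than the value itself.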