
Conversation

@amaliujia (Contributor):

What changes were proposed in this pull request?

  1. Support `Range` in Connect proto.
  2. Refactor `SparkConnectDeduplicateSuite` to `SparkConnectSessionBasedSuite`.

Why are the changes needed?

Improve API coverage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@amaliujia (Author):

R: @cloud-fan

@AmplabJenkins:

Can one of the admins verify this patch?

Contributor:

end is not optional, but how do we know if the client forgets to set it? 0 is a valid end value as well.

@amaliujia (Author):

Yeah, this gets tricky. Ultimately we can wrap every such field in a message, so we always know whether the field is set. However, that might complicate the entire proto too much. Let's have a discussion on that.
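
For readers following the thread: wrapping the scalar in a message is the approach the PR ends up taking (see the `Step` excerpt later in this conversation). A minimal sketch of why the wrapper helps, with accessor names assumed from standard protobuf Java codegen rather than taken from the actual generated classes:

```scala
// Because `step` is a message field rather than a bare int32, the generated
// code exposes a has-accessor, so "never set" is distinguishable from "set to 0".
val rel = proto.Range.newBuilder().setStart(0).setEnd(10).build()

val step: Int =
  if (rel.hasStep) rel.getStep.getStep // value explicitly sent by the client
  else 1                               // Spark's default when the field is unset
```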

Contributor:

We can call `session.leafNodeDefaultParallelism`.
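
That is, when the client leaves `num_partitions` unset, the planner can fall back to the session default instead of a hard-coded value. A hedged sketch of the server-side fallback (the wrapper accessors are assumed, as above):

```scala
val numPartitions: Int =
  if (rel.hasNumPartitions) rel.getNumPartitions.getNumPartitions
  else session.leafNodeDefaultParallelism
```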

@amaliujia marked this pull request as draft on October 27, 2022 02:15
@amaliujia marked this pull request as ready for review on October 29, 2022 08:01
@amaliujia (Author):

PR should be ready for review again.

Contributor:

How about we call `spark.range(10).toDF`? Then we don't need to add `comparePlansDatasetLong`.

@amaliujia (Author):

Let me try it and see if it gives the exact same plan.

Another idea: we could just compare the results through `collect()`, so we don't compare the plans in this case.
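
A sketch of that alternative, assuming `connectPlan` is the logical plan the planner produced from the Connect proto (`Dataset.ofRows` is Spark-internal, usable from the test package):

```scala
import org.apache.spark.sql.Dataset

// Compare results instead of plans: run both sides and check the rows match.
val connectRows = Dataset.ofRows(spark, connectPlan).collect()
val expectedRows = spark.range(10).toDF().collect()
assert(connectRows.sameElements(expectedRows))
```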

@amaliujia (Author):

Oh, `.toDF()` just converts the Dataset into a DataFrame.

With that, `comparePlansDatasetLong` has been removed.
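
In other words, the test can now compare two DataFrame plans directly; `comparePlans` below stands in for the suite's generic plan check (a hypothetical helper name):

```scala
// spark.range(10) returns Dataset[java.lang.Long]; .toDF() turns it into a
// DataFrame, so the generic plan comparison applies and no Long-specific
// helper is needed.
val expected = spark.range(10).toDF()
comparePlans(connectPlan, expected.queryExecution.analyzed)
```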

@amaliujia (Author):

Conflict resolved

```scala
    end: Int,
    step: Option[Int],
    numPartitions: Option[Int]): Relation = {
  val range = proto.Range.newBuilder()
```
@amaliujia (Author):

Note that I need to keep `proto.Range`, since `Range` itself is a built-in Scala class; the `proto.` prefix is needed to disambiguate in this special case.

Contributor:

I've explicitly requested this a couple of times already as a coding style: always prefix the proto-generated classes with `proto.`. I know it uses a little more horizontal space, but it always makes clear where a particular element comes from, which is tremendously useful because we consistently use types from both the Catalyst API and Spark Connect in the same code paths.

@amaliujia (Author):

That makes sense for `SparkConnectPlanner`, where Catalyst and proto types are mixed together, and we are keeping the approach you asked for there.

However, this is the Connect DSL, which deals only with protos; no Catalyst is included in this package.

Contributor:

As long as no Catalyst is in this package, this is good with me. Thanks for clarifying.

Comment on lines +217 to +231
Contributor:

Is this really the best way to express the optionality?

@amaliujia (Author, Oct 31, 2022):

There are two dimensions to this:

  1. Required versus optional. A required field must be set; an optional field may or may not be set.

  2. Whether the field has a default value. An optional field can have a default value that is applied when it is not set.

The second point builds on the first: when an optional field is not set, a default value can be used in its place. In the special case where the proto default happens to equal the default Spark uses, we don't need to distinguish set from unset. Otherwise we need this pattern to tell set apart from unset, so that Spark's default values can be applied (unless we don't care about Spark's defaults).
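
To make that concrete, here is roughly what the DSL helper quoted above does with its `Option` parameters (a sketch: the nested wrapper messages `Step` and `NumPartitions` and their inner field names are assumptions based on the proto excerpt later in this thread):

```scala
def range(
    start: Int,
    end: Int,
    step: Option[Int],
    numPartitions: Option[Int]): proto.Relation = {
  val range = proto.Range.newBuilder()
  range.setStart(start)
  range.setEnd(end)
  // Setting the wrapper message only when a value is provided preserves the
  // set/unset distinction on the wire: an absent Option leaves the field
  // unset, so the server can apply Spark's defaults instead of proto's zeros.
  step.foreach(s => range.setStep(proto.Range.Step.newBuilder().setStep(s)))
  numPartitions.foreach { n =>
    range.setNumPartitions(
      proto.Range.NumPartitions.newBuilder().setNumPartitions(n))
  }
  proto.Relation.newBuilder().setRange(range).build()
}
```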

@amaliujia (Author):

To answer your question directly: if we want to respect Spark's default values for those optional fields whose proto defaults differ from Spark's, this is the only way to do it.

Contributor:

So, in fewer words :) when `num_partitions` is a bare integer, its value is 0 even when it was never set, and for scalar types we can't tell whether the field is present or not. Having to guess whether 0 is a valid value defeats the purpose.

Thanks for the additional color!

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in b3ed0c1 on Nov 1, 2022
```proto
int32 start = 1;
int32 end = 2;
// Optional. Default value = 1
Step step = 3;
```
Contributor:

`start`, `end`, and `step` should use `int64` @amaliujia
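
For context, the public API these fields mirror already takes 64-bit values, so `int32` would silently narrow the supported range:

```scala
// SparkSession.range accepts Long start/end/step (and an Int partition count),
// so an end beyond Int.MaxValue is perfectly legal at the API level.
val df = spark.range(0L, 10000000000L, 2L, numPartitions = 4).toDF()
```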

@amaliujia (Author):

Yes, let me follow up. I think I was looking at the Python-side API, which confused me about the types.

@amaliujia (Author):

Updating in #38460.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022

Closes apache#38347 from amaliujia/add_range.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>