[SPARK-35512][PYTHON] Fix OverflowError(cannot convert float infinity to integer) in partitionBy function #32667
Conversation
ok to test
@nolanliou please take a look at https://github.com/apache/spark/pull/32667/checks?check_run_id=2671465975 and enable GitHub Actions. The Apache Spark repository uses the resources of each forked repository in PR builds.
python/pyspark/rdd.py (outdated)
Actually, when `get_used_memory() > limit` is true, I don't know why we want to increase the batch via `batch *= 1.5`.
I guess to increase the size of `batch` and to use more memory...?
Hm.. I thought increasing `batch` is for the `c > batch` case. In other words, it increases the batch size when the count reaches the current batch size while used memory is still under the limit (and the average bucket size is small).
If it reaches the memory limit before reaching the batch size (meaning the current batch size already exceeds what the memory limit allows), it doesn't seem to make sense to increase the batch size (even if the average bucket size is small).
Agree. The batch size should not increase when reaching the memory limit.
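For context, here is a simplified sketch of the spill logic under discussion, paraphrased from `add_shuffle_key` in `python/pyspark/rdd.py`. Names follow the real code, but the surrounding spill details are elided, so treat it as an approximation rather than the exact implementation:

```python
# Paraphrased sketch: buckets are spilled either when the memory
# check fires or when the element count exceeds the current batch.
if (c % 1000 == 0 and get_used_memory() > limit) or c > batch:
    # ... serialize and spill all buckets, accumulating the total
    # serialized `size` across the `n` buckets ...
    avg = int(size / n) >> 20             # average bucket size in MiB
    # steer toward 1 MiB < avg < 10 MiB
    if avg < 1:
        batch *= 1.5                      # small buckets: spill less often
    elif avg > 10:
        batch = max(int(batch / 1.5), 1)  # large buckets: spill more often
    c = 0
```

The question above arises because the `avg < 1` branch also runs when the memory check, rather than `c > batch`, triggered the spill.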
viirya left a comment:
Looks reasonable, although I have a question: why do we increase `batch` when `get_used_memory() > limit`?
Test build #138957 has finished for PR 32667 at commit
python/pyspark/rdd.py (outdated)
Should we use `sys.float_info.max` instead, @nolanliou? Hm, btw, just realised that it's funny that it can reach the maximum of a float...
Anyway, avoiding a failure on the batch size makes sense.
I think `sys.maxsize` is ok.
It’s not easy to encounter this problem, but I ran into it...
Okay, a batch size of `sys.maxsize` already doesn't make much sense anyway.
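To illustrate, the cap being discussed would look something along these lines (a minimal runnable sketch of the approach, using a hypothetical `tune_batch` helper for demonstration; in `rdd.py` this logic is inline rather than a function, and this is not necessarily the exact diff merged in this PR):

```python
import sys

def tune_batch(batch, avg):
    # Hypothetical helper wrapping the tuning rule for illustration.
    if avg < 1:
        # Cap growth at sys.maxsize so a later int(batch / 1.5)
        # can never see float infinity.
        batch = min(sys.maxsize, batch * 1.5)
    elif avg > 10:
        batch = max(int(batch / 1.5), 1)
    return batch

print(tune_batch(1000, 0))          # 1500.0
print(tune_batch(float("inf"), 0))  # clamped to sys.maxsize
```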
Kubernetes integration test starting
Force-pushed 9710887 to b6241fa
I have enabled Actions (Allow all actions), but it still doesn't work...
@nolanliou did you face something like this: #32400 (comment)?
Force-pushed b6241fa to 67e6d71
Kubernetes integration test status failure
Test build #138964 has finished for PR 32667 at commit
Kubernetes integration test starting
Test build #138973 has finished for PR 32667 at commit
Kubernetes integration test status success
All tests passed?
Let's wait for a few days to make sure other people can review.
Kubernetes integration test starting
Kubernetes integration test status success
Any updates?
retest this please
Test build #139511 has finished for PR 32667 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Merged to master. |
What changes were proposed in this pull request?
Limit the batch size for `add_shuffle_key` in the `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer`.
Why are the changes needed?
It's not easy to write a UT, but I can use some simple code to explain the bug.
If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` will grow to infinity. Then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` if `avg > 10` at some point.
Does this PR introduce any user-facing change?
no
How was this patch tested?
It's not easy to write a UT; there is sample code to test, sketched below.
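The exact sample code isn't reproduced above, so here is a minimal stand-in demonstrating the failure mode in plain Python, no Spark required (the loop simulates `avg < 1` holding on every spill):

```python
# Repeated *= 1.5 growth drives a Python float to infinity,
# after which int() raises the OverflowError from the PR title.
batch = 1000.0
for _ in range(2000):   # ~1730 iterations suffice to reach inf
    batch *= 1.5

print(batch)            # inf

try:
    batch = max(int(batch / 1.5), 1)   # the `avg > 10` branch
except OverflowError as e:
    print(e)            # cannot convert float infinity to integer
```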