
Conversation

@nolanliou (Contributor) commented May 26, 2021

What changes were proposed in this pull request?

Limit the batch size used by add_shuffle_key in the partitionBy function, to fix "OverflowError: cannot convert float infinity to integer".
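
In rough terms, the change caps the growth of batch so it can never reach float infinity. A minimal sketch of the capped update, based on the test snippet later in this description (next_batch_size is a hypothetical helper for illustration, not the literal diff):

    import sys

    def next_batch_size(batch, avg):
        # Grow the batch when serialized chunks are small (avg < 1 MB), but cap
        # the growth so repeated * 1.5 can never overflow to float("inf").
        if avg < 1:
            return min(sys.maxsize, batch * 1.5)
        # Shrink when chunks are large (avg > 10 MB); int() is safe on a finite value.
        if avg > 10:
            return max(int(batch / 1.5), 1)
        return batch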

Why are the changes needed?

It's not easy to write a unit test for this, but some simple code can explain the bug.

  • Original code
    def add_shuffle_key(split, iterator):
        buckets = defaultdict(list)
        c, batch = 0, min(10 * numPartitions, 1000)

        for k, v in iterator:
            buckets[partitionFunc(k) % numPartitions].append((k, v))
            c += 1

            # check used memory and avg size of chunk of objects
            if (c % 1000 == 0 and get_used_memory() > limit
                    or c > batch):
                n, size = len(buckets), 0
                for split in list(buckets.keys()):
                    yield pack_long(split)
                    d = outputSerializer.dumps(buckets[split])
                    del buckets[split]
                    yield d
                    size += len(d)

                avg = int(size / n) >> 20
                # let 1M < avg < 10M
                if avg < 1:
                    batch *= 1.5  # unbounded growth: batch can overflow to float("inf")
                elif avg > 10:
                    batch = max(int(batch / 1.5), 1)  # int(inf / 1.5) raises OverflowError
                c = 0
If get_used_memory() > limit is always True and avg < 1 is always True, the variable batch grows without bound and eventually overflows to float infinity. Then, the first time avg > 10, batch = max(int(batch / 1.5), 1) raises OverflowError, because int() cannot convert float infinity to an integer.
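
The failure is easy to see in isolation: once a float reaches infinity, int() refuses to convert it:

    >>> batch = float("inf")
    >>> int(batch / 1.5)  # inf / 1.5 is still inf
    Traceback (most recent call last):
      ...
    OverflowError: cannot convert float infinity to integer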

  • Sample code to reproduce the bug:

    limit = 100
    used_memory = 200
    numPartitions = 64
    c, batch = 0, min(10 * numPartitions, 1000)

    while True:
        c += 1
        if (c % 1000 == 0 and used_memory > limit or c > batch):
            batch = batch * 1.5            # grows without bound, eventually float("inf")
            d = max(int(batch / 1.5), 1)   # raises OverflowError once batch is inf
            print(c, batch)

Does this PR introduce any user-facing change?

no

How was this patch tested?

It's not easy to write a unit test; here is sample code that exercises the fix:

    import sys

    limit = 100
    used_memory = 200
    numPartitions = 64
    c, batch = 0, min(10 * numPartitions, 1000)

    while True:
        c += 1
        if (c % 1000 == 0 and used_memory > limit or c > batch):
            batch = min(sys.maxsize, batch * 1.5)  # the cap keeps batch finite
            d = max(int(batch / 1.5), 1)           # no longer raises OverflowError
            print(c, batch)
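
For a run that terminates, a bounded variant of the same loop (hypothetical; 200,000 iterations is far more than batch needs to reach the cap) finishes without raising:

    import sys

    limit, used_memory, numPartitions = 100, 200, 64
    batch = min(10 * numPartitions, 1000)

    for c in range(1, 200_001):
        if (c % 1000 == 0 and used_memory > limit or c > batch):
            batch = min(sys.maxsize, batch * 1.5)
            # batch is always finite now, so int() cannot overflow
            assert max(int(batch / 1.5), 1) >= 1

    print("no OverflowError; final batch =", batch)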

@HyukjinKwon (Member) left a comment

Looks like it makes sense. cc @viirya and @ueshin too for double-checking.

@HyukjinKwon (Member)

ok to test

@HyukjinKwon (Member)

@nolanliou please take a look at https://github.com/apache/spark/pull/32667/checks?check_run_id=2671465975 and enable GitHub Actions. The Apache Spark repository uses the resources in each forked repository in PR builds.

@HyukjinKwon HyukjinKwon changed the title [SPARK-35512][PYTHON]: fix OverflowError(cannot convert float infinity to integer) in partitionBy function [SPARK-35512][PYTHON] Fix OverflowError(cannot convert float infinity to integer) in partitionBy function May 26, 2021
Member

Actually, when get_used_memory() > limit is true, I don't know why we want to increase batch *= 1.5.

Member

I guess it's to increase the size of the batch and use more memory ...?

@viirya (Member) commented May 26, 2021

Hm... I thought increasing batch is for the c > batch case. In other words, it increases the batch size when the count reaches the current batch size but used memory is still under the limit (and the average bucket size is small).

If it reaches the memory limit before reaching the batch size (which means the current batch size is already more than memory allows), it does not seem to make sense to increase the batch size (even if the average bucket size is small).
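
(To make that distinction concrete, the two triggers in the spill condition can be named separately; this is just an illustrative rewrite, not code from this PR:)

    def should_spill(c, batch, used_memory, limit):
        batch_full = c > batch  # batch reached: growing it afterwards is reasonable
        over_memory = c % 1000 == 0 and used_memory > limit  # growing batch here only adds pressure
        return batch_full or over_memory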

@nolanliou (Contributor, Author)

Agreed. The batch size should not increase when the memory limit is reached.

@viirya (Member) left a comment

Looks reasonable, although I have a question about why we increase batch when get_used_memory() > limit.

@SparkQA commented May 26, 2021

Test build #138957 has finished for PR 32667 at commit 9710887.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Should we use sys.float_info.max instead, @nolanliou? Hm, by the way, I just realised it's funny that it can reach the maximum of float ...

Member

Anyway, avoiding a failure on the batch size makes sense.

@nolanliou (Contributor, Author)

I think sys.maxsize is ok.

It’s not easy to encounter this problem, but I ran into it...
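
(For reference, the two caps discussed here; either keeps the conversion finite:)

    import sys

    print(sys.maxsize)         # 9223372036854775807, i.e. 2**63 - 1 on 64-bit CPython
    print(sys.float_info.max)  # ~1.7976931348623157e+308

    # Both are finite, so int(batch / 1.5) succeeds under either cap;
    # sys.maxsize is already far larger than any sensible batch size.
    batch = min(sys.maxsize, sys.maxsize * 1.5)
    print(max(int(batch / 1.5), 1))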

Member

Okay; a sys.maxsize batch size already doesn't make much sense anyway.

@SparkQA commented May 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43477/

@nolanliou (Contributor, Author)

> @nolanliou please take a look at https://github.com/apache/spark/pull/32667/checks?check_run_id=2671465975 and enable GitHub Actions. The Apache Spark repository uses the resources in each forked repository in PR builds.

I have enabled Actions (Allow all actions), but it still doesn't work...

@HyukjinKwon (Member)

@nolanliou did you face something like this: #32400 (comment)?

@SparkQA commented May 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43477/

@SparkQA commented May 26, 2021

Test build #138964 has finished for PR 32667 at commit b6241fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43483/

@SparkQA commented May 26, 2021

Test build #138973 has finished for PR 32667 at commit 67e6d71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43483/

@nolanliou (Contributor, Author) commented May 26, 2021

> @nolanliou did you face something like this: #32400 (comment)?

All tests passed?

@HyukjinKwon (Member)

Let's wait a few days to make sure other people can review.

@SparkQA commented May 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43493/

@SparkQA commented May 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43493/

@nolanliou (Contributor, Author)

Any updates?

@viirya (Member) commented Jun 8, 2021

retest this please

@SparkQA commented Jun 8, 2021

Test build #139511 has finished for PR 32667 at commit 67e6d71.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44037/

@SparkQA commented Jun 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44037/

@HyukjinKwon (Member)

Merged to master.

@nolanliou nolanliou deleted the fix_partitionby branch June 9, 2021 03:18