2 changes: 1 addition & 1 deletion python/pyspark/rdd.py
@@ -2067,7 +2067,7 @@ def add_shuffle_key(split, iterator):
    avg = int(size / n) >> 20
    # let 1M < avg < 10M
    if avg < 1:
-       batch *= 1.5
Member
Actually, when `get_used_memory() > limit` is true, I don't know why we want to increase the batch with `batch *= 1.5`.

Member
I guess it's to increase the size of the batch and to use more memory...?

Member (@viirya, May 26, 2021)
Hm... I thought increasing batch is for the `c > batch` case. In other words, it increases the batch size when the count reaches the current batch size but used memory is still under the limit (and the average bucket size is small).

If the memory limit is reached before the batch size is (meaning the current batch size is already more than the memory limit allows), it does not seem to make sense to increase the batch size, even if the average bucket size is small.

Contributor (Author)
Agree. The batch size should not increase when the memory limit is reached.

+       batch = min(sys.maxsize, batch * 1.5)
Member
Should we use `sys.float_info.max` instead, @nolanliou? Hm, btw, I just realised that it's funny that it can reach the maximum of a float...

Member
Anyway, avoiding a failure on the batch size makes sense.

Contributor (Author)
I think `sys.maxsize` is ok.

It's not easy to encounter this problem, but I ran into it...

Member
Okay, a batch size of `sys.maxsize` already doesn't make much sense anyway.

    elif avg > 10:
        batch = max(int(batch / 1.5), 1)
    c = 0
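
For context on the heuristic being discussed, below is a simplified, self-contained sketch of the batching loop around this change. It is illustrative only and not the actual pyspark/rdd.py code: the function name, the initial batch size of 1000, the `serialize` callback, and the periodic `c % 1000 == 0` memory check are assumptions; the pieces reconstructed from the diff hunk and the comments above are `get_used_memory() > limit`, the `c > batch` trigger, the 1–10 MB average chunk target, and the `sys.maxsize` cap added in this PR.

```python
import sys
from collections import defaultdict


def shuffle_with_adaptive_batch(records, num_partitions, partition_func,
                                get_used_memory, limit, serialize):
    """Bucket records by partition and flush them in batches, adapting the
    batch size so the average serialized chunk stays roughly between
    1 MB and 10 MB. Illustrative sketch only, not the real rdd.py code."""
    buckets = defaultdict(list)
    c, batch = 0, 1000  # assumed initial batch size

    for k, v in records:
        buckets[partition_func(k) % num_partitions].append((k, v))
        c += 1

        # Flush when memory pressure is detected (checked periodically)
        # or when the current batch size is reached (the `c > batch`
        # case mentioned in the review comments).
        if (c % 1000 == 0 and get_used_memory() > limit) or c > batch:
            n, size = len(buckets), 0
            for pid in list(buckets.keys()):
                chunk = serialize(buckets.pop(pid))
                size += len(chunk)
                yield pid, chunk

            avg = int(size / n) >> 20  # average chunk size in MB
            # let 1M < avg < 10M
            if avg < 1:
                # Grow the batch, but cap it so repeated *1.5 growth
                # cannot run away unbounded -- this cap is the change
                # made in this diff.
                batch = min(sys.maxsize, batch * 1.5)
            elif avg > 10:
                batch = max(int(batch / 1.5), 1)
            c = 0

    # Flush whatever remains after the input is exhausted.
    for pid, items in buckets.items():
        yield pid, serialize(items)
```

Note that `batch` becomes a float after the first `* 1.5`; taking `min()` against the integer `sys.maxsize` still works because Python compares ints and floats directly, which is consistent with the author's view that `sys.maxsize` is acceptable versus `sys.float_info.max`.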