Closed
Changes from 1 commit
Commits
28 commits
2a1cd27
optimization + test
bogdanrdc Aug 23, 2018
421ee20
debug benchmark + early batch
bogdanrdc Aug 23, 2018
d7e49e7
revert benchmark
bogdanrdc Aug 23, 2018
ba6d91e
Merge remote-tracking branch 'upstream/master' into local-relation-fi…
bogdanrdc Aug 24, 2018
326e5d7
test fix
bogdanrdc Aug 24, 2018
4263bd2
[SPARK-25073][YARN] AM and Executor Memory validation message is not …
sujith71955 Aug 24, 2018
f84d256
[SPARK-25214][SS] Fix the issue that Kafka v2 source may return dupli…
zsxwing Aug 24, 2018
c721895
[SPARK-25174][YARN] Limit the size of diagnostic message for am to un…
yaooqinn Aug 24, 2018
f8536e3
[SPARK-25234][SPARKR] avoid integer overflow in parallelize
mengxr Aug 24, 2018
af6a91e
Correct missing punctuation in the documentation
Aug 25, 2018
c613c6b
[MINOR] Fix Scala 2.12 build
dbtsai Aug 25, 2018
ee1c0e8
[SPARK-24688][EXAMPLES] Modify the comments about LabeledPoint
huangweizhe123 Aug 25, 2018
77fb55e
[SPARK-25214][SS][FOLLOWUP] Fix the issue that Kafka v2 source may re…
zsxwing Aug 25, 2018
b00824c
[SPARK-23792][DOCS] Documentation improvements for datetime functions
abradbury Aug 26, 2018
ee6cb6c
[SPARK-23698][PYTHON][FOLLOWUP] Resolve undefiend names in setup.py
HyukjinKwon Aug 27, 2018
c129176
[SPARK-19355][SQL][FOLLOWUP] Remove the child.outputOrdering check in…
viirya Aug 27, 2018
368b42f
[SPARK-24978][SQL] Add spark.sql.fast.hash.aggregate.row.max.capacity…
heary-cao Aug 27, 2018
0378b1f
[SPARK-25249][CORE][TEST] add a unit test for OpenHashMap
10110346 Aug 27, 2018
d5a953a
[SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kafka tests.
jose-torres Aug 27, 2018
3598483
[SPARK-24149][YARN][FOLLOW-UP] Only get the delegation tokens of the …
wangyum Aug 27, 2018
397fa62
[SPARK-24090][K8S] Update running-on-kubernetes.md
liyinan926 Aug 27, 2018
dcd001b
[SPARK-24721][SQL] Exclude Python UDFs filters in FileSourceStrategy
icexelloss Aug 28, 2018
b23538b
[SPARK-25218][CORE] Fix potential resource leaks in TransportServer a…
zsxwing Aug 28, 2018
f769a94
[SPARK-25005][SS] Support non-consecutive offsets for Kafka
zsxwing Aug 28, 2018
68c41ff
comment
bogdanrdc Aug 28, 2018
dad6a7f
Merge remote-tracking branch 'upstream/master' into local-relation-fi…
bogdanrdc Aug 28, 2018
cb067c3
Merge remote-tracking branch 'upstream/master' into local-relation-fi…
bogdanrdc Aug 28, 2018
d552cc1
space
bogdanrdc Aug 28, 2018
[SPARK-25234][SPARKR] avoid integer overflow in parallelize
## What changes were proposed in this pull request?

`parallelize` uses integer multiplication to determine the split indices, which can cause integer overflow when the collection is large.
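
As a rough illustration of the overflow described above (a minimal sketch, not code from this patch): R integers are 32-bit, so multiplying an integer slice index by an integer length can exceed `.Machine$integer.max` and yield `NA` with a warning. The values below mirror the 47000-element case used in the unit test; the variable names are chosen for illustration only.

```r
# Sketch only: variable names are illustrative, not taken from the patch.
x <- 46999L                 # last slice index produced by 0:(47000 - 1)
len <- length(1:47000)      # length() returns an integer here

# integer * integer overflows; R returns NA and emits a warning
int_product <- suppressWarnings(x * len)

# converting one operand to double first avoids the overflow
dbl_product <- as.numeric(x) * len

print(is.na(int_product))
print(dbl_product > .Machine$integer.max)
```

Converting the index with `as.numeric` before multiplying is the approach the diff in this commit takes.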

## How was this patch tested?

unit test

Closes #22225 from mengxr/SPARK-25234.

Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
mengxr authored and bogdanrdc committed Aug 28, 2018
commit f8536e324ddb7cce769f8a5970e62743f7914fe8
9 changes: 4 additions & 5 deletions R/pkg/R/context.R
@@ -138,11 +138,10 @@ parallelize <- function(sc, coll, numSlices = 1) {

   sizeLimit <- getMaxAllocationLimit(sc)
   objectSize <- object.size(coll)
+  len <- length(coll)

   # For large objects we make sure the size of each slice is also smaller than sizeLimit
-  numSerializedSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
-  if (numSerializedSlices > length(coll))
-    numSerializedSlices <- length(coll)
+  numSerializedSlices <- min(len, max(numSlices, ceiling(objectSize / sizeLimit)))

   # Generate the slice ids to put each row
   # For instance, for numSerializedSlices of 22, length of 50
@@ -153,8 +152,8 @@ parallelize <- function(sc, coll, numSlices = 1) {
   splits <- if (numSerializedSlices > 0) {
     unlist(lapply(0: (numSerializedSlices - 1), function(x) {
       # nolint start
-      start <- trunc((x * length(coll)) / numSerializedSlices)
-      end <- trunc(((x + 1) * length(coll)) / numSerializedSlices)
+      start <- trunc((as.numeric(x) * len) / numSerializedSlices)
+      end <- trunc(((as.numeric(x) + 1) * len) / numSerializedSlices)
       # nolint end
       rep(start, end - start)
     }))
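
The slice-id computation in this hunk can be tried in isolation. The following is a minimal sketch that assumes a hypothetical helper `slice_ids` (not part of SparkR) wrapping the patched formula; the empty-result branch for a non-positive slice count is also an assumption, since the corresponding branch is not visible in this diff.

```r
# Hypothetical helper for illustration; SparkR does not export this function.
slice_ids <- function(len, numSerializedSlices) {
  # Assumed behaviour for the case not shown in the diff
  if (numSerializedSlices <= 0) {
    return(numeric(0))
  }
  unlist(lapply(0:(numSerializedSlices - 1), function(x) {
    # as.numeric() forces double arithmetic, so x * len cannot overflow
    start <- trunc((as.numeric(x) * len) / numSerializedSlices)
    end <- trunc(((as.numeric(x) + 1) * len) / numSerializedSlices)
    # each element in [start, end) is tagged with the slice's start offset
    rep(start, end - start)
  }))
}

ids <- slice_ids(10L, 3L)
big <- slice_ids(47000L, 47000L)
print(ids)
print(anyNA(big))
```

With 10 elements and 3 slices, the ids mark groups of sizes 3, 3, and 4; with 47000 elements and 47000 slices, every element gets its own id and no `NA` appears.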
7 changes: 7 additions & 0 deletions R/pkg/tests/fulltests/test_context.R
@@ -240,3 +240,10 @@ test_that("add and get file to be downloaded with Spark job on every node", {
   unlink(path, recursive = TRUE)
   sparkR.session.stop()
 })
+
+test_that("SPARK-25234: parallelize should not have integer overflow", {
+  sc <- sparkR.sparkContext(master = sparkRTestMaster)
+  # 47000 * 47000 exceeds integer range
+  parallelize(sc, 1:47000, 47000)
+  sparkR.session.stop()
+})