Summary
I wanted to raise a design question related to GPU memory handling in Codon.
Motivation
This came up while running benchmarks such as npbench.
What I realized is that, at the moment, there is no way to keep data resident in device memory across kernel launches (a persistent device memory model).
Under the current structure, launching a kernel implicitly triggers host-to-device copies for the kernel arguments, and then copies data back afterward.
Because of that, it is difficult to benchmark or optimize the actual kernel execution separately from the host-device transfer overhead.
Expectation
What I would like is something closer to the explicit memory model used in the CUDA Python ecosystem, for example Numba's cuda.to_device:
from numba import cuda
import numpy as np
a = np.arange(10)
d_a = cuda.to_device(a)
More specifically, I want to separate device memory allocation and kernel launch explicitly, so that kernels can run without always paying the host-to-device copy cost.
Solution
My current thought is that, instead of redesigning the entire GPU interface, we could introduce something like a DeviceMemory class in gpu.codon that only holds a device pointer / handle.
Then, when to_gpu is called on such an object, unlike normal host-side types such as list or ndarray, it would not perform a copy but would simply return the underlying pointer or handle representation.
I wanted to ask what you think about this direction.
It seems like a relatively minimal extension to the current model, while still making persistent device allocation and copy-free kernel launches possible.
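To make the idea concrete, here is a rough pure-Python model of the dispatch I have in mind. It does not touch a real GPU and is not Codon's actual implementation: FakeDevice, DeviceMemory, to_gpu, launch, and add_one are all hypothetical names, and the "device" is just a dictionary that counts transfers. It only illustrates that a handle-only type can skip the copies that a host-side list pays on every launch.

```python
# Pure-Python model of the proposal. No real GPU is involved, and every
# name here (FakeDevice, DeviceMemory, to_gpu, launch) is hypothetical.

class FakeDevice:
    """Stands in for GPU memory and counts host<->device transfers."""

    def __init__(self):
        self.buffers = {}      # handle -> data "on the device"
        self.next_handle = 1
        self.h2d_copies = 0    # host-to-device transfers
        self.d2h_copies = 0    # device-to-host transfers

    def upload(self, host_data):
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = list(host_data)
        self.h2d_copies += 1
        return handle

    def download(self, handle):
        self.d2h_copies += 1
        return list(self.buffers[handle])

    def free(self, handle):
        del self.buffers[handle]


class DeviceMemory:
    """Holds only a device handle; the data itself stays on the device."""

    def __init__(self, device, host_data):
        self.device = device
        self.handle = device.upload(host_data)  # the one explicit upload

    def to_host(self):
        return self.device.download(self.handle)


def to_gpu(device, arg):
    """Return (handle, is_temporary) for a kernel argument."""
    if isinstance(arg, DeviceMemory):
        return arg.handle, False        # already on the device: no copy
    return device.upload(arg), True     # host-side list: copy on every launch


def launch(device, kernel, arg):
    handle, temporary = to_gpu(device, arg)
    kernel(device.buffers[handle])
    if temporary:
        arg[:] = device.download(handle)  # implicit copy back to the host
        device.free(handle)


def add_one(buf):
    """Toy 'kernel' that increments every element in place."""
    for i in range(len(buf)):
        buf[i] += 1


# Implicit model: a host list is copied in and out on each launch.
implicit = FakeDevice()
host = [0, 1, 2]
for _ in range(3):
    launch(implicit, add_one, host)

# Persistent model: upload once, launch repeatedly, download once.
persistent = FakeDevice()
d_a = DeviceMemory(persistent, [0, 1, 2])
for _ in range(3):
    launch(persistent, add_one, d_a)
result = d_a.to_host()
```

In this toy model, three launches on the host list cost three uploads and three downloads, while the DeviceMemory path costs one of each for the same result. A real version would also need to decide who owns and frees the allocation, which this sketch leaves out.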