Persistent GPU Memory Buffer Handling #790

BI71317 · 2026-04-07T07:36:09Z


Hi.

Summary

I wanted to raise a design question related to GPU memory handling in Codon.

Motivation

The motivation came up while running various benchmarks such as npbench.

What I realized is that, at the moment, there is no way to use a persistent device memory model.

Under the current structure, launching a kernel implicitly triggers host-to-device copies for the kernel arguments, and then copies data back afterward.

Because of that, it is difficult to benchmark or optimize the actual kernel execution separately from the host-device transfer overhead.

# in stdlib\internal\gpu.codon
def kernel_wrapper(...
...
    gpu_args = tuple(arg.__to_gpu__(cache) for arg in args)
    kernel_ptr = nvptx_function(static.function.realized(fn, *gpu_args).__llvm_name__)
    p = __ptr__(gpu_args).as_byte()
    arg_ptrs = tuple((p + offset) for offset in offsets(gpu_args))
    cuda_check(cuLaunchKernel(kernel_ptr,
                                u32(grid[0]), u32(grid[1]), u32(grid[2]),
                                u32(block[0]), u32(block[1]), u32(block[2]),
                                u32(shared_mem), cobj(),
                                __ptr__(arg_ptrs).as_byte(), cobj()))
    _tuple_from_gpu(args, gpu_args)
...

Expectation

What I would like is something closer to the explicit memory model used in CUDA Python ecosystems, for example:

from numba import cuda
import numpy as np

a = np.arange(10)
d_a = cuda.to_device(a)  # explicit host-to-device copy; d_a stays on the device

More specifically, I want to separate device memory allocation and kernel launch explicitly, so that kernels can run without always paying the host-to-device copy cost.

Solution

My current thought is that, instead of redesigning the entire GPU interface, we could introduce something like a DeviceMemory class in gpu.codon that only holds a device pointer / handle.

Then, when __to_gpu__ is called on such an object, unlike normal host-side types such as list or ndarray, it would not perform a copy but would simply return the underlying pointer or handle.

I wanted to ask what you think about this direction.

It seems like a relatively minimal extension to the current model, while still making persistent device allocation and copy-free kernel launches possible.
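To make the idea concrete, here is a rough sketch of the dispatch I have in mind, written as plain Python with a fake device so it runs anywhere. Everything in it is hypothetical: FakeDevice, HostArray, to_device, launch, and the DeviceMemory class itself are illustrative names, and the __from_gpu__ hook is only my stand-in for the copy-back step that _tuple_from_gpu performs. It is not Codon's actual GPU API; it only shows how a per-type __to_gpu__ could decide whether a copy happens.

```python
class FakeDevice:
    """Stand-in for GPU memory (no real GPU); counts transfers."""

    def __init__(self):
        self.buffers = {}      # handle -> device-side contents
        self.h2d_copies = 0    # host-to-device copies
        self.d2h_copies = 0    # device-to-host copies

    def upload(self, data):
        handle = len(self.buffers) + 1
        self.buffers[handle] = list(data)
        self.h2d_copies += 1
        return handle

    def download(self, handle):
        self.d2h_copies += 1
        return list(self.buffers[handle])


class HostArray:
    """Host-side data: copied to the device on every launch, and back after."""

    def __init__(self, data):
        self.data = list(data)

    def __to_gpu__(self, dev):
        return dev.upload(self.data)

    def __from_gpu__(self, dev, handle):
        self.data = dev.download(handle)


class DeviceMemory:
    """Hypothetical wrapper that only holds a device handle."""

    def __init__(self, handle):
        self.handle = handle

    def __to_gpu__(self, dev):
        return self.handle     # no copy: the data already lives on the device

    def __from_gpu__(self, dev, handle):
        pass                   # nothing to copy back; data stays on the device


def to_device(dev, data):
    """Explicit one-time upload (hypothetical helper)."""
    return DeviceMemory(dev.upload(data))


def launch(dev, kernel, *args):
    """Mirrors the wrapper's shape: convert args, run, convert back."""
    handles = tuple(a.__to_gpu__(dev) for a in args)
    kernel(dev, *handles)
    for a, h in zip(args, handles):
        a.__from_gpu__(dev, h)


def add_one(dev, h):
    """Toy 'kernel' that increments every element of a device buffer."""
    dev.buffers[h] = [x + 1 for x in dev.buffers[h]]


# Implicit model: every launch pays a copy in each direction.
dev = FakeDevice()
host = HostArray(range(4))
for _ in range(3):
    launch(dev, add_one, host)
implicit_copies = dev.h2d_copies

# Persistent model: one explicit upload, then copy-free launches.
dev2 = FakeDevice()
d_a = to_device(dev2, range(4))
for _ in range(3):
    launch(dev2, add_one, d_a)
persistent_copies = dev2.h2d_copies
result = dev2.download(d_a.handle)
```

With three launches, the host-side argument is uploaded three times, while the DeviceMemory argument is uploaded once and read back only when explicitly requested. The point of the sketch is that the launch path itself does not need to change; only the argument type decides whether a transfer occurs.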

Replies: 0 comments

Select a reply

Uh oh!

Persistent GPU Memory Buffer Handling #790

Uh oh!

BI71317 Apr 7, 2026

Summary

Motivation

Expectation

Solution

Replies: 0 comments

BI71317
Apr 7, 2026