
Conversation

AdUhTkJm (Contributor) commented Mar 9, 2025

This PR deals with several issues currently present in CUDA CodeGen. Each of them requires only a few lines to fix, so they're combined in a single PR.

**Bug 1.**

Suppose we write

```cpp
__global__ void kernel(int a, int b);
```

Then when we call this kernel with `cudaLaunchKernel`, the 4th argument to that function is something of the form `void *kernel_args[2] = {&a, &b}`. OG allocates the space for it with `alloca ptr, i32 2`, but that doesn't seem to be feasible in CIR, so we allocate `alloca [2 x ptr], i32 1` instead. This means there must be an extra GEP compared to OG.

In CIR, this means we must add an `array_to_ptrdecay` cast before accessing the array elements. I missed that in #1332.
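
For illustration, the packing looks roughly like this at the source level (a minimal sketch; `launch_sketch` and its parameters are made up for this example and are not the emitted code):

```cpp
#include <cuda_runtime.h>

// Minimal sketch of the argument packing done before calling cudaLaunchKernel.
// OG models kernel_args as `alloca ptr, i32 2`; CIR models it as
// `alloca [2 x ptr], i32 1`, so it needs one extra GEP -- the
// array_to_ptrdecay cast -- to obtain the `void **` element pointer.
void launch_sketch(const void *stub, int a, int b, dim3 grid, dim3 block,
                   size_t shmem, cudaStream_t stream) {
  void *kernel_args[2] = {&a, &b}; // the [2 x ptr] array in CIR
  void **argv = &kernel_args[0];   // array-to-pointer decay (the extra GEP)
  cudaLaunchKernel(stub, grid, block, argv, shmem, stream);
}
```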

**Bug 2.**

We missed a load instruction for the 6th argument to `cudaLaunchKernel`. It's added back in this PR.
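
For reference, this is the `cudaLaunchKernel` prototype (paraphrased from the CUDA runtime headers, without the decoration macros); the 6th parameter is the stream:

```cpp
// The 6th argument is the stream; its value has to be loaded from the local
// slot that __cudaPopCallConfiguration filled in.
cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem,
                             cudaStream_t stream);
```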

**Bug 3.**

When we launch a kernel, we first retrieve the return value of `__cudaPopCallConfiguration`. If it's zero, the call succeeded and we should proceed to call the device stub. In #1348 we did exactly the opposite, calling the device stub only when it's not zero. That's fixed here.
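
A self-contained sketch of the intended control flow (the types and the `__cudaPopCallConfiguration` declaration below are stand-ins so the snippet compiles on its own; real code gets them from the CUDA runtime headers, and `__device_stub__kernel` is just an illustrative name):

```cpp
#include <cstddef>

// Stand-in declarations for the sketch only.
struct dim3 { unsigned x = 1, y = 1, z = 1; };
extern "C" unsigned __cudaPopCallConfiguration(dim3 *gridDim, dim3 *blockDim,
                                               std::size_t *sharedMem,
                                               void *stream);
void __device_stub__kernel(int a, int b);

void launch_site_sketch(int a, int b) {
  dim3 grid, block;
  std::size_t shmem = 0;
  void *stream = nullptr;
  // Zero means the call configuration was popped successfully, so only then
  // do we go on to the device stub; #1348 had this branch inverted.
  if (__cudaPopCallConfiguration(&grid, &block, &shmem, &stream) == 0)
    __device_stub__kernel(a, b);
}
```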

**Issue 4.**

CallConvLowering is required to make `cudaLaunchKernel` correct. The code path is unblocked by adding a call to `getIndirectResult` at the same place OG does -- the function is already implemented, so we can just call it.

After this (and other pending PRs), CIR is now able to compile real CUDA programs. There are still missing features, which will be addressed in follow-ups.

bcardosolopes merged commit 182c680 into llvm:main on Mar 11, 2025
6 checks passed
lanza pushed a commit that referenced this pull request Mar 18, 2025
terapines-osc-cir pushed a commit to Terapines/clangir that referenced this pull request Sep 2, 2025