Description
In #1438, I said we could delay address space handling to lowering, rather than emitting it in CIR. I found both the OG way and the new way quite difficult, so I'd like to raise some discussion about how to do it.
How OG does it
OG tries to infer the address space from the type, or more precisely, from `__attribute__((address_space(n)))`. It defaults to 0 when that's not present (but for OpenCL, it seems to be `opencl_global` for global variables and `opencl_local` for static variables).
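For illustration, this is the kind of declaration OG keys off of (a minimal sketch; the variable names are made up, and what the number means is target-specific):

```cpp
// A global explicitly qualified with the attribute OG looks for:
// OG will emit this variable in address space 1.
int __attribute__((address_space(1))) g;

// No attribute: the global defaults to address space 0.
int plain;
```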
On CUDA, things are different. Take `__device__` variables as an example. The macro `__device__` is not `__attribute__((address_space(1)))` -- rather, it's simply `__attribute__((device))`. This means OG will always treat all variables as if they were stored in address space 0. This is sound (and required in many cases), as every address space can be converted to the generic space in CUDA, but it sometimes results in unnecessary casts:
```cpp
__device__ int x; // address space 1
__device__ void f() {
  int i; // address space 0
  x = i; // ??
}
```
Ideally, this should translate to:
```llvm
%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspace(1) @x, align 4
```
But in fact, LLVM thinks it should be in address space 0, so it casts the address into that space:
```llvm
%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspacecast (ptr addrspace(1) @x to ptr), align 4
```
Difficulties
This would require rework in a lot of places. Currently, OG does this in `getOrCreate{Global,Static}Op`, which returns an `llvm::Constant*`. In CIR these functions return a `cir::GlobalOp`, and we can't easily attach a cast to that. We would need to insert the cast after `cir.get_global`, which is much harder to track (and I haven't come up with a way to track them).
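For context, the reason this is easy in OG is that the result is a constant, so the cast can be folded in right where the address is produced. A minimal sketch of that shape (simplified; not the actual OG code path, which goes through more machinery):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"

// Wrap a global that lives in a non-generic address space in a constant
// addrspacecast, so every user simply sees a generic pointer.
static llvm::Constant *castToGenericAS(llvm::GlobalVariable *GV) {
  auto *GenericPtrTy =
      llvm::PointerType::get(GV->getContext(), /*AddressSpace=*/0);
  if (GV->getType() == GenericPtrTy)
    return GV;
  return llvm::ConstantExpr::getAddrSpaceCast(GV, GenericPtrTy);
}
```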
We might redesign `GlobalOp` so that it carries both the expected address space and the attribute-decided one, and take that into account when lowering `cir.get_global`. I'm not sure if that's a good idea -- I haven't tried it yet.
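To make the idea concrete, here is a rough sketch of what the lowering decision could look like if `GlobalOp` carried both address spaces. Everything below is hypothetical (the `declAS`/`exprAS` parameters stand for the two attributes; none of this is existing CIR API):

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/Builders.h"

// Lower a (hypothetical) dual-address-space cir.get_global: take the address
// in the space the global actually lives in, then cast only if the
// surrounding code was generated against a different space.
static mlir::Value lowerGetGlobal(mlir::OpBuilder &b, mlir::Location loc,
                                  mlir::LLVM::GlobalOp llvmGlobal,
                                  unsigned declAS, unsigned exprAS) {
  mlir::Value addr = b.create<mlir::LLVM::AddressOfOp>(loc, llvmGlobal);
  if (declAS == exprAS)
    return addr;
  auto expectedTy = mlir::LLVM::LLVMPointerType::get(b.getContext(), exprAS);
  return b.create<mlir::LLVM::AddrSpaceCastOp>(loc, expectedTy, addr);
}
```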
An alternative way
We can defer inserting casts until we really need them. We might add a pass to detect the places where a conversion is needed. In the future, we might also be able to assign each variable a suitable address space, so as to end up with the fewest `cvta` instructions in the generated PTX. (Note that the NVPTX backend already has some ability to do this; can we do better?)
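A skeleton of what such a pass could look like, phrased over the LLVM dialect for concreteness (a real implementation would presumably run on CIR). The actual detection logic -- deciding which uses genuinely need a generic pointer -- is exactly the open question; the placeholder predicate below conservatively says "yes" for every use, which just reproduces OG's behaviour:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"

// Placeholder for the open question: does this use really need a pointer in
// the generic address space?
static bool useNeedsGenericPtr(mlir::OpOperand &use) {
  (void)use;
  return true; // conservative: always convert back to generic
}

void insertAddrSpaceCasts(mlir::ModuleOp module) {
  module.walk([](mlir::LLVM::AddressOfOp addrOf) {
    auto ptrTy = mlir::cast<mlir::LLVM::LLVMPointerType>(
        addrOf.getResult().getType());
    if (ptrTy.getAddressSpace() == 0)
      return; // already generic, nothing to do
    mlir::OpBuilder b(addrOf.getContext());
    b.setInsertionPointAfter(addrOf);
    auto genericTy =
        mlir::LLVM::LLVMPointerType::get(addrOf.getContext(), /*AS=*/0);
    mlir::Value casted = b.create<mlir::LLVM::AddrSpaceCastOp>(
        addrOf.getLoc(), genericTy, addrOf.getResult());
    // Redirect only the uses that actually need the generic pointer.
    for (mlir::OpOperand &use :
         llvm::make_early_inc_range(addrOf.getResult().getUses())) {
      if (use.getOwner() != casted.getDefiningOp() && useNeedsGenericPtr(use))
        use.set(casted);
    }
  });
}
```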
Difficulties
First of all, this works only for CUDA. Currently, it seems that only CUDA hits this problem where the address space and the type don't match, but I don't know whether that could happen elsewhere in the future. Moreover, it's also not obvious how we would detect the places that need address space casts.
Not normalising these address spaces might also violate other invariants, which are hard to track down. For example, we currently emit a bitcast when a function receives an argument whose type differs from the parameter type. As `ptr addrspace(1)` and `ptr` are indeed different types, we would try to emit a bitcast there, which is wrong.
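For concreteness, a minimal CUDA example of the pattern that would hit that rule (`takesGeneric` is a made-up name):

```cpp
__device__ int x;                     // lives in addrspace(1) after lowering

__device__ void takesGeneric(int *p); // parameter is a generic pointer

__device__ void caller() {
  // If &x is kept as `ptr addrspace(1)`, the argument type no longer matches
  // the parameter type, and the rule above would emit a bitcast where an
  // addrspacecast is actually required.
  takesGeneric(&x);
}
```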