
[RFC][CUDA] Address space pass #1490

@AdUhTkJm


In #1438, I said we could delay address space handling until lowering, rather than emitting it in CIR. I found both the OG way and the new way quite difficult, so I'd like to raise some discussion about how to do this.

How OG does it

OG tries to infer the address space from the type, or more precisely, from __attribute__((address_space(n))). It defaults to 0 when that attribute is not present (but for OpenCL, it seems to be opencl_global for global variables and opencl_local for static variables).
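For reference, that attribute-driven inference looks like this (a minimal illustration; the address space number is arbitrary):

// OG reads the address space directly off the type.
__attribute__((address_space(3))) int a; // @a is emitted in addrspace(3)
int b;                                   // no attribute: defaults to addrspace(0)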

On CUDA, things are different. Take __device__ variables as an example. The macro __device__ does not expand to __attribute__((address_space(1))); rather, it is simply __attribute__((device)). This means OG always treats such variables as if they were stored in address space 0. That is sound (and required in many cases), since every address space can be converted to the generic space in CUDA, but it sometimes results in unnecessary casts:

__device__ int x; // address space 1
__device__ void f() {
  int i; // address space 0
  x = i; // ??
}

Ideally, this should translate to:

%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspace(1) @x, align 4

But in fact, OG thinks @x should be in address space 0 (that is what its AST type says), so it casts the pointer into that space:

%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspacecast (ptr addrspace(1) @x to ptr), align 4

Difficulties

This would require rework in a lot of places. OG currently does this in getOrCreate{Global,Static}Op, which returns an llvm::Constant* that the cast can be applied to directly. In CIR these functions return a cir::GlobalOp, and we can't easily cast that; we would need to insert the cast after each cir.get_global, which is much harder to track (and I haven't come up with a way to track them).
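To illustrate why this is easy on an llvm::Constant*, here is a simplified sketch (not the actual OG code; getCastedGlobal is a made-up name):

#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"

// GV might be `ptr addrspace(1) @x` while the caller expects a generic
// `ptr`; a constant addrspacecast fixes that on the spot.
llvm::Constant *getCastedGlobal(llvm::Constant *GV,
                                llvm::PointerType *ExpectedTy) {
  if (GV->getType() != ExpectedTy)
    GV = llvm::ConstantExpr::getAddrSpaceCast(GV, ExpectedTy);
  return GV;
}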

We might redesign GlobalOp so that it carries both the expected address space and the attribute-decided one, and take both into account when lowering cir.get_global. I'm not sure whether that's a good idea; I haven't tried it yet.

An alternative way

We can defer inserting casts until we really need them. We could add a pass that detects the places where a conversion is needed; see the sketch below. In the future we might also be able to assign each variable a suitable address space, so as to minimise the number of cvta instructions in the resulting PTX. (Note that the NVPTX backend does have some capability here; can we do better?)
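A minimal sketch of the shape such a pass could take, assuming hypothetical helpers expectedAddrSpace, addrSpaceOf and createAddrSpaceCast (none of these exist in ClangIR under these names; this shows the idea, not an implementation):

#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

// Hypothetical helpers: stand-ins for logic that would need to be written.
unsigned expectedAddrSpace(mlir::OpOperand &use); // space this use requires
unsigned addrSpaceOf(mlir::Type ty);              // space a pointer type is in
mlir::Value createAddrSpaceCast(mlir::OpBuilder &b, mlir::Value v, unsigned to);

namespace {
struct CUDAAddrSpacePass
    : mlir::PassWrapper<CUDAAddrSpacePass,
                        mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    getOperation().walk([](mlir::Operation *op) {
      for (mlir::OpOperand &use : op->getOpOperands()) {
        unsigned expected = expectedAddrSpace(use);
        if (expected == addrSpaceOf(use.get().getType()))
          continue; // spaces already agree; nothing to do
        mlir::OpBuilder builder(op); // insert right before the user
        use.set(createAddrSpaceCast(builder, use.get(), expected));
      }
    });
  }
};
} // namespace

This would only touch the uses that actually need a conversion, leaving every other global in its natural address space.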

Difficulties

First of all, this works only for CUDA. Currently it seems that only CUDA runs into this mismatch between address space and type, but I don't know whether other targets will in the future. Moreover, it's not obvious how we would detect the places that need address space casts.

Not normalising these address spaces might also violate other invariants, which are hard to enumerate. For example, we currently emit a bitcast when a function receives an argument whose type differs from the parameter's type. Since ptr addrspace(1) and ptr are indeed different types, we would emit a bitcast there, which is wrong.
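To make that concrete (the same pattern as the earlier example; the exact codepath may differ):

__device__ int x;          // would be ptr addrspace(1) if left un-normalised
__device__ void g(int *p); // parameter is a generic pointer

__device__ void f() {
  g(&x); // argument: ptr addrspace(1); parameter: ptr. A bitcast between
         // pointers in different address spaces is invalid LLVM IR; an
         // addrspacecast is required instead.
}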
