
[RFC][CUDA] Address space pass #1490

@AdUhTkJm


In #1438, I said we could delay address space handling until lowering, rather than emitting it in CIR. I found both the OG way and the new way quite difficult, so I'd like to raise some discussion about how to do this.

How OG does it

OG tries to infer the address space from the type, or more precisely, from __attribute__((address_space(n))). It defaults to 0 when that attribute is not present (but for OpenCL, it seems to be opencl_global for global variables and opencl_local for static variables).
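For reference, that attribute-driven inference looks like this (a minimal illustration; the address space number is arbitrary):

// OG reads the address space directly off the type.
__attribute__((address_space(3))) int a; // @a is emitted in addrspace(3)
int b;                                   // no attribute: defaults to addrspace(0)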

On CUDA, things are different. Take __device__ variables as an example. The macro __device__ does not expand to __attribute__((address_space(1))); rather, it is simply __attribute__((device)). This means OG always treats such variables as if they were stored in address space 0. That is sound (and required in many cases), since every address space can be converted to the generic space in CUDA, but it sometimes results in unnecessary casts:

__device__ int x; // address space 1
__device__ void f() {
  int i; // address space 0
  x = i; // ??
}

Ideally, this should translate to:

%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspace(1) @x, align 4

But in fact, OG thinks @x should be in address space 0 (that is what its AST type says), so it casts the pointer into that space:

%2 = load i32, ptr %i.alloca, align 4
store i32 %2, ptr addrspacecast (ptr addrspace(1) @x to ptr), align 4

Difficulties

This would require rework in a lot of places. OG currently does this in getOrCreate{Global,Static}Op, which returns an llvm::Constant* that the cast can be applied to directly. In CIR these functions return a cir::GlobalOp, and we can't easily cast that; we would need to insert the cast after each cir.get_global, which is much harder to track (and I haven't come up with a way to track them).
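To illustrate why this is easy on an llvm::Constant*, here is a simplified sketch (not the actual OG code; getCastedGlobal is a made-up name):

#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"

// GV might be `ptr addrspace(1) @x` while the caller expects a generic
// `ptr`; a constant addrspacecast fixes that on the spot.
llvm::Constant *getCastedGlobal(llvm::Constant *GV,
                                llvm::PointerType *ExpectedTy) {
  if (GV->getType() != ExpectedTy)
    GV = llvm::ConstantExpr::getAddrSpaceCast(GV, ExpectedTy);
  return GV;
}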

We might redesign GlobalOp so that it carries both the expected address space and the attribute-decided one, and take both into account when lowering cir.get_global. I'm not sure whether that's a good idea; I haven't tried it yet.

An alternative way

We can defer inserting casts until we really need them. We could add a pass that detects the places where a conversion is needed; see the sketch below. In the future we might also be able to assign each variable a suitable address space, so as to minimise the number of cvta instructions in the resulting PTX. (Note that the NVPTX backend does have some capability here; can we do better?)
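A minimal sketch of the shape such a pass could take, assuming hypothetical helpers expectedAddrSpace, addrSpaceOf and createAddrSpaceCast (none of these exist in ClangIR under these names; this shows the idea, not an implementation):

#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

// Hypothetical helpers: stand-ins for logic that would need to be written.
unsigned expectedAddrSpace(mlir::OpOperand &use); // space this use requires
unsigned addrSpaceOf(mlir::Type ty);              // space a pointer type is in
mlir::Value createAddrSpaceCast(mlir::OpBuilder &b, mlir::Value v, unsigned to);

namespace {
struct CUDAAddrSpacePass
    : mlir::PassWrapper<CUDAAddrSpacePass,
                        mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    getOperation().walk([](mlir::Operation *op) {
      for (mlir::OpOperand &use : op->getOpOperands()) {
        unsigned expected = expectedAddrSpace(use);
        if (expected == addrSpaceOf(use.get().getType()))
          continue; // spaces already agree; nothing to do
        mlir::OpBuilder builder(op); // insert right before the user
        use.set(createAddrSpaceCast(builder, use.get(), expected));
      }
    });
  }
};
} // namespace

This would only touch the uses that actually need a conversion, leaving every other global in its natural address space.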

Difficulties

First of all, this works only for CUDA. Currently it seems that only CUDA runs into this mismatch between address space and type, but I don't know whether other targets will in the future. Moreover, it's not obvious how we would detect the places that need address space casts.

Not normalising these address spaces might also violate other invariants, which are hard to enumerate. For example, we currently emit a bitcast when a function receives an argument whose type differs from the parameter's type. Since ptr addrspace(1) and ptr are indeed different types, we would emit a bitcast there, which is wrong.
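To make that concrete (the same pattern as the earlier example; the exact codepath may differ):

__device__ int x;          // would be ptr addrspace(1) if left un-normalised
__device__ void g(int *p); // parameter is a generic pointer

__device__ void f() {
  g(&x); // argument: ptr addrspace(1); parameter: ptr. A bitcast between
         // pointers in different address spaces is invalid LLVM IR; an
         // addrspacecast is required instead.
}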
