Skip to content
This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Conversation

@jlebar
Copy link

@jlebar jlebar commented Sep 14, 2016

…-only.

clang is about to get pickier about disallowing references to things
from host+device code when it won't work on either host or device.

clang doesn't currently support launching kernels from the device side,
thus these __host__ __device__ functions that take a function pointer to
__global__ functions are no good.

It doesn't look like Thrust tries to use nested kernels, so the
__device__ attribute appears unnecessary here. But if you like we could
make these functions __host__ __device__ for compilers other than clang.

@jlebar
Copy link
Author

jlebar commented Sep 14, 2016

cc @Artem-B

@jaredhoberock
Copy link
Contributor

Thanks Justin. This won't quite work for us because Thrust does actually launch kernels from __device__ functions :-) It's possible to use nested Thrust algorithms when the compiler generates relocatable device code and the SM arch is >= sm_35.

I think what we ought to do to make this work for Clang is to retain the annotations of these functions but also guard these places where the address of a __global__ function is taken with a preprocessor macro. I think nvcc already defines a preprocessor symbol that indicates whether kernel launches are enabled for __device__ functions. Basically this macro (whose name I can't recall at the moment) indicates whether "Dynamic Parallelism" is available.

Does Clang also provide this macro?

@jlebar
Copy link
Author

jlebar commented Sep 14, 2016

This won't quite work for us because Thrust does actually launch kernels from device functions :-)

Interesting, I would have expected this not to compile with clang either, then. (We are building with sm_35.)

In any case, let's figure out that macro, if it exists. I don't see it in the nvcc docs [1], and I can't figure out how to print all #defines from nvcc [2]. Any other ideas?

[1] http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/
[2] With clang, it's -E -x clang -dM /dev/null

@3gx
Copy link
Contributor

3gx commented Sep 14, 2016

nvcc defines CUDACC_RDC macro in a separable compilation mode (-rdc=true). Together with CUDA_ARCH >= 350, it indicates that device-side kernel launches are permitted.

Thrust has unit tests that check device-side launches [1]. Device-side launches will be used only with -rdc=true -arch=sm_35 (or higher).

A macro like this [2] can be used to annotate a function. If device-side launches are supported, it will annotate a function with host device, otheriwse it will be just host

[1] https://github.com/thrust/thrust/blob/master/testing/backend/cuda/for_each.cu#L63
[2] https://github.com/NVlabs/cub/blob/1.5.2/cub/util_arch.cuh#L58

@jaredhoberock
Copy link
Contributor

jaredhoberock commented Sep 14, 2016

Thanks Evghenii.

I will probably be less disruptive on the overall code to retain the annotations (I'm worried that will lead to attempting to call a __host__ function from a __host__ __device__ function, for example). So I think all we need to do is to guard taking the address with something like #if defined(__CUDACC_RDC__). The #else branch can just return 0.

…-only.

clang is about to get pickier about disallowing references to things
from host+device code when it won't work on either host or device.

clang doesn't currently support launching kernels from the device side,
thus these __host__ __device functions that take a function pointer to
__global__ functions when __CUDA_RDC__ is not defined are no good.
@jlebar
Copy link
Author

jlebar commented Sep 14, 2016

I will probably be less disruptive on the overall code to retain the annotations (I'm worried that will lead to attempting to call a host function from a host device function, for example). So I think all we need to do is to guard taking the address with something like #if defined(CUDACC_RDC). The #else branch can just return 0.

WFM, updated the patch.

@jaredhoberock jaredhoberock merged commit 1ca24ce into NVIDIA:master Sep 14, 2016
{
// Don't try to take the address of launch_by_value from the device side if
// we don't support launching kernels from __device__ functions.
#if !defined(__CUDA_ARCH__) || defined(__CUDACC_RDC__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct condition should be

#if !defined(__CUDA_ARCH__) || (defined(__CUDACC_RDC__) && __CUDA_ARCH >= 350)

@jaredhoberock
Copy link
Contributor

Sorry, I should have waited for Evghenii's review before merging this.

@jlebar: Could you resubmit a PR with the corrected guard?

jlebar pushed a commit to jlebar/thrust that referenced this pull request Sep 15, 2016
…-only.

clang is about to get pickier about disallowing references to things
from host+device code when it won't work on either host or device.

clang doesn't currently support launching kernels from the device side,
thus these __host__ __device functions that take a function pointer to
__global__ functions when __CUDA_RDC__ is not defined are no good.

Originally landed as 98b4e16, reverted in 884d199 because the
condition was wrong. See
NVIDIA#831 (review)
@jlebar
Copy link
Author

jlebar commented Sep 15, 2016

#835

brycelelbach pushed a commit that referenced this pull request May 16, 2020
…-only.

clang is about to get pickier about disallowing references to things
from host+device code when it won't work on either host or device.

clang doesn't currently support launching kernels from the device side,
thus these __host__ __device functions that take a function pointer to
__global__ functions when __CUDA_RDC__ is not defined are no good.

Originally landed as 98b4e16, reverted in 884d199 because the
condition was wrong. See
#831 (review)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants