
Hang in CPU-assisted IBGDA mode #509

@YangZhou1997

Description


Hi DeepEP developers!

I am trying to run DeepEP under CPU-assisted IBGDA mode following https://github.com/deepseek-ai/DeepEP/blob/main/third-party/README.md#22-install-gdrcopy-and-load-the-gdrdrv-kernel-module. Both gdrcopy_sanity and nvshmrun -n 2 ./shmem_put_bw work, but the LL and normal kernel tests both hang after:

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
Buffer initialized.

The servers are a standard Nebius cloud 2-node H100 + CX7 InfiniBand testbed, with NVSHMEM installed via pip install nvidia-nvshmem-cu12 (version 3.4.5, IIRC). Any thoughts or suggestions on this? My goal is basically to measure the performance of the LL and normal kernels under CPU-assisted IBGDA. The way I select the CPU-assisted path and launch the tests is sketched below.
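For reference, this is roughly my launch setup; the NVSHMEM_IBGDA_NIC_HANDLER=cpu setting and the exact test invocations are my reading of the third-party README and the DeepEP tests directory, so please correct me if the intended configuration differs:

# sketch of my environment and launch commands (assumed, not copied verbatim from the README)
export NVSHMEM_IB_ENABLE_IBGDA=1        # enable the IBGDA transport
export NVSHMEM_IBGDA_NIC_HANDLER=cpu    # have the CPU handle NIC doorbells instead of the GPU
python tests/test_low_latency.py        # LL kernels: hangs after "Buffer initialized."
python tests/test_internode.py          # normal kernels: same hang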

Correct me if I'm wrong: I saw some issues mentioning that the latest DeepEP does not support IBRC, but I assume it still supports CPU-assisted IBGDA, since that mode exposes the same interface as native IBGDA.

A side question: what exactly is the security vulnerability with native IBGDA? What is the root cause, and what are the security implications, especially in a public cloud? I saw this blog post, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/#cpu-assisted_infiniband_gpu_direct_async, but it does not go into much depth, and I could not find a more detailed description elsewhere.

Best,
Yang
