Hi DeepEP developers!
I am trying to run DeepEP in CPU-assisted IBGDA mode, following https://github.com/deepseek-ai/DeepEP/blob/main/third-party/README.md#22-install-gdrcopy-and-load-the-gdrdrv-kernel-module. Both gdrcopy_sanity and nvshmrun -n 2 ./shmem_put_bw work, but the LL and normal kernel tests both hang after printing:
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
Buffer initialized.
The servers are a standard Nebius cloud 2-node testbed (H100 + CX7, InfiniBand), with NVSHMEM installed via pip install nvidia-nvshmem-cu12 (should be version 3.4.5, IIRC). Any thoughts or suggestions? My goal is to measure the performance of the LL and normal kernels with CPU-assisted IBGDA.
Correct me if I am wrong: I saw some issues mentioning that the latest DeepEP does not support IBRC, but I assume it should support CPU-assisted IBGDA, which should expose the same interface as native IBGDA.
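For anyone trying to reproduce this, here is a minimal sketch of how the CPU handler could be requested explicitly rather than reached through the fallback warnings above. It assumes the NVSHMEM environment variables NVSHMEM_IB_ENABLE_IBGDA and NVSHMEM_IBGDA_NIC_HANDLER are honored by the installed NVSHMEM build and are not overridden later by DeepEP; the helper names are hypothetical, and this is not a confirmed fix for the hang.

```python
import os


def cpu_assisted_ibgda_env() -> dict:
    """Hypothetical helper: NVSHMEM settings that request CPU-assisted IBGDA.

    Assumption: the installed NVSHMEM build reads these variables at init time.
    """
    return {
        "NVSHMEM_IB_ENABLE_IBGDA": "1",      # use the IBGDA transport
        "NVSHMEM_IBGDA_NIC_HANDLER": "cpu",  # let the CPU ring the NIC doorbell
    }


def apply_env(env: dict) -> None:
    """Export the settings into this process before NVSHMEM is initialized."""
    for key, value in env.items():
        os.environ[key] = value


# Must run before importing or initializing anything that starts NVSHMEM.
apply_env(cpu_assisted_ibgda_env())
```

The same two variables can instead be exported in the shell that launches the test scripts; setting them in Python only makes sense if it happens before the buffer is created.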
A side question: what exactly is the security vulnerability when using native IBGDA? What is the root cause, and what are the security implications, especially in a public cloud? I saw this blog post, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/#cpu-assisted_infiniband_gpu_direct_async%C2%A0, but it does not go into detail, and I could not find a more thorough description elsewhere.
Best,
Yang