
Hang in CPU-assisted IBGDA mode #509

@YangZhou1997

Description


Hi DeepEP developers!

I am trying to run DeepEP under CPU-assisted IBGDA mode following https://github.com/deepseek-ai/DeepEP/blob/main/third-party/README.md#22-install-gdrcopy-and-load-the-gdrdrv-kernel-module. Both gdrcopy_sanity and nvshmrun -n 2 ./shmem_put_bw work, but the LL and normal kernel tests both hang after:

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
Buffer initialized.

The servers are a standard Nebius cloud 2-node H100 + CX7 InfiniBand testbed, with NVSHMEM installed via pip install nvidia-nvshmem-cu12 (version 3.4.5, IIRC). Any thoughts or suggestions on this? My goal is basically to measure the performance of the LL and normal kernels under CPU-assisted IBGDA. The way I select the CPU-assisted path and launch the tests is sketched below.
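For reference, this is roughly my launch setup; the NVSHMEM_IBGDA_NIC_HANDLER=cpu setting and the exact test invocations are my reading of the third-party README and the DeepEP tests directory, so please correct me if the intended configuration differs:

# sketch of my environment and launch commands (assumed, not copied verbatim from the README)
export NVSHMEM_IB_ENABLE_IBGDA=1        # enable the IBGDA transport
export NVSHMEM_IBGDA_NIC_HANDLER=cpu    # have the CPU handle NIC doorbells instead of the GPU
python tests/test_low_latency.py        # LL kernels: hangs after "Buffer initialized."
python tests/test_internode.py          # normal kernels: same hang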

Correct me if I'm wrong: I saw some issues mentioning that the latest DeepEP does not support IBRC, but I assume it still supports CPU-assisted IBGDA, since that mode exposes the same interface as native IBGDA.

A side question: what exactly is the security vulnerability with native IBGDA? What is the root cause, and what are the security implications, especially in a public cloud? I saw this blog post, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/#cpu-assisted_infiniband_gpu_direct_async, but it does not go into much depth, and I could not find a more detailed description elsewhere.

Best,
Yang
