# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
Node setup is handled by Python job submission scripts that use Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks (a rough sketch of this kind of monitoring loop follows this list)
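
The actual contents of `scripts/monitor_gpu_utilization.sh` are not reproduced here. As a rough sketch, this kind of monitoring amounts to a polling loop of roughly the following shape, assuming `nvidia-smi` is available on the compute nodes (the log file path and interval below are placeholders, not the script's real defaults):

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the actual scripts/monitor_gpu_utilization.sh.
# Appends timestamped per-GPU utilization/memory samples to a log at a fixed interval.
LOG_FILE="${1:-gpu_utilization.log}"   # placeholder: where to write samples
INTERVAL_SECONDS="${2:-5}"             # placeholder: seconds between samples

while true; do
  nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
             --format=csv,noheader >> "${LOG_FILE}"
  sleep "${INTERVAL_SECONDS}"
done
```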

## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/                                  # Job ID directory
│   ├── log.out                               # Main job output (node allocation, IP addresses, launch commands)
│   ├── log.err                               # Main job errors
│   ├── node0197_prefill.out                  # Prefill node stdout (node0197)
│   ├── node0197_prefill.err                  # Prefill node stderr (node0197)
│   ├── node0200_prefill.out                  # Prefill node stdout (node0200)
│   ├── node0200_prefill.err                  # Prefill node stderr (node0200)
│   ├── node0201_decode.out                   # Decode node stdout (node0201)
│   ├── node0201_decode.err                   # Decode node stderr (node0201)
│   ├── node0204_decode.out                   # Decode node stdout (node0204)
│   ├── node0204_decode.err                   # Decode node stderr (node0204)
│   ├── node0197_prefill_gpu_utilization.log  # GPU utilization monitoring (node0197)
│   ├── node0200_prefill_gpu_utilization.log  # GPU utilization monitoring (node0200)
│   ├── node0201_decode_gpu_utilization.log   # GPU utilization monitoring (node0201)
│   └── node0204_decode_gpu_utilization.log   # GPU utilization monitoring (node0204)
├── 3063137/                                  # Another job ID directory
├── 3062689/                                  # Another job ID directory
└── ...
```
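
If you have submitted several jobs, one quick way to locate the most recent job's logs (assuming the newest directory under `logs/` belongs to the latest job) is:

```bash
# Pick the most recently modified job directory under logs/ and tail its main output.
LATEST_JOB_DIR="logs/$(ls -t logs/ | head -n 1)"
tail -n 50 "${LATEST_JOB_DIR}/log.out"
```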

## Setup

For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
   available. For functional testing, most setups should be fine. For performance
   testing, you should aim to allocate groups of nodes with high-performance
   interconnects between them, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
   SPANK plugin set up. In particular, the `job_script_template.j2` template in this
   example uses `srun` arguments like `--container-image`,
   `--container-mounts`, and `--container-env` that Pyxis adds to `srun`
   (an illustrative invocation is shown after this list).
   If your cluster supports a similar container-based plugin, you may be able to
   modify the template to use it instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
   described [here](../dsr1-wideep.md#instructions).
   This is the image that can be passed to the `--container-image` argument in later steps.
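
To make assumption 2 concrete, the generated job script ends up running `srun` commands of roughly the following shape. This is an illustrative sketch rather than an excerpt from `job_script_template.j2`; the image URI, mount paths, environment variable name, and inner command are all placeholders:

```bash
# Illustrative Pyxis-style srun invocation; all values below are placeholders.
# The real command lines are rendered from job_script_template.j2.
srun --nodes=1 \
     --container-image=registry/repository:tag \
     --container-mounts=/path/to/model:/model,/path/to/configs:/configs \
     --container-env=MY_ENV_VAR \
     bash -c "nvidia-smi -L"
```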

## Usage

1. **Submit a benchmark job**:
   ```bash
   python submit_job_script.py \
     --template job_script_template.j2 \
     --model-dir /path/to/model \
     --config-dir /path/to/configs \
     --container-image container-image-uri \
     --account your-slurm-account
   ```

   **Required arguments**:
   - `--template`: Path to Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
   - `--container-image`: Container image URI (e.g., `registry/repository:tag`)
   - `--account`: SLURM account

   **Optional arguments**:
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

   **Note**: The script automatically calculates the total number of nodes needed from the `--prefill-nodes` and `--decode-nodes` parameters (see the fuller example after these steps).

2. **Monitor job progress**:
   ```bash
   squeue -u $USER
   ```

3. **Check logs in real-time**:
   ```bash
   tail -f logs/{JOB_ID}/log.out
   ```

4. **Monitor GPU utilization**:
   ```bash
   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
   ```
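
For reference, a submission that also spells out the optional arguments might look like the following; the paths, image URI, and account are placeholders. With `--prefill-nodes 2` and `--decode-nodes 2`, the script requests 4 nodes in total:

```bash
# Placeholder paths, image, and account -- substitute values for your cluster.
# Total node count is computed by the script: 2 prefill + 2 decode = 4 nodes.
python submit_job_script.py \
  --template job_script_template.j2 \
  --model-dir /path/to/model \
  --config-dir /path/to/configs \
  --container-image registry/repository:tag \
  --account your-slurm-account \
  --prefill-nodes 2 \
  --decode-nodes 2 \
  --gpus-per-node 8 \
  --network-interface eth3 \
  --job-name dynamo_setup \
  --time-limit 01:00:00
```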

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.