fix: prevent crash looping hello world #2625 #2670
Closed
Changes from 1 commit

43 commits:

- `992adfb` fix: add better port logic (#2175) (#2192) (alec-flowers)
- `9a93f11` chore: fix install (#2191) (ishandhanani)
- `2a616da` chore: fix QA bugs in documentation/readmes (#2199) (athreesh)
- `d0de1a0` feat: Add trtllm deploy examples for k8s #2133 (#2207) (biswapanda)
- `edccbd5` fix(sglang): disagg yaml worker change and agg kv router fix (#2205) (ishandhanani)
- `54fbff3` fix: add curl and jq for health checks #2203 (#2209) (biswapanda)
- `a9b6b28` fix: Kprashanth/trtllm rc4 cherry pick (#2218) (KrishnanPrash)
- `65e89b3` chore: cleanup dead links (#2208) (nealvaidya)
- `c92dc98` chore: update nixl version to 0.4.1 (#2221) (#2228) (nv-anants)
- `eb58916` chore: Remove multimodal readme. (#2212) (#2234) (krishung5)
- `e848cf5` fix: Cherry pick pr 2186 release 0.4.0 to fix docs/runtime/README.md … (keivenchang)
- `5e3586d` fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235) (ishandhanani)
- `4fbb4e5` fix: handle groveTerminationDelay and auto-detect grove installation … (julienmancuso)
- `dc13774` fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.… (dmitry-tokarev-nv)
- `e5e94ad` fix: sgl instructions point to new frontend (#2245) (ishandhanani)
- `92781d3` fix: Update disagg configs for trtllm 1.0.0rc4 changes (release/0.4.0… (rmccorm4)
- `58ad4a2` fix: readme instruction (#2265) (ishandhanani)
- `039c061` fix: Update eagle_one configs with speculative_model_dir field (#2283) (rmccorm4)
- `2a8e251` docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251) (atchernych)
- `2dc4a4b` fix: trtllm container - ENV var used before declaration (#2277) (dmitry-tokarev-nv)
- `85737ba` fix: Update the NIXL TRTLLM commit version to rc4 (#2285) (tanmayv25)
- `27c8a97` docs: add instruction to deploy model with inference gateway #2257 (#… (biswapanda)
- `641e49d` fix: fix nil pointer deref in dynamo controller (#2293) (#2299) (mohammedabdulwahhab)
- `1b145bb` fix: fix broken doc links (#2308) (biswapanda)
- `4e4818f` fix: Copy cuda libraries from devel to runtime stage (#2298) (nv-tusharma)
- `c92c1f4` docs: update deploy readme (#2306) (atchernych)
- `6fce98a` fix: Add common and test dependencies to sglang runtime build (#2279)… (nv-tusharma)
- `035d6d8` fix: Revert the commit for DeepGEMM to fix vLLM WideEP (#2302) (#2325) (krishung5)
- `167c793` fix: Backport/anish index rst into 0.4.0 - fix links in docs and more… (athreesh)
- `409aa9e` docs: Final fixes to links reported by QA (#2334) (athreesh)
- `71126c7` fix: nil pointer deref in dynamo controller (#2335) (mohammedabdulwahhab)
- `f342c30` docs: address sphinx build errors for docs.nvidia.com (#2346) (athreesh)
- `96d1f15` docs: Address vincent issue with trtllm symlink (#2351) (athreesh)
- `e8b37a6` fix: ARM Flashinfer Versioning for 0.4.0 Release (#2363) (zaristei)
- `b5c9278` fix: Pinned PyTorch version for vLLM container (#2356) (krishung5)
- `b0c1a24` chore: ATTRIBUTIONS-Go.md (#2355) (dmitry-tokarev-nv)
- `0cf8041` Revert "adjust tag to accomodate flashinfer versioning typo" (#2364) (zaristei)
- `bd8e368` fix: use wheel files for installation in trtllm build (#2372) (#2375) (nv-anants)
- `73bcc3b` fix(build): Pin cuda-python>=12,<13 to avoid trtllm breakage (#2379) (rmccorm4)
- `aa57c6b` fix: turn off kvbm for al2023 support (#2533) (saturley-hall)
- `3f0a725` docs: add trtllm known issue for al2023 (#2604) (#2612) (nv-anants)
- `d98a791` docs: update trtllm know issue message (#2639) (#2643) (nv-anants)
- `37fca1c` fix: prevent crash looping hello world (#2625) (biswapanda)
commit 27c8a97fc1e88ecdb0bc3a07a7f5bd245cc7ccfb
…2260) Signed-off-by: Biswa Panda
`components/backends/sglang/deploy/README.md` (new file, `@@ -0,0 +1,162 @@`)
# SGLang Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single decode worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separate prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```
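
The skeleton above can be filled in from shell variables with a heredoc, which is handy for a first experiment. This is a minimal sketch, not one of the shipped templates: `render_skeleton` is a hypothetical helper, and it renders only fields shown in this README (a single `SGLangDecodeWorker` with `extraPodSpec.mainContainer`). A working deployment still needs the `Frontend` service, resources, the token secret and model arguments.

```bash
# Hypothetical helper: print a minimal DynamoGraphDeployment skeleton to stdout.
# $1 = deployment name, $2 = runtime image. Only one worker service is rendered.
render_skeleton() {
  local name="$1" image="$2"
  if [ -z "$name" ] || [ -z "$image" ]; then
    echo "usage: render_skeleton <deployment-name> <image>" >&2
    return 2
  fi
  # Unquoted EOF so ${name} and ${image} are expanded; leading spaces are kept as YAML indentation.
  cat <<EOF
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: ${name}
spec:
  services:
    SGLangDecodeWorker:
      extraPodSpec:
        mainContainer:
          image: ${image}
          workingDir: /workspace/components/backends/sglang
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
EOF
}
```

For example, `render_skeleton sglang-agg my-registry/sglang-runtime:my-tag > my-deployment.yaml` writes a file you can then extend by hand.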

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/components/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang.worker"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for development/testing
- Use `agg_router.yaml` for production with KV-cache-aware load balancing
- Use `disagg.yaml` for maximum performance
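
The choice above can also be made in a script, for example in CI. A minimal sketch: `select_template` and its labels (`dev`, `kv-routing`, `disagg`) are hypothetical names, not part of Dynamo; the function only maps a label to one of the three template files listed above.

```bash
# Hypothetical helper: map a use-case label to a template file in this directory.
select_template() {
  case "$1" in
    dev)        echo "agg.yaml" ;;         # development/testing
    kv-routing) echo "agg_router.yaml" ;;  # KV-cache-aware routing
    disagg)     echo "disagg.yaml" ;;      # separate prefill and decode workers
    *)
      echo "unknown use case: $1 (expected dev, kv-routing or disagg)" >&2
      return 1
      ;;
  esac
}

# Pick the aggregated template and expose it to later commands.
export DEPLOYMENT_FILE="$(select_template dev)"
```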

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag

# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"
```

### 3. Deploy

Use the following commands to deploy a template.

First, set your target namespace and create a secret for the HuggingFace token.
```bash
export NAMESPACE=<your-k8s-namespace>
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```
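
Running `kubectl create secret` a second time fails because the secret already exists. A common kubectl idiom is to render the Secret client-side and pipe it to `kubectl apply`, which makes the step safe to re-run. A minimal sketch, assuming `NAMESPACE` and `HF_TOKEN` are set in your shell; `create_hf_secret` is a hypothetical wrapper name.

```bash
# Hypothetical wrapper: create or update hf-token-secret idempotently.
create_hf_secret() {
  # Fail early with a clear message instead of a kubectl flag error on an empty -n.
  if [ -z "${NAMESPACE:-}" ] || [ -z "${HF_TOKEN:-}" ]; then
    echo "NAMESPACE and HF_TOKEN must both be set" >&2
    return 1
  fi
  # --dry-run=client renders the Secret locally; apply creates it or updates it in place.
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    -n "${NAMESPACE}" \
    --dry-run=client -o yaml | kubectl apply -n "${NAMESPACE}" -f -
}
```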

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```

### 4. Using a Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, update the image in the deployment file with `yq`:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
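
A typo in the file name or an empty image variable is easy to miss before `kubectl apply`. A minimal sketch that validates the inputs first and then runs the same `yq` expression as above; `render_custom_image` is a hypothetical wrapper, and it assumes the mikefarah `yq` v4 that provides `env()`.

```bash
# Hypothetical wrapper around the yq command above: validate inputs, then render.
render_custom_image() {
  if [ -z "${FRAMEWORK_RUNTIME_IMAGE:-}" ]; then
    echo "FRAMEWORK_RUNTIME_IMAGE is not set" >&2
    return 1
  fi
  if [ ! -f "${DEPLOYMENT_FILE:-}" ]; then
    echo "deployment file not found: ${DEPLOYMENT_FILE:-<unset>}" >&2
    return 1
  fi
  # yq's env() reads the process environment, so the variable must be exported.
  export FRAMEWORK_RUNTIME_IMAGE
  yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' \
    "$DEPLOYMENT_FILE" > "$DEPLOYMENT_FILE.generated"
}
```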

## Model Configuration

All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration. The model itself is selected through the worker `args`, most importantly `--model-path` and `--served-model-name`.

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Liveness probes**: Check process health every 60s
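
To check the frontend from a workstation, one option is to port-forward the frontend Service and poll `/health` until it answers. A minimal sketch: `wait_for_health` is a hypothetical helper, and `<frontend-service>` is a placeholder you must replace with the Service name in your cluster.

```bash
# Hypothetical helper: poll a health URL until curl gets a non-error HTTP response.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 10).
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output; -f makes curl exit non-zero on HTTP 4xx/5xx.
    if curl -sf --max-time 5 "$url" > /dev/null; then
      echo "healthy: $url"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}

# In another terminal: kubectl port-forward svc/<frontend-service> 8000:8000 -n "$NAMESPACE"
# Then: wait_for_health http://localhost:8000/health
```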

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
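
The checks above map to standard `kubectl` commands. A minimal sketch that gathers them in one place; `diagnose_sglang` is a hypothetical helper that only reads cluster state and changes nothing. It does not stop on the first failing command, so you see every result.

```bash
# Hypothetical helper: collect the usual diagnostics for the issues listed above.
# $1 = namespace, $2 = (optional) pod name to inspect in detail.
diagnose_sglang() {
  local ns="$1" pod="$2"
  if [ -z "$ns" ]; then
    echo "usage: diagnose_sglang <namespace> [pod]" >&2
    return 2
  fi
  # Issue 1: does the HuggingFace token secret exist?
  kubectl get secret hf-token-secret -n "$ns"
  # Issues 1-3: pod status, restart counts and node placement.
  kubectl get pods -n "$ns" -o wide
  # Issue 2: GPU capacity and allocation per node (nvidia.com/gpu is the NVIDIA device plugin resource).
  kubectl describe nodes | grep -i "nvidia.com/gpu"
  if [ -n "$pod" ]; then
    # Issues 1, 3, 4: events (image pulls, probe failures, OOMKilled) and recent logs.
    kubectl describe pod "$pod" -n "$ns"
    kubectl logs "$pod" -n "$ns" --all-containers --tail=200
  fi
}
```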

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
Define `$NAMESPACE` before use.
Add `export` to prevent `kubectl` errors when copying commands.

Suggested change:

First, create a secret for the HuggingFace token.
```bash
export NAMESPACE=<your-k8s-namespace>
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

In components/backends/sglang/deploy/README.md around lines 109 to 116, the snippet uses `${NAMESPACE}` but never defines or exports it, which can cause `kubectl` errors when the commands are copied. Update the docs to instruct users to define and export the `NAMESPACE` variable first (e.g., `export NAMESPACE=<your-k8s-namespace>`) before exporting `HF_TOKEN` and running `kubectl create secret`, so the commands work when copied into a shell.