feat: add sgl deploy readme #2238
Merged
Commits
6addda5 (ishandhanani): feat(deploy): add README.md for SGLang Kubernetes deployment configur…
883e80f (ishandhanani): docs(sglang): update README with deployment examples for Kubernetes a…
c9e5c45 (ishandhanani): docs(deploy/README.md): fix formatting inconsistencies and enhance de…
f3c9619 (ishandhanani): docs(deploy): remove outdated health checks and clarify model configu…
bd8d2d9 (ishandhanani): docs(deploy): update links to guides in README for accurate navigation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| # SGLang Kubernetes Deployment Configurations | ||
|
|
||
| This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource. | ||
|
|
||
| ## Available Deployment Patterns | ||
|
|
||
| ### 1. **Aggregated Deployment** (`agg.yaml`) | ||
| Basic deployment pattern with frontend and a single decode worker. | ||
|
|
||
| **Architecture:** | ||
| - `Frontend`: OpenAI-compatible API server | ||
| - `SGLangDecodeWorker`: Single worker handling both prefill and decode | ||
|
|
||
| ### 2. **Aggregated Router Deployment** (`agg_router.yaml`) | ||
| Enhanced aggregated deployment with KV cache routing capabilities. | ||
|
|
||
| **Architecture:** | ||
| - `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`) | ||
| - `SGLangDecodeWorker`: Single worker handling both prefill and decode | ||
|
|
||
| ### 3. **Disaggregated Deployment** (`disagg.yaml`) | ||
| High-performance deployment with separated prefill and decode workers. | ||
|
|
||
| **Architecture:** | ||
| - `Frontend`: HTTP API server coordinating between workers | ||
| - `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`) | ||
| - `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`) | ||
| - Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`) | ||
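The disaggregated layout above can be sketched as a single `DynamoGraphDeployment` with three services. This is a minimal sketch under stated assumptions, not the contents of `disagg.yaml`: the deployment name is hypothetical, the image is a placeholder, and the worker entry point `dynamo.sglang.worker` is copied from the Container Configuration example in this README, so the real template may use a different module per worker. The Frontend's command and arguments are left out because this README does not show them.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg                  # hypothetical deployment name
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag   # placeholder image
          # Frontend command and arguments omitted (not shown in this README)
    SGLangDecodeWorker:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag   # placeholder image
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"   # assumed entry point; check disagg.yaml
            - "--disaggregation-mode"
            - "decode"                 # decode-only worker
            - "--disaggregation-transfer-backend"
            - "nixl"                   # NIXL transfer backend
    SGLangPrefillWorker:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag   # placeholder image
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"   # assumed entry point; check disagg.yaml
            - "--disaggregation-mode"
            - "prefill"                # prefill-only worker
            - "--disaggregation-transfer-backend"
            - "nixl"                   # must match the decode worker's backend
```

The only differences between the two workers in this sketch are the service name and the `--disaggregation-mode` value; both sides name the same transfer backend.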
|
|
||
| ## CRD Structure | ||
|
|
||
| All templates use the **DynamoGraphDeployment** CRD: | ||
|
|
||
| ```yaml | ||
| apiVersion: nvidia.com/v1alpha1 | ||
| kind: DynamoGraphDeployment | ||
| metadata: | ||
|   name: <deployment-name> | ||
| spec: | ||
|   services: | ||
|     <ServiceName>: | ||
|       # Service configuration | ||
| ``` | ||
|
|
||
| ### Key Configuration Options | ||
|
|
||
| **Resource Management:** | ||
| ```yaml | ||
| resources: | ||
|   requests: | ||
|     cpu: "10" | ||
|     memory: "20Gi" | ||
|     gpu: "1" | ||
|   limits: | ||
|     cpu: "10" | ||
|     memory: "20Gi" | ||
|     gpu: "1" | ||
| ``` | ||
|
|
||
| **Container Configuration:** | ||
| ```yaml | ||
| extraPodSpec: | ||
|   mainContainer: | ||
|     image: my-registry/sglang-runtime:my-tag | ||
|     workingDir: /workspace/components/backends/sglang | ||
|     args: | ||
|       - "python3" | ||
|       - "-m" | ||
|       - "dynamo.sglang.worker" | ||
|       # Model-specific arguments | ||
| ``` | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| Before using these templates, ensure you have: | ||
|
|
||
| 1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md) | ||
| 2. **Kubernetes cluster with GPU support** | ||
| 3. **Container registry access** for SGLang runtime images | ||
| 4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`) | ||
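For the fourth prerequisite, one way to provide the secret is a standard Kubernetes `Secret` manifest. This is a sketch, not something taken from the templates: the secret name comes from this README, but the key name `HF_TOKEN` is an assumption about what the runtime image reads, and the value is a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret                  # must match the name used by envFromSecret
type: Opaque
stringData:
  HF_TOKEN: "<your-huggingface-token>"   # assumed key name; placeholder value
```

Apply it with `kubectl apply -f` in the same namespace as the deployment, since a pod can only reference secrets in its own namespace.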
|
|
||
| ## Usage | ||
|
|
||
| ### 1. Choose Your Template | ||
| Select the deployment pattern that matches your requirements: | ||
| - Use `agg.yaml` for development/testing | ||
| - Use `agg_router.yaml` for production with load balancing | ||
| - Use `disagg.yaml` for maximum performance | ||
|
|
||
| ### 2. Customize Configuration | ||
| Edit the template to match your environment: | ||
|
|
||
| ```yaml | ||
| # Update image registry and tag | ||
| image: your-registry/sglang-runtime:your-tag | ||
|
|
||
| # Configure your model | ||
| args: | ||
| - "--model-path" | ||
| - "your-org/your-model" | ||
| - "--served-model-name" | ||
| - "your-org/your-model" | ||
| ``` | ||
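To show where these fragments might sit once combined, here is a hedged sketch of a single worker service. It only reuses fields that appear elsewhere in this README. Placing `envFromSecret` at the service level is an assumption, and the image and model names are placeholders.

```yaml
spec:
  services:
    SGLangDecodeWorker:
      envFromSecret: hf-token-secret     # assumed to sit at the service level
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-registry/sglang-runtime:your-tag   # placeholder image
          workingDir: /workspace/components/backends/sglang
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
            - "--model-path"
            - "your-org/your-model"      # placeholder model
            - "--served-model-name"
            - "your-org/your-model"      # name clients send in API requests
```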
|
|
||
| ### 3. Deploy | ||
| ```bash | ||
| kubectl apply -f <your-template>.yaml | ||
| ``` | ||
|
|
||
| ## Model Configuration | ||
|
|
||
| All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration option through the worker `args`. | ||
|
|
||
| ## Monitoring and Health | ||
|
|
||
| - **Frontend health endpoint**: `http://<frontend-service>:8000/health` | ||
| - **Liveness probes**: Check process health every 60s | ||
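A 60-second check against the frontend health endpoint could be expressed with standard Kubernetes probe fields, as in the sketch below. The templates may define their probes differently (for example with an `exec` command), and where this block belongs inside a `DynamoGraphDeployment` service is not shown in this README, so treat both the placement and the timing values other than `periodSeconds` as assumptions.

```yaml
livenessProbe:
  httpGet:
    path: /health            # frontend health endpoint from this README
    port: 8000
  initialDelaySeconds: 60    # hypothetical; raise it if model loading is slow
  periodSeconds: 60          # check every 60s
  timeoutSeconds: 5          # hypothetical
  failureThreshold: 3        # hypothetical; restart after 3 consecutive failures
```

Large models can take minutes to load, so an `initialDelaySeconds` that is too short restarts the pod before it ever becomes healthy.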
|
|
||
| ## Further Reading | ||
|
|
||
| - **Deployment Guide**: [Creating Kubernetes Deployments](../../docs/guides/dynamo_deploy/create_deployment.md) | ||
| - **Quickstart**: [Deployment Quickstart](../../docs/guides/dynamo_deploy/quickstart.md) | ||
| - **Platform Setup**: [Dynamo Cloud Installation](../../docs/guides/dynamo_deploy/dynamo_cloud.md) | ||
| - **Examples**: [Deployment Examples](../../docs/examples/README.md) | ||
| - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| Common issues and solutions: | ||
|
|
||
| 1. **Pod fails to start**: Check image registry access and HuggingFace token secret | ||
| 2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits | ||
| 3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds` | ||
| 4. **Out of memory**: Increase memory limits or reduce model batch size | ||
|
|
||
| For additional support, refer to the [deployment troubleshooting guide](../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting). | ||