3 changes: 2 additions & 1 deletion examples/llm/README.md
@@ -225,7 +225,8 @@ dynamo deployment create $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/agg.yaml

### Testing the Deployment

-Once the deployment is complete, you can test it using:
+Once the deployment is complete, you can test it. If you have ingress available for your deployment, you can directly call the URL returned
+by `dynamo deployment get ${DEPLOYMENT_NAME}` and skip the steps to find and forward the frontend pod.

```bash
# Find your frontend pod
1 change: 1 addition & 0 deletions examples/llm/configs/disagg.yaml
@@ -26,6 +26,7 @@ Frontend:
Processor:
router: round-robin
common-configs: [model, block-size]
+prompt-template: "USER: <image>\n<prompt> ASSISTANT:"

VllmWorker:
remote-prefill: true
81 changes: 80 additions & 1 deletion examples/multimodal/README.md
@@ -97,7 +97,7 @@ You should see a response similar to this:
- processor: Tokenizes the prompt and passes it to the decode worker.
- frontend: HTTP endpoint to handle incoming requests.

-### Deployment
+### Local Serving

In this deployment, we have three workers, [encode_worker](components/encode_worker.py), [decode_worker](components/decode_worker.py), and [prefill_worker](components/prefill_worker.py).
For the Llava model, embeddings are only required during the prefill stage. As such, the encode worker is connected directly to the prefill worker.
@@ -158,3 +158,82 @@
```json
{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
```

## Deployment with Dynamo Operator

These multimodal examples can be deployed to a Kubernetes cluster using [Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md) and the Dynamo CLI.

### Prerequisites

You must have first followed the instructions in [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md) to install Dynamo Cloud on your Kubernetes cluster.

**Note**: The `KUBE_NS` variable in the following steps must match the Kubernetes namespace where you installed Dynamo Cloud. You must also expose the `dynamo-store` service externally. This will be the endpoint the CLI uses to interface with Dynamo Cloud.
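
For example, one way to reach `dynamo-store` during development is a plain port-forward. This is a sketch, not the guide's prescribed method, and the service port here is an assumption to verify against your installation:

```bash
# Check the actual service port first with: kubectl get svc dynamo-store -n $KUBE_NS
kubectl port-forward svc/dynamo-store 8080:80 -n $KUBE_NS &
export DYNAMO_CLOUD=http://localhost:8080
```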

### Deployment Steps

For detailed deployment instructions, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md). The following are the specific commands for the multimodal examples:

```bash
# Set your project root directory
export PROJECT_ROOT=$(pwd)

# Configure environment variables (see operator_deployment.md for details)
export KUBE_NS=dynamo-cloud
export DYNAMO_CLOUD=http://localhost:8080 # If using port-forward
# OR
# export DYNAMO_CLOUD=https://dynamo-cloud.nvidia.com # If using Ingress/VirtualService

# Build the Dynamo base image (see operator_deployment.md for details)
export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>

# Build the service
cd $PROJECT_ROOT/examples/multimodal
DYNAMO_TAG=$(dynamo build graphs.agg:Frontend | grep "Successfully built" | awk '{ print $NF }' | sed 's/\.$//')
# For disaggregated serving:
# DYNAMO_TAG=$(dynamo build graphs.disagg:Frontend | grep "Successfully built" | awk '{ print $NF }' | sed 's/\.$//')

# Deploy to Kubernetes
export DEPLOYMENT_NAME=multimodal-agg
# For aggregated serving:
dynamo deploy $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/agg.yaml
# For disaggregated serving:
# export DEPLOYMENT_NAME=multimodal-disagg
# dynamo deploy $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/disagg.yaml
```
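
To confirm the deployment was created, you can use the CLI's `get` subcommand (the same command used below to retrieve the deployment URL):

```bash
dynamo deployment get $DEPLOYMENT_NAME
```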

**Note**: To avoid rate limiting from unauthenticated requests to HuggingFace (HF), you can provide your `HF_TOKEN` as a secret in your deployment. See the [operator deployment guide](../../docs/guides/dynamo_deploy/operator_deployment.md#referencing-secrets-in-your-deployment) for instructions on referencing secrets like `HF_TOKEN` in your deployment configuration.
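
As a sketch, creating such a secret might look like the following; the secret name `hf-token-secret` is illustrative, and the linked guide defines the exact name and reference format your deployment expects:

```bash
# Illustrative secret name; see the operator deployment guide for the
# reference format your deployment configuration uses.
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-hf-token> \
  -n $KUBE_NS
```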

**Note**: Optionally add `--Planner.no-operation=false` at the end of the deployment command to enable the planner component to take scaling actions on your deployment.

### Testing the Deployment

Once the deployment is complete, you can test it. If you have ingress available for your deployment, you can directly call the URL returned
by `dynamo deployment get ${DEPLOYMENT_NAME}` and skip the steps to find and forward the frontend pod.

```bash
# Find your frontend pod
export FRONTEND_POD=$(kubectl get pods -n ${KUBE_NS} | grep "${DEPLOYMENT_NAME}-frontend" | sort -k1 | tail -n1 | awk '{print $1}')

# Forward the pod's port to localhost
kubectl port-forward pod/$FRONTEND_POD 8000:8000 -n ${KUBE_NS}

# Test the API endpoint
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this image?" },
{ "type": "image_url", "image_url": { "url": "http://images.cocodataset.org/test2017/000000155781.jpg" } }
]
}
],
"max_tokens": 300,
"stream": false
}'
```
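
With ingress in place, the same request can be sent straight to the deployment URL instead of a port-forward; a sketch with a hypothetical host:

```bash
# Substitute the URL reported by `dynamo deployment get ${DEPLOYMENT_NAME}`.
export DEPLOYMENT_URL=http://multimodal-agg.example.com
curl $DEPLOYMENT_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          { "type": "image_url", "image_url": { "url": "http://images.cocodataset.org/test2017/000000155781.jpg" } }
        ]
      }
    ],
    "max_tokens": 300,
    "stream": false
  }'
```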

For more details on managing deployments, testing, and troubleshooting, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md).
6 changes: 4 additions & 2 deletions examples/multimodal/components/decode_worker.py
@@ -135,8 +135,10 @@ async def async_init(self):
self.disaggregated_router = None

model = LlavaForConditionalGeneration.from_pretrained(
-    self.engine_args.model
-)
+    self.engine_args.model,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+).eval()
vision_tower = model.vision_tower
self.embedding_size = (
    vision_tower.vision_model.embeddings.position_embedding.num_embeddings
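
For context, here is a standalone sketch of what this initialization computes, assuming the `llava-hf/llava-1.5-7b-hf` checkpoint; its CLIP ViT-L/14-336 vision tower has 577 position embeddings (24x24 image patches plus one class token):

```python
import torch
from transformers import LlavaForConditionalGeneration

# bfloat16 plus device_map="auto" keeps this one-off load cheap enough to fit
# on a single GPU (or spill to CPU); the model is only inspected, never run.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

embedding_size = (
    model.vision_tower.vision_model.embeddings.position_embedding.num_embeddings
)
print(embedding_size)  # 577 for CLIP ViT-L/14 at 336px
```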
4 changes: 2 additions & 2 deletions examples/multimodal/components/prefill_worker.py
@@ -246,8 +246,8 @@ async def generate(self, request: RemotePrefillRequest):
self._loaded_metadata.add(engine_id)

# To make sure the decode worker can pre-allocate the memory with the correct size for the prefill worker to transfer the kv cache,
-# some placeholder dummy tokens were inserted based on the embedding size in the worker.py.
-# The structure of the prompt is "\nUSER: <image> <dummy_tokens>\n<user_prompt>\nASSISTANT:", need to remove the dummy tokens after the image token.
+# some placeholder dummy tokens are inserted based on the embedding size in the worker.py.
+# TODO: make this more flexible/model-dependent
IMAGE_TOKEN_ID = 32000
embedding_size = embeddings.shape[1]
padding_size = embedding_size - 1
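
To make the padding scheme concrete, here is a toy sketch of the removal step the comment describes; the helper name and example token ids are illustrative, not the worker's actual code:

```python
IMAGE_TOKEN_ID = 32000  # <image> token id in the llava-1.5 tokenizer

def strip_dummy_tokens(prompt_ids: list[int], embedding_size: int) -> list[int]:
    # The processor padded the prompt with (embedding_size - 1) dummy tokens
    # after the single <image> token so the decode worker pre-allocates enough
    # KV-cache space; drop them again before running prefill.
    padding_size = embedding_size - 1
    i = prompt_ids.index(IMAGE_TOKEN_ID)
    return prompt_ids[: i + 1] + prompt_ids[i + 1 + padding_size :]

# 3 leading prompt tokens, the image token, 576 dummies, 2 trailing tokens.
ids = [1, 3148, 1001, IMAGE_TOKEN_ID] + [0] * 576 + [319, 1799]
assert strip_dummy_tokens(ids, 577) == [1, 3148, 1001, IMAGE_TOKEN_ID, 319, 1799]
```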
7 changes: 4 additions & 3 deletions examples/multimodal/components/processor.py
@@ -188,11 +188,12 @@ async def _generate_responses(
# The generate endpoint will be used by the frontend to handle incoming requests.
@endpoint()
async def generate(self, raw_request: MultiModalRequest):
+prompt = str(self.engine_args.prompt_template).replace(
+    "<prompt>", raw_request.messages[0].content[0].text
+)
msg = {
    "role": "user",
-    "content": "USER: <image>\nQuestion:"
-    + raw_request.messages[0].content[0].text
-    + " Answer:",
+    "content": prompt,
}

chat_request = ChatCompletionRequest(
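
The substitution itself is a plain string replace; a minimal sketch using the default template from the configs in this PR:

```python
prompt_template = "USER: <image>\n<prompt> ASSISTANT:"
user_text = "What is in this image?"

prompt = prompt_template.replace("<prompt>", user_text)
print(prompt)
# USER: <image>
# What is in this image? ASSISTANT:
```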
5 changes: 3 additions & 2 deletions examples/multimodal/configs/agg.yaml
@@ -19,6 +19,7 @@ Common:

Processor:
router: round-robin
+prompt-template: "USER: <image>\n<prompt> ASSISTANT:"
common-configs: [model, block-size, max-model-len]

VllmDecodeWorker:
@@ -30,7 +31,7 @@ VllmDecodeWorker:
ServiceArgs:
workers: 1
resources:
-gpu: 1
+gpu: '1'
common-configs: [model, block-size, max-model-len]

VllmEncodeWorker:
@@ -39,5 +40,5 @@ VllmEncodeWorker:
ServiceArgs:
workers: 1
resources:
-gpu: 1
+gpu: '1'
common-configs: [model]
7 changes: 4 additions & 3 deletions examples/multimodal/configs/disagg.yaml
@@ -20,6 +20,7 @@ Common:

Processor:
router: round-robin
+prompt-template: "USER: <image>\n<prompt> ASSISTANT:"
common-configs: [model, block-size]

VllmDecodeWorker:
@@ -30,15 +31,15 @@ VllmDecodeWorker:
ServiceArgs:
workers: 1
resources:
-gpu: 1
+gpu: '1'
common-configs: [model, block-size, max-model-len, kv-transfer-config]

VllmPrefillWorker:
max-num-batched-tokens: 16384
ServiceArgs:
workers: 1
resources:
-gpu: 1
+gpu: '1'
common-configs: [model, block-size, max-model-len, kv-transfer-config]

VllmEncodeWorker:
@@ -47,5 +48,5 @@ VllmEncodeWorker:
ServiceArgs:
workers: 1
resources:
-gpu: 1
+gpu: '1'
common-configs: [model]
7 changes: 7 additions & 0 deletions examples/multimodal/utils/vllm.py
@@ -51,6 +51,12 @@ def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
default=3,
help="Maximum queue size for remote prefill. If the prefill queue size is greater than this value, prefill phase of the incoming request will be executed locally.",
)
+parser.add_argument(
+    "--prompt-template",
+    type=str,
+    default="<prompt>",
+    help="Prompt template to use for the model",
+)
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args(vllm_args)
engine_args = AsyncEngineArgs.from_cli_args(args)
@@ -59,4 +65,5 @@ def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
engine_args.conditional_disagg = args.conditional_disagg
engine_args.max_local_prefill_length = args.max_local_prefill_length
engine_args.max_prefill_queue_size = args.max_prefill_queue_size
+engine_args.prompt_template = args.prompt_template
return engine_args
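
Since the flag defaults to `<prompt>`, the substitution becomes an identity when no template is configured and the user text passes through unchanged. A self-contained sketch of the parse-and-attach pattern above, using a standalone parser rather than the module's actual one:

```python
import argparse
from types import SimpleNamespace

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompt-template",
    type=str,
    default="<prompt>",
    help="Prompt template to use for the model",
)
args = parser.parse_args([])  # no flag given: falls back to "<prompt>"

# Mimic parse_vllm_args attaching the extra attribute onto the engine args.
engine_args = SimpleNamespace(prompt_template=args.prompt_template)
assert engine_args.prompt_template.replace("<prompt>", "hi") == "hi"
```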