
How to Deploy Holo2 (4B, 8B, or 30B-A3B) Locally with vLLM on NVIDIA GPUs

Run a vLLM Server Locally

Requirements

  • An NVIDIA GPU with drivers installed
  • vLLM 0.13 or later is required to use the holo2 reasoning parser

Installation

  1. Install vLLM using the instructions provided by vLLM

Example

After installation, you can launch a vLLM server from the command line:

vllm serve Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=holo2 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'

Deploy via Docker

Requirements

  • Docker with the NVIDIA Container Toolkit installed (needed for --gpus=all)
Example: Run Holo2 4B

docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.13.0 \
    --model Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=holo2 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'

Notes

  • To run Holo2 8B, change --model to Hcompany/Holo2-8B.
  • To run Holo2 30B-A3B, change --model to Hcompany/Holo2-30B-A3B and add --tensor-parallel-size 2.
  • To run Holo2 235B-A22B, change --model to Hcompany/Holo2-235B-A22B and add --tensor-parallel-size 8.
  • There is a known performance degradation in the Qwen3-VL architecture since vLLM v0.11.2. For the best localization performance, install Triton 3.6 in the vLLM image.

Holo2 reasoning parser compatibility

Holo2 models are reasoning models. To extract reasoning content from a response, set --reasoning-parser accordingly when launching vLLM (via Docker or vllm serve).

The Holo2 chat template can enable or disable thinking. By default, Holo2 runs in thinking mode. To configure thinking mode at the request level, pass:

{"chat_template_kwargs": {"thinking": false }}

Invoking Holo2 via API

When vLLM is running, you can send requests to:

http://localhost:8000/v1/chat/completions
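The request body follows the standard OpenAI chat-completions schema that vLLM implements. As a minimal sketch, the payload (including the optional per-request thinking toggle described above) can be built in Python before serializing it for curl or an HTTP client; the helper name is ours, not part of any SDK:

```python
import json

def build_chat_payload(model: str, user_prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server.

    When thinking is False, chat_template_kwargs is added to disable
    Holo2's reasoning mode for this request only.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    if not thinking:
        payload["chat_template_kwargs"] = {"thinking": False}
    return payload

# Serialize for use as the -d body of a curl request
body = json.dumps(
    build_chat_payload("Hcompany/Holo2-4B", "Who won the world series in 2020?")
)
```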

Test with curl: thinking mode

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Test with curl: thinking mode disabled

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ],
        "chat_template_kwargs": {
            "thinking": false
        }
    }'

Test with Python (OpenAI SDK)

  1. Install the OpenAI client:

pip install openai

  2. Example Python script:

from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "Hcompany/Holo2-4B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

# Thinking mode enabled by default
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)

# Without thinking mode
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(chat_completion.choices[0].message.content)
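When --reasoning-parser=holo2 is set, vLLM returns the reasoning trace separately from the final answer. As a sketch, a small helper (our own, not part of the OpenAI SDK) can read both fields defensively; reasoning_content is a vLLM extension to the message object and is absent when thinking is disabled or no parser is configured:

```python
def split_response(message):
    """Return (reasoning, answer) from a chat completion message.

    reasoning_content is populated by vLLM's reasoning parser; fall
    back to None when the field is missing from the message object.
    """
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning, answer
```

Usage: `reasoning, answer = split_response(chat_completion.choices[0].message)`.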

🔐 Note: The API key is not used by vLLM, but required by the OpenAI SDK — use "EMPTY" as a placeholder.

Notes

  • --model can be set to Hcompany/Holo2-4B, Hcompany/Holo2-8B, Hcompany/Holo2-30B-A3B, or Hcompany/Holo2-235B-A22B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo2 is a multimodal model, so you can adjust image limits using --limit-mm-per-prompt.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); otherwise use --dtype float16.
  • Port 8000 must be free; change it with -p <host>:8000 if needed.
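The last point can be verified before launching the server. A minimal sketch (a hypothetical helper, not part of vLLM) that checks whether a host port is free by attempting to bind it:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on (host, port).

    Binding succeeds only when the port is unused; the socket is
    closed immediately, so this is a point-in-time check.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

If `port_is_free(8000)` returns False, map a different host port, e.g. `-p 8080:8000`.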

Examples

When the endpoint is in service, you can reuse our hosted API examples by replacing the base_url and model fields with the appropriate values.