
How to Deploy Holo2 (4B, 8B, or 30B-A3B) Locally with vLLM on NVIDIA GPUs

Run a vLLM Server Locally

Requirements

  • An NVIDIA GPU with drivers installed
  • vLLM 0.13 or later is required to use the holo2 reasoning parser

Installation

  1. Install vLLM using the instructions provided by vLLM

Example

After installation, you can launch a vLLM server from the command line:

vllm serve Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=holo2 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'

Deploy via Docker

Requirements

  • Docker with the NVIDIA Container Toolkit installed (needed for --gpus=all)
Example: Run Holo2 4B

docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.13.0 \
    --model Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=holo2 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'

Notes

  • To run Holo2 8B, change --model to Hcompany/Holo2-8B.
  • To run Holo2 30B-A3B, change --model to Hcompany/Holo2-30B-A3B and add --tensor-parallel-size 2.
  • To run Holo2 235B-A22B, change --model to Hcompany/Holo2-235B-A22B and add --tensor-parallel-size 8.
  • There is a known performance degradation in the Qwen3-VL architecture since vLLM v0.11.2. For the best localization performance, install Triton 3.6 in the vLLM image.

Holo2 reasoning parser compatibility

Holo2 models are reasoning models. To extract reasoning content from a response, set --reasoning-parser accordingly when launching vLLM (via Docker or vllm serve).

The Holo2 chat template can enable or disable thinking. By default, Holo2 runs in thinking mode. To configure thinking mode at the request level, pass:

{"chat_template_kwargs": {"thinking": false }}

Invoking Holo2 via API

When vLLM is running, you can send requests to:

http://localhost:8000/v1/chat/completions
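The request body follows the standard OpenAI chat-completions schema that vLLM implements. As a minimal sketch, the payload (including the optional per-request thinking toggle described above) can be built in Python before serializing it for curl or an HTTP client; the helper name is ours, not part of any SDK:

```python
import json

def build_chat_payload(model: str, user_prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server.

    When thinking is False, chat_template_kwargs is added to disable
    Holo2's reasoning mode for this request only.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    if not thinking:
        payload["chat_template_kwargs"] = {"thinking": False}
    return payload

# Serialize for use as the -d body of a curl request
body = json.dumps(
    build_chat_payload("Hcompany/Holo2-4B", "Who won the world series in 2020?")
)
```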

Test with curl: thinking mode

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Test with curl: thinking mode disabled

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ],
        "chat_template_kwargs": {
            "thinking": false
        }
    }'

Test with Python (OpenAI SDK)

  1. Install the OpenAI client:

pip install openai

  2. Example Python script:

from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "Hcompany/Holo2-4B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

# Thinking mode enabled by default
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)

# Without thinking mode
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(chat_completion.choices[0].message.content)
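When --reasoning-parser=holo2 is set, vLLM returns the reasoning trace separately from the final answer. As a sketch, a small helper (our own, not part of the OpenAI SDK) can read both fields defensively; reasoning_content is a vLLM extension to the message object and is absent when thinking is disabled or no parser is configured:

```python
def split_response(message):
    """Return (reasoning, answer) from a chat completion message.

    reasoning_content is populated by vLLM's reasoning parser; fall
    back to None when the field is missing from the message object.
    """
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning, answer
```

Usage: `reasoning, answer = split_response(chat_completion.choices[0].message)`.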

🔐 Note: The API key is not used by vLLM, but required by the OpenAI SDK — use "EMPTY" as a placeholder.

Notes

  • --model can be set to Hcompany/Holo2-4B, Hcompany/Holo2-8B, Hcompany/Holo2-30B-A3B, or Hcompany/Holo2-235B-A22B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo2 is a multimodal model, so you can adjust image limits using --limit-mm-per-prompt.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); otherwise use --dtype float16.
  • Port 8000 must be free; change it with -p <host>:8000 if needed.
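The last point can be verified before launching the server. A minimal sketch (a hypothetical helper, not part of vLLM) that checks whether a host port is free by attempting to bind it:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on (host, port).

    Binding succeeds only when the port is unused; the socket is
    closed immediately, so this is a point-in-time check.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```

If `port_is_free(8000)` returns False, map a different host port, e.g. `-p 8080:8000`.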

Examples

When the endpoint is in service, you can reuse our hosted API examples by replacing the base_url and model fields with the appropriate values.