- An NVIDIA GPU with drivers installed
- vLLM v0.13.0 or later is required to use the holo2 reasoning parser
- Install vLLM following the official vLLM installation instructions
After installation, you can launch vLLM from the command line:

```shell
vllm serve Hcompany/Holo2-4B \
  --dtype bfloat16 \
  --max-model-len=65536 \
  --reasoning-parser=holo2 \
  --limit-mm-per-prompt='{"image": 3, "video": 0}'
```
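Model weights can take a while to download and load on first launch. A minimal readiness check (assuming the default port 8000) that polls the server's OpenAI-compatible `/v1/models` endpoint until it answers:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base: str = "http://localhost:8000", timeout_s: float = 300.0) -> bool:
    """Poll vLLM's /v1/models endpoint until the server responds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Any 2xx response means the OpenAI-compatible server is up.
            with urllib.request.urlopen(f"{base}/v1/models", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # server not ready yet; retry
    return False
```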
- An NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit to allow Docker to access your GPU
- Docker installed and running
```shell
docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.13.0 \
  --model Hcompany/Holo2-4B \
  --dtype bfloat16 \
  --max-model-len=65536 \
  --reasoning-parser=holo2 \
  --limit-mm-per-prompt='{"image": 3, "video": 0}'
```
- To run Holo2 8B, change `--model` to `Hcompany/Holo2-8B`.
- To run Holo2 30B A3B or 235B A22B, change `--model` to `Hcompany/Holo2-30B-A3B` or `Hcompany/Holo2-235B-A22B` and add `--tensor-parallel-size 2` or `--tensor-parallel-size 8`, respectively.
- There is a known performance degradation in the Qwen3-VL architecture since vLLM v0.11.2. For the best localization performance, install Triton 3.6 in the vLLM image.
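Putting the scaling notes together, a launch of the 30B A3B variant on two GPUs might look like this (same flags as the 4B example above; assumes two visible GPUs):

```shell
vllm serve Hcompany/Holo2-30B-A3B \
  --dtype bfloat16 \
  --max-model-len=65536 \
  --reasoning-parser=holo2 \
  --tensor-parallel-size 2 \
  --limit-mm-per-prompt='{"image": 3, "video": 0}'
```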
Holo2 models are reasoning models. To extract reasoning content from a response, set `--reasoning-parser` accordingly when launching vLLM (via Docker or `vllm serve`).
The Holo2 chat template can enable or disable thinking. By default, Holo2 runs in thinking mode. To configure thinking mode at the request level, pass:

```json
{"chat_template_kwargs": {"thinking": false}}
```
When vLLM is running, you can send requests to `http://localhost:8000/v1/chat/completions`.
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Hcompany/Holo2-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```
To disable thinking mode for a single request:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Hcompany/Holo2-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "chat_template_kwargs": {
      "thinking": false
    }
  }'
```
- Install the OpenAI client:

```shell
pip install openai
```
- Example Python script:

```python
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "Hcompany/Holo2-4B"

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# Thinking mode enabled by default
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(chat_completion.choices[0].message.content)

# Without thinking mode
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(chat_completion.choices[0].message.content)
```

🔐 Note: The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
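With the holo2 parser enabled, the chain of thought is returned separately from the final answer; in vLLM's reasoning-parser convention it lands in the message's `reasoning_content` field (field name assumed from that convention). A sketch of splitting the two from a raw response dict; the mock response below is illustrative, not real model output:

```python
def split_reasoning(response: dict) -> tuple:
    """Return (reasoning, answer) from a chat-completions response dict.

    Assumes the reasoning parser places the chain of thought in
    `message.reasoning_content`, as vLLM's reasoning parsers do.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message["content"]


# Mock response shaped like a vLLM chat completion (illustrative only):
mock = {
    "choices": [
        {
            "message": {
                "reasoning_content": "The user asks about the 2020 World Series...",
                "content": "The Los Angeles Dodgers won the 2020 World Series.",
            }
        }
    ]
}

reasoning, answer = split_reasoning(mock)
print(answer)
```

With the OpenAI SDK, the same field should be reachable as `chat_completion.choices[0].message.reasoning_content` when the server includes it in the response.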
- `--model` can be set to `Hcompany/Holo2-4B`, `Hcompany/Holo2-8B`, or `Hcompany/Holo2-30B-A3B`.
- `--gpus=all` enables all NVIDIA GPUs for the container.
- Holo2 is a multimodal model, so you can adjust image limits using `--limit-mm-per-prompt`.
- Reduce `--max-model-len` or `--gpu-memory-utilization` if your GPU runs out of memory.
- Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); use float16 otherwise.
- Port 8000 must be free; change it with `-p <host>:8000` if needed.
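Since Holo2 is multimodal, requests can include images via the standard OpenAI content-parts format, up to the `--limit-mm-per-prompt` image limit. A sketch of building such a message; the PNG bytes below are a placeholder, not a real image:

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with one image content part in the OpenAI format."""
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes stand in for a real screenshot.
msg = image_message("Describe this screenshot.", b"\x89PNG placeholder")
```

The resulting dict can be passed directly in the `messages` list of the curl or OpenAI SDK examples above.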
Once the endpoint is in service, you can reuse our hosted API examples by replacing the `base_url` and `model` fields with the proper values.