Custom Logit Processors for NVIDIA Nemotron-3-Nano

A vLLM V1 custom logit processor for runtime thinking budget control on reasoning models like nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3.

Overview

This package provides ThinkingBudgetLogitsProcessor, a logit processor that gives dynamic control over the "thinking" phase of reasoning models. When a model is in thinking mode (indicated by <think> tags), the processor can:

Enforce a maximum token budget for the thinking phase
Gracefully truncate thinking with customizable end tokens
Allow per-request budget overrides from the client

This is useful for balancing inference cost against reasoning depth, or enforcing latency constraints in production deployments.

Installation

pip install -e .

Server Setup

Start a vLLM server with the custom logit processor:

./serve_v3.sh

Or manually with environment configuration:

export THINKING_BUDGET_LOGITS_PROCESSOR_ARGS='{"thinking_budget": 500, "thinking_budget_grace_period": 50, "end_token_ids": [2259, 74045, 1062], "prompt_think_ids": [198, 27, 27963, 397], "end_think_ids": [[524, 27963, 397]]}'

vllm serve nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3 \
    --port 8881 \
    --trust-remote-code \
    --logits-processors custom_logit_processors.v1.ThinkingBudgetLogitsProcessor

Configuration Options

| Parameter | Description | Default |
|---|---|---|
| `thinking_budget` | Max tokens before attempting to end thinking. Set to `-1` for unlimited. | `-1` |
| `thinking_budget_grace_period` | Additional tokens allowed after budget to find a natural breakpoint (newline). | `-1` |
| `end_token_ids` | Token IDs to inject when truncating thinking (e.g., `</think>` tokens). | `[]` |
| `prompt_think_ids` | Token sequence indicating the model is in thinking mode (e.g., `\n<think>\n`). | `[]` |
| `end_think_ids` | Token sequences that indicate thinking has ended naturally. | `[]` |
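Hand-writing the JSON for THINKING_BUDGET_LOGITS_PROCESSOR_ARGS is easy to get wrong (quoting, nested lists). The following is a minimal sketch that builds the value from the parameters in the table and prints a shell `export` line. The helper name `build_processor_args` is hypothetical and not part of this package.

```python
import json
import shlex
from typing import List, Optional


def build_processor_args(
    thinking_budget: int = -1,
    thinking_budget_grace_period: int = -1,
    end_token_ids: Optional[List[int]] = None,
    prompt_think_ids: Optional[List[int]] = None,
    end_think_ids: Optional[List[List[int]]] = None,
) -> str:
    """Serialize the documented parameters (with their documented defaults) to JSON."""
    return json.dumps(
        {
            "thinking_budget": thinking_budget,
            "thinking_budget_grace_period": thinking_budget_grace_period,
            "end_token_ids": end_token_ids or [],
            "prompt_think_ids": prompt_think_ids or [],
            "end_think_ids": end_think_ids or [],
        }
    )


payload = build_processor_args(
    thinking_budget=500,
    thinking_budget_grace_period=50,
    end_token_ids=[2259, 74045, 1062],
    prompt_think_ids=[198, 27, 27963, 397],
    end_think_ids=[[524, 27963, 397]],
)

# shlex.quote wraps the JSON in single quotes so it survives the shell intact.
print(f"export THINKING_BUDGET_LOGITS_PROCESSOR_ARGS={shlex.quote(payload)}")
```

The printed line can be pasted into the shell before running `vllm serve`.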

Recommended Values for Nemotron-Nano-v3

For nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3, we recommend the following token ID configurations:

{
  "end_token_ids": [2259, 74045, 1062],
  "prompt_think_ids": [198, 27, 27963, 397],
  "end_think_ids": [[524, 27963, 397]]
}

These correspond to:

end_token_ids: </think> (injected when truncating)
prompt_think_ids: \n<think>\n (detects thinking mode)
end_think_ids: </think>\n (natural thinking termination)
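Token IDs are tokenizer-specific, so it can be worth decoding the configured sequences before relying on them. The sketch below assumes a tokenizer object with a `decode(ids)` method, such as one loaded with Hugging Face `transformers`. The helper `describe_token_ids` is a hypothetical name, not part of this package.

```python
RECOMMENDED = {
    "end_token_ids": [2259, 74045, 1062],
    "prompt_think_ids": [198, 27, 27963, 397],
    "end_think_ids": [[524, 27963, 397]],
}


def describe_token_ids(tokenizer, config):
    """Decode every configured ID sequence so it can be compared with the expected text."""
    decoded = {}
    for name, ids in config.items():
        # end_think_ids is a list of sequences; the other fields are flat lists.
        sequences = ids if ids and isinstance(ids[0], list) else [ids]
        decoded[name] = [tokenizer.decode(seq) for seq in sequences]
    return decoded


# Usage with the real tokenizer (downloads the model's tokenizer files):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained(
#     "nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3", trust_remote_code=True
# )
# for name, texts in describe_token_ids(tok, RECOMMENDED).items():
#     print(name, [repr(t) for t in texts])
```

Printing with `repr` makes leading and trailing newlines visible, which matters for sequences such as \n<think>\n.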

Client Usage

Default Behavior (Server-Side Budget)

When no per-request overrides are provided, the server uses the parameters configured in the THINKING_BUDGET_LOGITS_PROCESSOR_ARGS environment variable at startup.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8881/v1",
    api_key="EMPTY"
)

result = client.chat.completions.create(
    model="model",  # must match the model name the server registers (for example via --served-model-name)
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 5.9 plus 6.1?"}
    ],
    temperature=1.0,
    max_tokens=12200,
)

# Parse thinking and answer
thinking_part, delim, answer_part = result.choices[0].message.content.partition("</think>")
print("Thinking:", thinking_part + delim)
print("Answer:", answer_part)

Custom Per-Request Budget

Override the server defaults for individual requests:

import json

# Reuses the `client` object created in the previous example.
# Custom truncation message tokens: "Reached thinking limit set by client\n\n</think>"
custom_think_truncation = [1871, 5565, 11483, 6139, 2016, 1536, 6934, 1338, 13]

result = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Solve this complex problem..."}
    ],
    temperature=1.0,
    max_tokens=12200,
    extra_body={
        "vllm_xargs": {
            "thinking_budget": 100,
            "thinking_budget_grace_period": 20,
            "end_token_ids": json.dumps(custom_think_truncation),
        }
    }
)
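When several call sites need different budgets, the `extra_body` payload can be built in one place. This is a small sketch. The helper name `thinking_extra_body` is hypothetical, and it mirrors the example above by passing `end_token_ids` as a JSON string.

```python
import json
from typing import Optional, Sequence


def thinking_extra_body(
    thinking_budget: int,
    grace_period: Optional[int] = None,
    end_token_ids: Optional[Sequence[int]] = None,
) -> dict:
    """Build the extra_body dict for a per-request thinking budget override."""
    xargs = {"thinking_budget": thinking_budget}
    if grace_period is not None:
        xargs["thinking_budget_grace_period"] = grace_period
    if end_token_ids is not None:
        # The list is serialized with json.dumps, as in the request example above.
        xargs["end_token_ids"] = json.dumps(list(end_token_ids))
    return {"vllm_xargs": xargs}


body = thinking_extra_body(100, grace_period=20, end_token_ids=[2259, 74045, 1062])
print(body)
```

The result can be passed directly as `extra_body=body` to `client.chat.completions.create(...)`. Parameters that are omitted fall back to the server-side configuration.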

How It Works

Detection: When a request arrives, the processor checks if the prompt ends with prompt_think_ids (e.g., \n<think>\n), indicating the model should reason before answering.
Monitoring: As tokens are generated, the processor tracks output length and watches for natural thinking endings (end_think_ids).
Truncation: Once the output length exceeds thinking_budget:
- While within the grace period, the processor waits for a newline token as a natural breakpoint
- At a breakpoint, or when the grace period runs out, it forces the end_token_ids sequence by setting all other logits to -inf
Completion: After injecting the end sequence, normal generation resumes for the answer portion.
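The four steps above can be illustrated with a simplified, pure-Python simulation. This is a sketch of the described behavior, not the code in nano_v3_logit_processors.py. The class `BudgetState`, the functions `forced_token`, `mask_logits`, and `simulate`, the one-element newline set, and the exact boundary conditions (for example, starting the grace period when the output length reaches the budget) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

# Assumption: a one-element stand-in for the processor's NEWLINE_TOKENS set.
NEWLINE_TOKENS = {198}


@dataclass
class BudgetState:
    budget: int              # thinking_budget (-1 disables the limit)
    grace_period: int        # thinking_budget_grace_period
    end_token_ids: list      # sequence to force when truncating
    end_think_ids: list      # sequences that mean thinking ended naturally
    forced_pos: int = 0      # number of end tokens already forced
    forcing: bool = False
    done: bool = False


def forced_token(state, output_ids):
    """Return the token ID to force next, or None to leave the logits alone."""
    if state.done or state.budget < 0 or not state.end_token_ids:
        return None
    if not state.forcing:
        # Monitoring: stop interfering once thinking has ended on its own.
        if any(seq and output_ids[-len(seq):] == seq for seq in state.end_think_ids):
            state.done = True
            return None
        over = len(output_ids) - state.budget
        if over < 0:
            return None
        at_newline = bool(output_ids) and output_ids[-1] in NEWLINE_TOKENS
        if not at_newline and over < max(state.grace_period, 0):
            return None  # inside the grace period, still waiting for a newline
        state.forcing = True
    # Truncation: hand out the end sequence one token per decoding step.
    token_id = state.end_token_ids[state.forced_pos]
    state.forced_pos += 1
    if state.forced_pos == len(state.end_token_ids):
        state.forcing, state.done = False, True  # Completion: resume normal decoding
    return token_id


def mask_logits(logits, token_id):
    """Set every logit except token_id to -inf so only that token can be chosen."""
    return [x if i == token_id else -math.inf for i, x in enumerate(logits)]


def simulate(state, steps, newline_at=None, vocab_size=200):
    """Toy greedy loop: the 'model' prefers token 7, or a newline at one chosen step."""
    output_ids = []
    for step in range(steps):
        logits = [0.0] * vocab_size
        logits[198 if step == newline_at else 7] = 1.0
        forced = forced_token(state, output_ids)
        if forced is not None:
            logits = mask_logits(logits, forced)
        output_ids.append(max(range(vocab_size), key=logits.__getitem__))
    return output_ids


print(simulate(BudgetState(3, 2, [50, 51], [[52]]), steps=8))
print(simulate(BudgetState(3, 2, [50, 51], [[52]]), steps=8, newline_at=3))
```

In the first run no newline appears, so the end sequence (50, 51) is forced only after the grace period is used up. In the second run a newline right after the budget triggers truncation earlier.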

File Structure

budget/
├── custom_logit_processors/
│   └── v1/
│       ├── __init__.py
│       └── nano_v3_logit_processors.py   # ThinkingBudgetLogitsProcessor
├── client.py                              # Example client usage
├── pyproject.toml
├── serve_v3.sh                            # Server launch script
└── README.md

Token ID Reference

The NEWLINE_TOKENS set in the processor contains all token IDs that represent newline characters across the tokenizer vocabulary. These are used to find natural breakpoints when truncating thinking.
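For a different tokenizer, a comparable set could be derived by scanning the vocabulary. The sketch below is an assumption about how such a set might be built, not how NEWLINE_TOKENS was actually produced. It treats a token as a newline token when its decoded text ends with "\n", and `collect_newline_token_ids` is a hypothetical helper name.

```python
def collect_newline_token_ids(tokenizer, vocab_size):
    """Return the IDs whose decoded text ends with a newline character."""
    newline_ids = set()
    for token_id in range(vocab_size):
        # Assumption: "ends with a newline" is the breakpoint criterion.
        if tokenizer.decode([token_id]).endswith("\n"):
            newline_ids.add(token_id)
    return newline_ids


# Usage with a Hugging Face tokenizer (len(tok) gives the vocabulary size):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained(
#     "nvidia/NVIDIA-Nemotron-Nano-31B-A3-v3", trust_remote_code=True
# )
# newline_ids = collect_newline_token_ids(tok, len(tok))
```

A looser criterion, such as any token containing "\n", would give a larger set and earlier breakpoints. Which one suits a deployment depends on how cleanly truncated thinking should end.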

Common end sequences for Nemotron models:

[2259, 74045, 1062] → </think>
[1871, 5565, 11483, 6139, 1046, 2259, 74045, 1062] → Reached thinking limit. </think>

Requirements

vLLM >= 0.10.1 (with V1 engine support)
PyTorch
transformers
