
Conversation

@ggerganov (Member) commented Aug 5, 2025

gpt-oss model support in native MXFP4 format:

  • Compute graph implementation in llama.cpp
  • Attention sinks support in ggml
  • New MXFP4 data type in ggml
  • New ggml_add_id operator in ggml

Usage:

Model collection: https://huggingface.co/collections/ggml-org/gpt-oss-68923b60bee37414546c70bf

Example command:

llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none

# Then, access http://localhost:8080
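
Since llama-server exposes an OpenAI-compatible HTTP API, you can also query it programmatically instead of using the web UI. A minimal sketch, assuming the server above is running at the default http://localhost:8080 and the openai Python package is installed (the model name is just a placeholder; the server uses whatever model it loaded):

from openai import OpenAI

# assumes llama-server is listening on the default port 8080
client = OpenAI(api_key="dummy", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder name; the server serves the model it was started with
    messages=[{"role": "user", "content": "Hello! Give me a one-line summary of MXFP4."}],
)
print(response.choices[0].message.content)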
[Model card image]

References:

Note to maintainers:

This is an initial implementation with pretty much complete support for the CUDA, Vulkan, Metal and CPU backends. The idea is to merge this more quickly than usual, in time for the official release today; later we can work on polishing any potential problems and missing features.

Next PRs:

  • CUDA fattn-mma-f16 sinks (currently only the vec FA kernels are implemented)
  • Vulkan fattn sinks
  • Attention sinks and MXFP4 support for remaining backends
  • Improve chat template handling / tool calling / reasoning effort control / CoT / etc.

ngxson and others added 30 commits July 7, 2025 15:12
* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
@joseph777111

@ggerganov Thank you, ggerganov and everyone else for your expedient and awesome work! Will attention sinks be made available to all GGUF'd models? 🤔

@CHNtentes

Could anyone help explain why -c 0 is used?

@fat-tire (Contributor) commented Aug 6, 2025

@nachoal It appears there's a lot of new stuff here, at least to me, but I have not used OpenAI's API with OpenAI's own models before, only local models.

For example, there are two kinds of system prompts: a "system message" and a "developer message". There are also two types of tools: "builtin_tools" (python or browser tools), referenced in the system message, and function tools, described in the developer message. There is a special format for describing the function tools, but I'm guessing MCP would work too.

The function tools are called in a separate "commentary" channel from normal reply content (and distinct from the "reasoning_content") per the harmony response format.

So different types of output appear in different places in the chat completion. As an example, instead of parsing <think></think> tags directly in the response as with some other models, you would find the reasoning content (in Python) with:

reasoning_text = response.choices[0].message.reasoning_content

where response is a ChatCompletion object that came from the usual OpenAI API call:

response = client.chat.completions.create( ... )
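
Putting those two pieces together, a minimal self-contained sketch (assuming the server is running locally on port 8080, as in the example command at the top of this PR):

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="dummy",  # placeholder model name
    messages=[{"role": "user", "content": "What is 17 * 23? Think step by step."}],
)

# the chain-of-thought lands here instead of in <think></think> tags ...
reasoning_text = response.choices[0].message.reasoning_content
# ... while the final answer (currently empty when the model tries to call a tool) is here
answer_text = response.choices[0].message.content

print("reasoning:", reasoning_text)
print("answer:", answer_text)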

It looks like right now in llama.cpp, by default, when the assistant tries to use a tool, the content string is left empty and it is actually reasoning_content that contains the call, appended at the end of the reasoning text in the following format (this is just a reference MCP fetch tool call):

<|start|>assistant<|channel|>commentary to=fetch json<|message|>{\"url\":\"https://www.github.com\",\"max_length\":5000}

The expected format is supposed to come after the reasoning, like this:

<|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>

So the output looks very close but not exactly right from what I am seeing. It's missing the <|call|> token at the end.

I'm sure that in the near future a tool call will be fully extracted by llama.cpp and put in response.choices[0].message.tool_calls or response.choices[0].message.function_call or wherever it's supposed to go, but as of right now it isn't recognizing the commentary channel at all.

The Harmony docs also describe "Output by the model which can either be a tool call or a message output", so apparently you can get a message OR a tool call, but not both, which is why the content is blank when the model tries to use a tool.

A hacky temporary workaround for this bug, to maintain compatibility with other models, would be to come up with a regular expression that pulls the JSON tool name and arguments/output out of the reasoning_content text and substitutes the resulting JSON as the reply text.

There's a note in this PR that the tool template stuff is a WIP and tool use is still to come, so I guess it may make the most sense to just wait for this to get fixed unless you're really itching to get tools working.

Anyone that knows more please correct me as I'm just figuring this out myself!

@nachoal commented Aug 6, 2025

@fat-tire Appreciate the complete explanation 🙏. I ended up just parsing the strings to try tool calling for now; it's a bit broken, but it works. Thanks!

@nai-kon commented Aug 6, 2025

Has the reasoning_effort not been implemented yet?

I'm hosting gpt-oss-20b on llama-server and calling it via the OpenAI API chat.completions.create().
I'm setting reasoning_effort with create(..., reasoning_effort="low"), but the actual system prompt still says "reasoning: medium" and does not change.
It seems the reasoning_effort option is not working.

Here is a quick sample:

llama-server -m gpt-oss-20b-mxfp4.gguf -c 0 -fa --jinja --reasoning-format none -ngl 1000 (I'm using the b6096 build)

from openai import OpenAI

model = OpenAI(api_key="dummy", base_url="http://127.0.0.1:8080")
completion = model.chat.completions.create(
    model="dummy",
    messages=[{"role": "user", "content": "Write fizzbuzz in Python"}],
    reasoning_effort="high",
)

print(completion)

@fat-tire (Contributor) commented Aug 6, 2025

@nachoal Yup, a simple regex pattern on that reasoning_content string like:

pattern = r".*\<\|start\|\>assistant\<\|channel\|\>commentary to=(?P<toolname>\w+) json\<\|message\|\>(?P<output>.*)"

gets you two match groups, toolname and output, which you can use to reassemble an MCP tool call (or whatever). It's not the best way to do it, but it works most of the time while we wait for the "right way". I've noticed that the toolname can also vary. Sometimes it is just "fetch" (and then the output is the fetch tool arguments), but sometimes it's just "functions" and then the output is the entire tool JSON, including the tool name AND the arguments. I think it may be "function.fetch" at times as well. I guess the system prompt will need to more closely follow the suggested prompt examples for # tool and would ideally go in the "developer message" rather than the "system message", but I'll play around a bit more. Just adding this in case anyone else is doing the same.
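
For anyone wiring this up, here is a rough sketch of applying that pattern; the reasoning_text string is a made-up example modeled on the fetch call shown earlier, and as noted above the captured output is sometimes just the arguments and sometimes the full tool JSON, so real code needs to handle both cases:

import json
import re

pattern = r".*\<\|start\|\>assistant\<\|channel\|\>commentary to=(?P<toolname>\w+) json\<\|message\|\>(?P<output>.*)"

# made-up reasoning_content, modeled on the reference fetch call above
reasoning_text = '...<|start|>assistant<|channel|>commentary to=fetch json<|message|>{"url":"https://www.github.com","max_length":5000}'

m = re.match(pattern, reasoning_text, re.DOTALL)
if m:
    tool_name = m.group("toolname")            # e.g. "fetch"
    arguments = json.loads(m.group("output"))  # e.g. {"url": ..., "max_length": ...}
    # hand tool_name / arguments off to your MCP client (or whatever) here
    print(tool_name, arguments)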

@uazure commented Aug 6, 2025

Using build: llama-b6098-bin-win-cuda-12.4-x64
./llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --reasoning-format none -dev none
It starts, but it responds to any request with a GGGGGGGGGGGGGGGGGGGGGGG... sequence. I see a similar issue mentioned a few comments above; perhaps they have a common root.

@slaren (Member) commented Aug 6, 2025

@uazure what CPU backend is it loading?

@createthis (Contributor) commented Aug 6, 2025

Has the reasoning_effort not been implemented yet?

@nai-kon I noticed the same behavior. I think we should open a defect issue rather than clog this thread further.

@Lyrcaxis mentioned this pull request Aug 6, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025
ngxson added a commit to huggingface/huggingface.js that referenced this pull request Aug 7, 2025
Added in GPT-OSS PR ggml-org/llama.cpp#15091

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
Comment on lines +8013 to +8014
new_name_gate = self.map_tensor_name(name.replace("gate_up_proj_scales", "gate_proj.weight"))
new_name_up = self.map_tensor_name(name.replace("gate_up_proj_scales", "up_proj.weight"))

Collaborator:

Too late, but why was this split? Only adds extra ops on the graph...

Collaborator:

The gate_up tensor is organized so that a row of gate is followed by a row of up, i.e. interleaved. While we could rearrange it into the layout expected by the fused op, I think it's easier to just split it into gate and up independently.
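
For illustration, a rough sketch of what the split amounts to, using NumPy and a made-up tensor, assuming rows strictly alternate gate, up, gate, up, ...:

import numpy as np

# made-up interleaved gate_up matrix: rows alternate gate, up, gate, up, ...
n_ff, n_embd = 8, 4
gate_up = np.arange(2 * n_ff * n_embd).reshape(2 * n_ff, n_embd)

gate = gate_up[0::2]  # even rows -> gate_proj.weight
up   = gate_up[1::2]  # odd rows  -> up_proj.weight

assert gate.shape == up.shape == (n_ff, n_embd)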

Collaborator:

Ahhh, didn't catch that.

Comment on lines +2845 to +2856
struct ggml_tensor * ggml_swiglu_oai(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        float                 alpha,
        float                 limit) {
    struct ggml_tensor * result = ggml_glu_impl(ctx, a, b, GGML_GLU_OP_SWIGLU_OAI, false);
    ggml_set_op_params_f32(result, 2, alpha);
    ggml_set_op_params_f32(result, 3, limit);

    return result;
}

Collaborator:

Technically this is ggml_swiglu_oai_split.

@ericcurtin (Collaborator) commented Aug 10, 2025

Does anybody see value in adding a simple chat client to upstream llama.cpp, in C++ or python3, that we can consolidate on, like this one?

https://github.com/ericcurtin/lm-chat/blob/main/lm-chat.py

For formats like this new Harmony one, it can be hard to find simple reference implementations that are not "from openai_harmony".

I guess there sort of is the HTML client implementation, but I'm not sure how many people are ready to crack that open, as it's more than just a simple CLI chat client.

@JohannesGaessler (Collaborator)

My opinion is that efforts should be focused on the existing web interface of the HTTP server.

@ericcurtin (Collaborator)

My opinion is that efforts should be focused on the existing web interface of the HTTP server.

A couple of issues with the web interface: it's a UI, so there's added complexity for a simple reference implementation, and it's compressed, which kills version control; you can't easily see the changes, it's like a mystery blob that's committed from time to time.

I think it would be better if we committed both:

./tools/server/public/index.html.gz

and

./tools/server/public/index.html

on changes; at least we could track the changes then.

@ngxson (Collaborator) commented Aug 10, 2025

index.html... hmm, good luck decoding the diff of the transpiled JS code

@ericcurtin (Collaborator)

index.html... hmm, good luck decoding the diff of the transpiled JS code

My bad, the true sources are there.

@CISC (Collaborator) commented Aug 14, 2025

@ggerganov This workflow has been stuck in the queue for over a week now and it's impossible to cancel: https://github.com/ggml-org/llama.cpp/actions/runs/16754489544

@ngxson (Collaborator) commented Aug 14, 2025

@CISC it was created when GitHub was down last week; it should be a bug on GitHub's side.

@CISC (Collaborator) commented Aug 14, 2025

Yeah, just wondering if it has anything to do with the abnormally long queue times we've been having since, or if it's something else.

@marvin-0042

Amazing! Thanks for the great work!

Just curious, has anyone done any accuracy tests for gpt-oss-20B on non-Nvidia platforms? Thanks!

@DivyanshScore1910

@marvin-0042 here you go: https://x.com/ggerganov/status/1958238492603089287 (accuracy screenshot in the linked post)

Labels
Apple Metal, Ascend NPU, examples, ggml, Nvidia GPU, OpenCL, python, server, SYCL, testing, Vulkan
Development

Successfully merging this pull request may close these issues.

Feature Request: support gpt-oss?