Misc. bug: llama-server SSE error messages should be compatible with the RFC8895 specification #16104

@BenjaminBruenau

Description

Name and Version

./llama-cli --version (llama.cpp compiled from source)
version: 6520 (4067f07)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

CUDA_VISIBLE_DEVICES="1" ~/llama.cpp/llama-server -m ~/dev/llms/Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf \
-c 100 \
--temp 0.7 --top_k 20 --top_p 0.8 --min_p 0.01 \
--port 8083
# doesn't really matter as it is not a model or config specific problem

Problem description & steps to reproduce

I encountered this while working with the OpenAI-compatible chat completion endpoint and stream=true (the /completion endpoint is affected as well): when the request exceeds the context size, the OpenAI Python client library does not raise any API error.
There seems to be a small bug in the server implementation that causes error messages not to be emitted correctly as server-sent events, since the emitted event is not compatible with the RFC8895 spec.
The specification only defines the event, data, id and retry field names, while the implementation uses an error field name.
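
For context, here is a minimal sketch (illustrative only, not code from any real client) of the field handling the SSE spec describes: only the event, data, id and retry field names are recognized, and anything else, such as the error field emitted today, is silently dropped, which is why the message arrives as empty.

# Minimal sketch of SSE field handling as described by the spec
# (illustrative only). Unknown field names are simply discarded.
def parse_sse_frame(frame: str) -> dict:
    fields = {"event": "message", "data": []}
    for line in frame.splitlines():
        if not line or line.startswith(":"):
            continue  # skip blank lines and comments
        name, _, value = line.partition(":")
        value = value.removeprefix(" ")
        if name == "data":
            fields["data"].append(value)
        elif name in ("event", "id", "retry"):
            fields[name] = value
        # any other field name -- including "error" -- is ignored
    fields["data"] = "\n".join(fields["data"])
    return fields

# Frame currently emitted by llama-server: the payload is lost.
parse_sse_frame('error: {"code":400,"message":"..."}')            # -> data == ""

# Proposed frame: the payload survives in the data field.
parse_sse_frame('data: {"error":{"code":400,"message":"..."}}')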

Additional Information

Incorrect (message is ignored and handled as empty):

error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}

data: [DONE]

Correct:

data: {"error":{"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}}

data: [DONE]

When responses are not streamed, everything works as expected and a 400 with the following response body is returned:

{
    "error": {
        "code": 400,
        "message": "the request exceeds the available context size. try increasing the context size or enable context shift",
        "type": "invalid_request_error"
    }
}

In the server implementation the streaming error is currently emitted like this:

server_sent_event(sink, "error", error_data);

When changing this call to server_sent_event(sink, "data", json{{"error", error_data}});, error messages are emitted in accordance with the specification.

Decoding based on the SSE spec then works correctly, e.g. the way it is done in the OpenAI client library (see https://github.com/openai/openai-python/blob/0d85ca08c83a408abf3f03b46189e6bf39f68ac6/src/openai/_streaming.py#L322).
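
With the corrected format even a plain SSE consumer can surface the error without special-casing a non-standard field name. A rough sketch, assuming the server from the command line above is reachable on localhost:8083 and using requests instead of the OpenAI client:

# Rough sketch of a streaming consumer with the proposed format (assumes the
# llama-server instance from the command line above on localhost:8083).
import json
import requests

resp = requests.post(
    "http://localhost:8083/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "a prompt that exceeds -c 100 ..."}],
        "stream": True,
    },
    stream=True,
)

for raw in resp.iter_lines(decode_unicode=True):
    if not raw or not raw.startswith("data: "):
        continue  # with the fix, every frame uses the standard data field
    payload = raw[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    if "error" in chunk:
        # the 400 "exceeds the available context size" error lands here
        raise RuntimeError(chunk["error"]["message"])
    print(chunk["choices"][0]["delta"].get("content", ""), end="")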

I would open a PR for this but I am unsure about the following points:

  • Is this intended in some way? (The old webui used to parse the existing error message format specifically.)
  • Should this be covered by the server tests? (A rough sketch of such a test follows this list.)
  • What about developers who depend on the currently emitted format? The change might break existing error-handling code for streamed responses.
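
A hypothetical pytest sketch of such a check; it talks to a running server directly (started with -c 100 on port 8083 as above) rather than going through the repo's existing server test utilities:

# Hypothetical pytest sketch (does not use the repo's server test helpers).
import json
import requests

def test_stream_error_uses_standard_data_field():
    resp = requests.post(
        "http://localhost:8083/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": "word " * 500}],  # exceeds -c 100
            "stream": True,
        },
        stream=True,
    )
    frames = [line for line in resp.iter_lines(decode_unicode=True) if line]

    # No frame may use the non-standard "error" field name ...
    assert not any(f.startswith("error: ") for f in frames)

    # ... and the error payload must arrive inside a data frame instead.
    errors = []
    for f in frames:
        if not f.startswith("data: ") or f == "data: [DONE]":
            continue
        obj = json.loads(f[len("data: "):])
        if "error" in obj:
            errors.append(obj["error"])
    assert errors and errors[0]["code"] == 400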

I guess this would also make the workaround used here

if (!hasReceivedData && fullResponse.length === 0) {

obsolete, since errors could then be handled more gracefully via the parsed SSE data. @allozaur (Love the new UI btw!)

First Bad Commit

No response

Relevant log output
