Description
Name and Version
./llama-cli --version (llama.cpp compiled from source)
version: 6520 (4067f07)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
CUDA_VISIBLE_DEVICES="1" ~/llama.cpp/llama-server -m ~/dev/llms/Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf \
-c 100 \
--temp 0.7 --top_k 20 --top_p 0.8 --min_p 0.01 \
--port 8083
# doesn't really matter as it is not a model or config specific problem
Problem description & steps to reproduce
I encountered this while working with the OpenAI-compatible chat completion endpoint with stream=true (it also affects the /completion endpoint): when a request exceeds the context size, the OpenAI Python client library does not raise any API error.
It seems like there is a small bug in the server implementation which leads to error messages not being emitted correctly via server-sent events, since the emitted event is not compatible with the SSE spec (RFC 8895). The specification only defines the event, data, id and retry field names, while the implementation uses an error field name.
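For context, here is a minimal reproduction sketch using the OpenAI Python client against the server started with the command above (port 8083 comes from that command line; the dummy API key, model name and prompt are placeholder assumptions, the point is only to exceed the -c 100 context window):

# Minimal reproduction sketch (assumes the llama-server instance from the
# command line above is running locally on port 8083; model name, API key
# and prompt are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8083/v1", api_key="sk-no-key-required")

long_prompt = "word " * 500  # far more than the 100-token context window

stream = client.chat.completions.create(
    model="qwen3-4b-instruct",  # llama-server serves a single model; the name is a placeholder
    messages=[{"role": "user", "content": long_prompt}],
    stream=True,
)

# Expected: an API error for the 400 "context size exceeded" condition.
# Observed (per this report): the loop simply ends without chunks or errors,
# because the non-standard "error:" SSE field is dropped by the client's parser.
for chunk in stream:
    print(chunk)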
Additional Information
Incorrect (message is ignored and handled as empty):
error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}
data: [DONE]
Correct:
data: {"error":{"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}}
data: [DONE]
When not streaming responses, everything works as expected and a 400 with the following response body is returned:
{
"error": {
"code": 400,
"message": "the request exceeds the available context size. try increasing the context size or enable context shift",
"type": "invalid_request_error"
}
}
llama.cpp/tools/server/server.cpp, line 4692 (at commit 4b8560a):

server_sent_event(sink, "error", error_data);
When the line above is changed to

server_sent_event(sink, "data", json{{"error", error_data}});

error messages are emitted in accordance with the specification.
Decoding based on the SSE spec then works correctly, e.g. the way it is done in the openai-python client library (see https://github.com/openai/openai-python/blob/0d85ca08c83a408abf3f03b46189e6bf39f68ac6/src/openai/_streaming.py#L322).
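To illustrate why the current format is silently dropped, here is a simplified sketch of spec-style SSE field handling; this is not the actual openai-python code, just the relevant behaviour: field names other than event, data, id and retry are ignored.

# Simplified, spec-style SSE line handling (illustrative only).
def parse_sse_lines(lines):
    data_chunks = []
    for line in lines:
        if not line or line.startswith(":"):
            continue  # blank lines and comments
        field, _, value = line.partition(":")
        value = value.removeprefix(" ")
        if field == "data":
            data_chunks.append(value)
        elif field in ("event", "id", "retry"):
            pass  # recognized fields, not relevant here
        # any other field name (e.g. "error") is ignored per the spec
    return data_chunks

# Current server output: the error payload is lost entirely.
print(parse_sse_lines([
    'error: {"code":400,"message":"...","type":"invalid_request_error"}',
    'data: [DONE]',
]))  # -> ['[DONE]']

# Output after the proposed change: the error arrives as regular data.
print(parse_sse_lines([
    'data: {"error":{"code":400,"message":"...","type":"invalid_request_error"}}',
    'data: [DONE]',
]))  # -> the error JSON string followed by '[DONE]'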
I would open a PR for this but I am unsure about the following points:
- is this maybe intended in any way? (the old webui used to parse the existing error message format specifically)
- should this be something covered in the tests?
- what about developers depending on the currently emitted format? This might break some existing error handling solutions when streaming responses (a rough sketch of the required client-side adjustment is below).
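Regarding the last point, a purely hypothetical consumer that parses the raw stream itself and special-cases the non-standard error: field would need roughly this kind of adjustment (illustrative only; handle_line_old and handle_line_new are made-up helpers, not code from any real client):

import json

def handle_line_old(line: str) -> None:
    # Relies on the current non-standard "error:" field name.
    if line.startswith("error: "):
        raise RuntimeError(json.loads(line[len("error: "):])["message"])

def handle_line_new(line: str) -> None:
    # With the proposed change, errors arrive inside a regular "data:" event.
    if line.startswith("data: ") and line != "data: [DONE]":
        payload = json.loads(line[len("data: "):])
        if "error" in payload:
            raise RuntimeError(payload["error"]["message"])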
I guess this would also make the workaround used here obsolete, since errors could then be handled more gracefully via the parsed SSE data:

if (!hasReceivedData && fullResponse.length === 0) {

@allozaur (Love the new UI btw!)
First Bad Commit
No response