server: fix crash when batch > ubatch with embeddings (#12836) #17912
Fixes #12836
Problem
Server crashes with a GGML_ASSERT failure when running embeddings with -b > -ub. Embeddings use non-causal attention, which requires all tokens to be processed in a single ubatch. When n_batch > n_ubatch, the server attempts to split the batch across multiple ubatches, triggering the assertion.
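For context, the assertion lives in llama.cpp's core decode/batch-splitting path; it looks roughly like the sketch below (exact file and wording vary by version, shown only to illustrate the failure mode):

```cpp
// Paraphrased from llama.cpp's batch handling: with non-causal attention
// (embeddings), the entire batch must fit into a single ubatch.
GGML_ASSERT(cparams.n_ubatch >= n_tokens_all &&
            "non-causal attention requires n_ubatch >= n_tokens");
```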
Solution

Add parameter validation in main() after common_params_parse(): if --embedding is enabled and n_batch > n_ubatch, set n_batch = n_ubatch to prevent the crash. Follows @ggerganov's suggested approach in #12836.
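A minimal sketch of what that startup check could look like, assuming the embedding, n_batch, and n_ubatch fields of common_params and the LLAMA_EXAMPLE_SERVER parse path from llama.cpp's common library; the exact code and log message in the PR may differ:

```cpp
// tools/server/server.cpp (sketch) -- validate embedding batch sizes at startup.
#include "common.h"
#include "log.h"

int main(int argc, char ** argv) {
    common_params params;

    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }

    // Non-causal (embedding) attention must see the whole batch in one ubatch,
    // so clamp n_batch down to n_ubatch instead of asserting later in decode.
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("%s: n_batch (%d) > n_ubatch (%d) with --embedding; setting n_batch = n_ubatch\n",
                __func__, params.n_batch, params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }

    // ... remainder of server startup ...

    return 0;
}
```

Clamping at startup, rather than patching the runtime split path as #12940 attempted, establishes the invariant n_batch <= n_ubatch before the context is created, so no decode-time checks are needed.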
Testing
```sh
./llama-server -m model.gguf --embedding -b 2048 -ub 512
```

With the fix, this invocation starts with n_batch clamped to 512 instead of crashing on the assertion.

Note
Supersedes the stalled PR #12940, which attempted a runtime fix in the old examples/server/ location. This implementation validates at startup in tools/server/ (the current location), per maintainer guidance.