UPSTREAM PR #17912: server: fix crash when batch > ubatch with embeddings (#12836) #514

loci-dev · 2025-12-10T16:42:51Z

This fixes issue #12836, where the server crashes with a GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch.

The root cause is that embeddings require non-causal attention, which requires all tokens to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing across multiple ubatches, causing an assertion failure:

GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed
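To make the failing condition concrete, here is a minimal sketch of the same check outside the library. The helper names `ubatch_count` and `decode_precondition_holds` are hypothetical and are not part of llama.cpp. Only the boolean expression is taken from the assertion quoted above.

```cpp
#include <cstdint>

// Hypothetical helper (not a llama.cpp function): number of micro-batches
// needed to process n_tokens when one micro-batch holds at most n_ubatch
// tokens. Assumes n_ubatch > 0.
inline uint32_t ubatch_count(uint32_t n_tokens, uint32_t n_ubatch) {
    return (n_tokens + n_ubatch - 1) / n_ubatch; // ceiling division
}

// Hypothetical helper mirroring the expression inside the quoted assertion:
// causal attention may be split across micro-batches, while non-causal
// attention needs every token in a single micro-batch.
inline bool decode_precondition_holds(bool causal_attn, uint32_t n_ubatch, uint32_t n_tokens_all) {
    return causal_attn || n_ubatch >= n_tokens_all;
}
```

For example, with 2048 tokens and n_ubatch = 512, `ubatch_count` yields 4 micro-batches, so `decode_precondition_holds(false, 512, 2048)` is false. That is the situation in which the assertion fires.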

Solution:

- Add parameter validation after common_params_parse()
- When embeddings are enabled and n_batch > n_ubatch:
  - Log warning messages explaining the issue
  - Automatically set n_batch = n_ubatch
  - Prevent server crash

This follows the approach suggested by @ggerganov in the issue.
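The validation described above could look roughly like the following sketch. It is not the actual patch: `params_sketch` and `clamp_batch_for_embeddings` are hypothetical names, the struct models only the three fields discussed here, the default values are illustrative, and `fprintf` stands in for whatever logging the server uses.

```cpp
#include <cstdio>

// Hypothetical stand-in for the parsed server parameters. Only the fields
// relevant to this fix are modeled; the defaults are illustrative.
struct params_sketch {
    bool embedding = false;
    int  n_batch   = 2048;
    int  n_ubatch  = 512;
};

// Hypothetical helper: if embeddings are enabled and n_batch exceeds
// n_ubatch, warn and clamp n_batch down to n_ubatch.
// Returns true when an adjustment was made.
inline bool clamp_batch_for_embeddings(params_sketch & params) {
    if (params.embedding && params.n_batch > params.n_ubatch) {
        std::fprintf(stderr,
            "warning: embeddings require n_batch <= n_ubatch (n_batch = %d, n_ubatch = %d)\n",
            params.n_batch, params.n_ubatch);
        std::fprintf(stderr,
            "warning: setting n_batch = n_ubatch = %d\n", params.n_ubatch);
        params.n_batch = params.n_ubatch;
        return true;
    }
    return false; // valid configuration, nothing to change
}
```

In this sketch the check would run once, right after argument parsing and before the model is loaded, so a configuration such as -b 2048 -ub 512 with --embedding gets corrected before any batch is built.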

Testing:

- Server builds successfully
- Parameter validation occurs before model loading
- Warning messages inform users of the auto-correction
- Server no longer crashes with the problematic configuration

loci-agentic-ai · 2025-12-10T17:28:40Z

Performance Analysis Summary - PR #514

Analysis Overview

This PR introduces parameter validation for embedding configurations to prevent server crashes when n_batch > n_ubatch. The change adds 9 lines of validation logic in tools/server/server.cpp after parameter parsing, automatically adjusting n_batch to match n_ubatch when embeddings are enabled.

Performance Impact

No measurable performance changes were detected. All binaries show 0.0% power consumption variation, with the largest absolute change being 0.40 nJ in build.bin.llama-cvector-generator. Function-level analysis returned no data for response time or throughput changes, which is consistent with the modifications not affecting runtime execution paths.

The validation occurs once at server startup, before model loading and HTTP server initialization. This one-time check introduces negligible overhead and does not impact inference performance. The parameter adjustment (when triggered) ensures embeddings process correctly within architectural constraints rather than causing crashes.

Inference Performance: No functions in the tokenization or inference pipeline (llama_decode, llama_encode, llama_tokenize) were modified. Token throughput remains unchanged as the validation only affects batch parameter configuration for embedding workloads, not the computational kernels or inference logic.

Power Consumption: All 16 analyzed binaries show zero measurable change, confirming no algorithmic modifications to performance-critical paths. The fix is purely defensive validation that enables previously-broken embedding configurations to function correctly.

@ggerganov

Fixes #12836 where the server crashes with GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch.

Root cause: Embeddings use non-causal attention which requires all tokens to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing, causing assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes stalled PR #12940 which attempted a runtime fix in the old examples/server/server.cpp location. This implementation validates at startup in tools/server/server.cpp (current location).

Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model

loci-dev temporarily deployed to PROD__AL_DEMO December 10, 2025 16:42 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 4 times, most recently from 1daebfe to 75a97fd Compare December 10, 2025 23:07

loci-dev force-pushed the upstream-PR17912-branch_yifant-code-fix/embedding-batch-validation branch from aae2567 to 2722844 Compare December 10, 2025 23:35

loci-dev had a problem deploying to PROD__AL_DEMO December 10, 2025 23:35 — with GitHub Actions Failure

loci-dev force-pushed the main branch 4 times, most recently from 78ff3d3 to 117bfc3 Compare December 11, 2025 18:11
