Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17912

This fixes issue #12836, where the server crashes with a GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch.

The root cause is that embeddings use non-causal attention, which requires all tokens of a batch to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing across multiple ubatches, triggering the assertion failure:

    GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all)
        && "non-causal attention requires n_ubatch >= n_tokens") failed

Solution:

  • Add parameter validation after common_params_parse()
  • When embeddings are enabled and n_batch > n_ubatch:
    • Log warning messages explaining the issue
    • Automatically set n_batch = n_ubatch
    • Prevent the server crash (see the sketch below)
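
A minimal sketch of the check, assuming the usual common_params field names (embedding, n_batch, n_ubatch) and the server's LOG_WRN logging macro; the actual diff may differ in wording and placement:

    // after common_params_parse() in tools/server/server.cpp (sketch)
    // non-causal attention (embeddings) needs all tokens in one ubatch,
    // so clamp n_batch down to n_ubatch instead of asserting at decode time
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("embeddings require all tokens to fit in a single ubatch\n");
        LOG_WRN("setting n_batch = n_ubatch = %d\n", params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }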

This follows the approach suggested by @ggerganov in the issue.

Testing:

  • Server builds successfully
  • Parameter validation occurs before model loading
  • Warning messages inform users of the auto-correction
  • Server no longer crashes with the problematic configuration


@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #514

Analysis Overview

This PR introduces parameter validation for embedding configurations to prevent server crashes when n_batch > n_ubatch. The change adds 9 lines of validation logic in tools/server/server.cpp after parameter parsing, automatically adjusting n_batch to match n_ubatch when embeddings are enabled.

Performance Impact

No measurable performance changes detected. All binaries show 0.0% power consumption variation, with the largest absolute change being 0.40 nJ in build.bin.llama-cvector-generator. Function-level analysis returned no data for response time or throughput changes, indicating the modifications do not affect runtime execution paths.

The validation occurs once at server startup, before model loading and HTTP server initialization. This one-time check introduces negligible overhead and does not impact inference performance. The parameter adjustment (when triggered) ensures embeddings process correctly within architectural constraints rather than causing crashes.

Inference Performance: No functions in the tokenization or inference pipeline (llama_decode, llama_encode, llama_tokenize) were modified. Token throughput remains unchanged as the validation only affects batch parameter configuration for embedding workloads, not the computational kernels or inference logic.

Power Consumption: All 16 analyzed binaries show zero measurable change, confirming no algorithmic modifications to performance-critical paths. The fix is purely defensive validation that lets previously broken embedding configurations run correctly.

@loci-dev force-pushed the main branch 4 times, most recently from 1daebfe to 75a97fd on December 10, 2025 at 23:07
Fixes #12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.

Root cause: embeddings use non-causal attention, which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing across multiple ubatches,
causing the assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).

Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model
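
For illustration, a configuration of this shape could trigger the crash
once an embedding request larger than n_ubatch arrived (model path is a
placeholder; -b, -ub, and --embedding are the server's batch-size,
ubatch-size, and embedding flags):

    llama-server -m model.gguf --embedding -b 4096 -ub 512

With the fix, the server instead warns at startup and continues with
n_batch lowered to 512.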
@loci-dev force-pushed the upstream-PR17912-branch_yifant-code-fix/embedding-batch-validation branch from aae2567 to 2722844 on December 10, 2025 at 23:35
@loci-dev force-pushed the main branch 4 times, most recently from 78ff3d3 to 117bfc3 on December 11, 2025 at 18:11